[SERVER-5639] Atomic Integer Implementations Not Compatible with Very Old Processors
Created: 17/Apr/12 | Updated: 03/Oct/12 | Resolved: 17/May/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 2.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Andy Schwerin |
| Resolution: | Won't Fix | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
A report from a user: I can't start mongod with --directoryperdb option. Last known working version was 2012-04-03, first known with error 2012-04-14
|
| Comments |
| Comment by Ian Whalen (Inactive) [ 03/Oct/12 ] | ||||||||||||||||||
|
John, I've just created a new ticket ( | ||||||||||||||||||
| Comment by John P Masseria [ 03/Oct/12 ] | ||||||||||||||||||
|
Given that this defect on older processors will not be corrected, shouldn't the MongoDB daemon initialization be modified to detect "non-supported" CPUs, display an appropriate error message, and exit gracefully, as opposed to allowing the software to fail down the road after executing an illegal instruction? | ||||||||||||||||||
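A startup check of the kind suggested here could be sketched as follows. This is only an illustration, not code from the MongoDB tree: `hostSupportsMfence` and `unsupportedCpuMessage` are hypothetical names, and the detection relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, so it assumes one of those compilers on x86.

```cpp
#include <string>

// Hypothetical helper (not from the MongoDB source): reports whether the host
// CPU can execute mfence, which was introduced together with SSE2.
inline bool hostSupportsMfence() {
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    __builtin_cpu_init();                         // GCC/Clang builtin
    return __builtin_cpu_supports("sse2") != 0;   // mfence requires SSE2
#else
    return true;  // assumption: other targets never emit mfence
#endif
}

// Hypothetical helper: returns the message to log before exiting, or an empty
// string when the CPU is usable.
inline std::string unsupportedCpuMessage(bool supportsMfence) {
    if (supportsMfence) {
        return std::string();
    }
    return "this build requires a CPU with SSE2 (mfence); refusing to start";
}
```

A daemon would call `unsupportedCpuMessage(hostSupportsMfence())` early in initialization and exit with the returned message if it is non-empty. The check only helps if it runs before any code path that executes the unsupported instruction.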
| Comment by Andy Schwerin [ 17/May/12 ] | ||||||||||||||||||
|
Given the age of the processors involved, we're going to hold off on this. | ||||||||||||||||||
| Comment by Andy Schwerin [ 23/Apr/12 ] | ||||||||||||||||||
|
This appears to be the problem. I had not considered pre-x64-era 32-bit x86 processors. They indeed lack mfence. One solution is to use CPUID as the fencing instruction on 32-bit systems. This will make our performance on newer 32-bit x86 machines marginally less than optimal, but the alternatives aren't pretty. Another option is to provide a uniprocessor-only implementation of mongo that doesn't use any fence in its atomics at all, but refuses to run on multiprocessor systems. That's a lot of work to get underway, so it is probably not a short-term solution. | ||||||||||||||||||
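The CPUID approach mentioned above might look roughly like the sketch below. It is not the actual AtomicUInt code: `fenceWithoutMfence` and `SketchAtomicUInt` are hypothetical names, the inline assembly assumes GCC or Clang on x86, and other targets fall back to `std::atomic_thread_fence`. CPUID is a serializing instruction, which is why it can stand in for mfence on processors that predate SSE2.

```cpp
#include <atomic>

// Full fence that avoids mfence on x86 by executing the serializing CPUID
// instruction. The "memory" clobber also stops the compiler from reordering
// memory accesses across the fence.
inline void fenceWithoutMfence() {
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    unsigned eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;  // results are discarded; only the side effect matters
    (void)edx;
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Hypothetical stand-in for AtomicUInt, showing a store followed by a fence.
struct SketchAtomicUInt {
    volatile unsigned value;

    explicit SketchAtomicUInt(unsigned v = 0) : value(v) {}

    void set(unsigned newValue) {
        value = newValue;
        fenceWithoutMfence();
    }

    unsigned get() const { return value; }
};
```

The trade-off matches the comment: CPUID is available on far older processors than mfence, but it costs more than a dedicated fence on CPUs that do support SSE2.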
| Comment by Aaron Staple [ 23/Apr/12 ] | ||||||||||||||||||
|
My theory is that this stack trace results from an incompatibility between the recently updated AtomicUInt::set implementation and the user's older system and processor. I believe the user reported a startup error with --directoryperdb but none without it because all their data files were in the --directoryperdb directory layout, so without --directoryperdb no data files were encountered and clearTmpCollections() did not trigger any queries. The user stated they were unable to use the database after startup in non-directoryperdb mode, though they did not provide a server stack trace. I believe the problem is not related to the --directoryperdb configuration.

I believe the illegal instruction signal may be coming from the AtomicUInt::set() call made by BSONObj::copy(), which is called from BSONObj::getOwned() when the MultiPlanScanner's constructor calls query.getOwned().

I did a non-definitive investigation of the environment in which the binary the user ran was compiled. On the 32-bit buildbot machine, g++ -v reports
Compiling a test program
results in
This suggests to me that binaries coming from this machine will implement AtomicUInt::set() using an mfence instruction. A bit of digging suggests that mfence was added to x86 with SSE2. However, the user reports running an Athlon XP 2500+ processor, which seems to support SSE but not SSE2. In addition, the date of the change to the AtomicUInt::set() function is consistent with the behavioral change window reported by the user. Note that I have not verified that the g++ I ran from the command line is the same version used by scons, nor checked explicitly that the binary uses the mfence instruction in the specified function, nor that the operation address in the nightly build the user tested corresponds to an "advanced" instruction. (The __sync_synchronize() implementation of AtomicUInt::set() might potentially cause this user problems as well, but I haven't investigated that case.) |