[SERVER-5639] Atomic Integer Implementations Not Compatible with Very Old Processors
Created: 17/Apr/12  Updated: 03/Oct/12  Resolved: 17/May/12

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.1.1
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Spencer Brody (Inactive)
Assignee: Andy Schwerin
Resolution: Won't Fix
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-7012 Invalid operation at address: 0x819b2... Closed
Related
related to SERVER-7251 Detect and handle non-supported CPU o... Closed
Operating System: ALL
Participants:

 Description   

A report from a user:

I can't start mongod with --directoryperdb option. Last known working version was 2012-04-03, first known with error 2012-04-14
without --directoryperdb works fine

|1.9.3| bin/mongodb/bin> ./mongod --dbpath /home/kfl62/.data/db --directoryperdb
db level locking enabled: 1
Tue Apr 17 19:23:28 
Tue Apr 17 19:23:28 warning: 32-bit servers don't have journaling enabled by default. Please use --journal if you want durability.
Tue Apr 17 19:23:28 
Tue Apr 17 19:23:28 [initandlisten] MongoDB starting : pid=19546 port=27017 dbpath=/home/kfl62/.data/db 32-bit host=kfl62
Tue Apr 17 19:23:28 [initandlisten] 
Tue Apr 17 19:23:28 [initandlisten] ** NOTE: This is a development version (2.1.1-pre-) of MongoDB.
Tue Apr 17 19:23:28 [initandlisten] **       Not recommended for production.
Tue Apr 17 19:23:28 [initandlisten] 
Tue Apr 17 19:23:28 [initandlisten] ** NOTE: when using MongoDB 32 bit, you are limited to about 2 gigabytes of data
Tue Apr 17 19:23:28 [initandlisten] **       see http://blog.mongodb.org/post/137788967/32-bit-limitations
Tue Apr 17 19:23:28 [initandlisten] **       with --journal, the limit is lower
Tue Apr 17 19:23:28 [initandlisten] 
Tue Apr 17 19:23:28 [initandlisten] db version v2.1.1-pre-, pdfile version 4.5
Tue Apr 17 19:23:28 [initandlisten] git version: 458f5d6e356bb2ff18fbfb58d3fa069d8edaf6c7
Tue Apr 17 19:23:28 [initandlisten] build info: Linux domU-12-31-39-01-70-B4 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686 BOOST_LIB_VERSION=1_49
Tue Apr 17 19:23:28 [initandlisten] options: { dbpath: "/home/kfl62/.data/db", directoryperdb: true }
Tue Apr 17 19:23:28 Invalid operation at address: 0x818d283 from thread: initandlisten
 
Tue Apr 17 19:23:28 Got signal: 4 (Illegal instruction).
 
Tue Apr 17 19:23:28 Backtrace:
0x8640bda 0x8161012 0x816168f 0xb772640c 0x818d283 0x838cb03 0x853b134 0x853d688 0x853d805 0x87e1d79 0x87e3c1d 0x857cf84 0x857f3e6 0x8580fbf 0x83d6693 0x87aba3b 0x8572d97 0x8165264 0x8165ec5 0x8166bc5 
 ./mongod(_ZN5mongo15printStackTraceERSo+0x2a) [0x8640bda]
 ./mongod(_ZN5mongo10abruptQuitEi+0x3a2) [0x8161012]
 ./mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x2af) [0x816168f]
 [0xb772640c]
 ./mongod(_ZNK5mongo7BSONObj4copyEv+0x33) [0x818d283]
 ./mongod(_ZN5mongo16MultiPlanScannerC1EPKcRKNS_7BSONObjES5_RKN5boost10shared_ptrIKNS_11ParsedQueryEEES5_NS_12QueryPlanSet18RecordedPlanPolicyES5_S5_+0x433) [0x838cb03]
 ./mongod(_ZN5mongo15CursorGenerator19setMultiPlanScannerEv+0x114) [0x853b134]
 ./mongod(_ZN5mongo15CursorGenerator8generateEv+0x78) [0x853d688]
 ./mongod(_ZN5mongo25NamespaceDetailsTransient9getCursorEPKcRKNS_7BSONObjES5_RKNS_24QueryPlanSelectionPolicyEPbRKN5boost10shared_ptrIKNS_11ParsedQueryEEEPNS_16QueryPlanSummaryE+0x65) [0x853d805]
 ./mongod(_ZN5mongo23queryWithQueryOptimizerERNS_7MessageEiPKcRKNS_7BSONObjERNS_5CurOpES6_S6_RKN5boost10shared_ptrINS_11ParsedQueryEEES6_RKNS_17ShardChunkVersionES1_+0xd9) [0x87e1d79]
 ./mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x7ad) [0x87e3c1d]
 ./mongod() [0x857cf84]
 ./mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x476) [0x857f3e6]
 ./mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0x7f) [0x8580fbf]
 ./mongod(_ZN5mongo14DBClientCursor4initEv+0xc3) [0x83d6693]
 ./mongod(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xbb) [0x87aba3b]
 ./mongod(_ZN5mongo14DBDirectClient5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x77) [0x8572d97]
 ./mongod(_ZN5mongo19clearTmpCollectionsEv+0x234) [0x8165264]
 ./mongod(_ZN5mongo14_initAndListenEi+0x495) [0x8165ec5]
 ./mongod(_ZN5mongo13initAndListenEi+0x25) [0x8166bc5]
 
Logstream::get called in uninitialized state
Tue Apr 17 19:23:28 [initandlisten] ERROR: Client::~Client _context should be null but is not; client:initandlisten
Logstream::get called in uninitialized state
Tue Apr 17 19:23:28 [initandlisten] ERROR: Client::shutdown not called: initandlisten
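
The frames in the backtrace above are mangled C++ symbols. As a reading aid, here is a minimal sketch of demangling them with abi::__cxa_demangle from <cxxabi.h>, assuming a GCC or Clang toolchain. The helper name demangle is hypothetical and not part of the server code.

```cpp
#include <cassert>
#include <cstdlib>    // std::free
#include <string>
#include <cxxabi.h>   // abi::__cxa_demangle (GCC and Clang)

// Hypothetical helper: returns the demangled form of an Itanium-ABI C++
// symbol, or the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with
    // malloc, so the caller must release it with free.
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);
    return result;
}
```

For example, the faulting frame _ZNK5mongo7BSONObj4copyEv demangles to mongo::BSONObj::copy() const, and _ZN5mongo19clearTmpCollectionsEv to mongo::clearTmpCollections().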



 Comments   
Comment by Ian Whalen (Inactive) [ 03/Oct/12 ]

John, I've just created a new ticket (SERVER-7251) so that we can triage and track this request separately.

Comment by John P Masseria [ 03/Oct/12 ]

Given that this defect on older CPUs will not be corrected, shouldn't the MongoDB daemon initialization be modified to detect "non-supported" CPUs, display an appropriate error message, and exit gracefully, rather than letting the software fail later when it executes an illegal instruction?
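
A minimal sketch of what such a startup check might look like, assuming GCC or Clang on x86 and using __get_cpuid from <cpuid.h>. The function names cpuSupportsSse2 and checkCpuOrExplain are hypothetical, and this is not how mongod actually handles the case. SSE2 is reported in CPUID leaf 1, EDX bit 26.

```cpp
#include <cassert>
#include <cstdio>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>   // __get_cpuid, provided by GCC and Clang on x86
#endif

// Hypothetical helper: true if the CPU reports SSE2 support.
// On non-x86 targets the question does not apply, so this returns true.
bool cpuSupportsSse2() {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;  // CPUID leaf 1 is not available
    }
    return (edx & (1u << 26)) != 0;  // CPUID.01H:EDX bit 26 = SSE2
#else
    return true;
#endif
}

// Hypothetical startup check: returns 0 if the CPU is usable, otherwise
// prints an explanation and returns a non-zero exit code for the caller.
int checkCpuOrExplain() {
    if (cpuSupportsSse2()) {
        return 0;
    }
    std::fprintf(stderr,
                 "This build requires a CPU with SSE2 support; refusing to start.\n");
    return 1;
}
```

A caller would run checkCpuOrExplain() before any code path that might execute an SSE2-only instruction, and exit with its return value if it is non-zero.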

Comment by Andy Schwerin [ 17/May/12 ]

Given the age of the processors involved, we're going to hold off on this.

Comment by Andy Schwerin [ 23/Apr/12 ]

This appears to be the problem. I had not considered pre-x64 era 32-bit x86 processors. They indeed lack mfence.

One solution is to use CPUID as the fencing instruction on 32-bit systems. This will make our performance on newer 32-bit x86 machines marginally less than optimal, but the alternatives aren't pretty.
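
A rough sketch of the CPUID idea, assuming GCC or Clang inline assembly. The names fullFenceWithoutMfence and SketchAtomicUInt are hypothetical stand-ins, not the server's actual code, and placing the fence before the store is my assumption. CPUID is a serializing instruction and exists on processors that predate SSE2, which is why it can substitute for mfence at some cost.

```cpp
#include <cassert>

// Hypothetical full barrier that avoids mfence.
inline void fullFenceWithoutMfence() {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned int eax = 0, ebx, ecx = 0, edx;
    // CPUID serializes execution; the "memory" clobber also stops the
    // compiler from reordering memory accesses across this point.
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
#else
    __sync_synchronize();  // GCC/Clang builtin, used on non-x86 targets
#endif
}

// Hypothetical stand-in for an atomic unsigned integer.
struct SketchAtomicUInt {
    volatile unsigned x;

    // Fence first, then store (ordering chosen for this sketch only).
    void set(unsigned newX) {
        fullFenceWithoutMfence();
        x = newX;
    }

    unsigned get() const { return x; }
};
```

Older GCC releases may reject the EBX output operand in 32-bit position-independent builds, so a real implementation would need to save and restore EBX by hand there.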

Another option is to provide a uniprocessor-only implementation of mongo that doesn't use any fence in its atomics at all, but refuses to run on multiprocessor systems. That's a lot of work to get underway, so it is probably not a short-term solution.
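
The "refuses to run on multiprocessor systems" part could be sketched roughly as below, assuming a Unix-like system where sysconf(_SC_NPROCESSORS_ONLN) is available (Linux and macOS provide it). All function names are hypothetical. The sketch covers only the startup guard, not the fence-free atomics themselves.

```cpp
#include <cassert>
#include <cstdio>
#include <unistd.h>  // sysconf

// Hypothetical helper: number of processors currently online,
// treated as 1 if the system cannot report it.
long onlineCpuCount() {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n < 1 ? 1 : n;
}

// Hypothetical policy: a fence-free build is only safe with one processor.
bool mayRunUniprocessorBuild(long cpuCount) {
    return cpuCount == 1;
}

// Hypothetical startup guard: returns 0 if it is safe to continue,
// otherwise prints an explanation and returns a non-zero exit code.
int refuseIfMultiprocessor() {
    long n = onlineCpuCount();
    if (mayRunUniprocessorBuild(n)) {
        return 0;
    }
    std::fprintf(stderr,
                 "Uniprocessor-only build: found %ld processors, refusing to run.\n",
                 n);
    return 1;
}
```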

Comment by Aaron Staple [ 23/Apr/12 ]

My theory is that this stack trace results from incompatibility of the recently updated AtomicUInt::set implementation with the user's older system and processor.

I believe the user saw a startup error with --directoryperdb but none without it because all of their data files were laid out in the --directoryperdb directory structure. Without --directoryperdb, no data files were found, so clearTmpCollections() did not trigger any queries. The user stated that they were unable to use the database after starting up without --directoryperdb, though they did not provide a server stack trace. I therefore believe the problem is not related to the --directoryperdb configuration.

I believe the illegal instruction signal may be coming from the AtomicUInt::set() call made by BSONObj::copy(), which is called from BSONObj::getOwned() when the MultiPlanScanner's constructor calls query.getOwned().

I did a non-definitive investigation of the environment in which the binary the user ran was compiled. On the 32-bit buildbot machine, g++ -v reports:

Using built-in specs.
Target: i386-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --with-cpu=generic --host=i386-redhat-linux
Thread model: posix
gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)

Compiling a test program

#include <iostream>
using namespace std;
 
int main() {
  #if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4)
  cout << "have __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4" << endl;
  #else
  cout << "don't have __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4" << endl;
  #endif
  return 0;
}

results in

[root@…]# g++ -o test test.cpp && ./test
don't have __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4

This suggests to me that binaries built on this machine will implement AtomicUInt::set() using an mfence instruction. A bit of digging suggests that mfence was added to x86 with SSE2. However, the user reports running an Athlon XP 2500+ processor, which seems to support SSE but not SSE2.
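
To illustrate the inference, here is a sketch of the kind of compile-time selection I am assuming, not a copy of the actual source. When __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 is missing on an x86 GCC build, the fallback branch emits mfence directly. SketchAtomicUInt, sketchBarrier, and SKETCH_BARRIER_NAME are hypothetical names.

```cpp
#include <cassert>
#include <cstring>

// Pick a barrier the way the theory above assumes the build does.
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4)
#define SKETCH_BARRIER_NAME "__sync_synchronize"
inline void sketchBarrier() { __sync_synchronize(); }
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define SKETCH_BARRIER_NAME "mfence"
// mfence needs SSE2; a CPU without it raises SIGILL (signal 4) here.
inline void sketchBarrier() { __asm__ __volatile__("mfence" ::: "memory"); }
#else
#error "This sketch assumes GCC or Clang"
#endif

// Hypothetical stand-in for AtomicUInt.
struct SketchAtomicUInt {
    volatile unsigned x;

    void set(unsigned newX) {
        sketchBarrier();
        x = newX;
    }

    unsigned get() const { return x; }
};

// Reports which branch the preprocessor selected for this build.
inline const char* sketchBarrierName() { return SKETCH_BARRIER_NAME; }
```

On a compiler configured like the buildbot's, sketchBarrierName() would be expected to return "mfence"; most current toolchains define the macro and take the first branch instead.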

In addition, the date of the change to the AtomicUInt::set() function is consistent with the behavioral change window reported by the user.

Note that I have not verified that the g++ I ran from the command line is the same version used by scons, that the binary actually uses the mfence instruction in the specified function, or that the operation address in the nightly build the user tested corresponds to an "advanced" instruction. (The __sync_synchronize() implementation of AtomicUInt::set() might also cause problems for this user, but I haven't investigated that case.)
