Core Server / SERVER-5639

Atomic Integer Implementations Not Compatible with Very Old Processors

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Won't Fix
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: Stability
    • Labels: None
    • Backport: No
    • Operating System: ALL
    • Bug Type: Unknown
    • # Replies: 5
    • Last comment by Customer: false

      Description

      A report from a user:

      I can't start mongod with the --directoryperdb option. The last known working build was 2012-04-03; the first known failing build was 2012-04-14.
      Without --directoryperdb it works fine.

      |1.9.3| bin/mongodb/bin> ./mongod --dbpath /home/kfl62/.data/db --directoryperdb
      db level locking enabled: 1
      Tue Apr 17 19:23:28 
      Tue Apr 17 19:23:28 warning: 32-bit servers don't have journaling enabled by default. Please use --journal if you want durability.
      Tue Apr 17 19:23:28 
      Tue Apr 17 19:23:28 [initandlisten] MongoDB starting : pid=19546 port=27017 dbpath=/home/kfl62/.data/db 32-bit host=kfl62
      Tue Apr 17 19:23:28 [initandlisten] 
      Tue Apr 17 19:23:28 [initandlisten] ** NOTE: This is a development version (2.1.1-pre-) of MongoDB.
      Tue Apr 17 19:23:28 [initandlisten] **       Not recommended for production.
      Tue Apr 17 19:23:28 [initandlisten] 
      Tue Apr 17 19:23:28 [initandlisten] ** NOTE: when using MongoDB 32 bit, you are limited to about 2 gigabytes of data
      Tue Apr 17 19:23:28 [initandlisten] **       see http://blog.mongodb.org/post/137788967/32-bit-limitations
      Tue Apr 17 19:23:28 [initandlisten] **       with --journal, the limit is lower
      Tue Apr 17 19:23:28 [initandlisten] 
      Tue Apr 17 19:23:28 [initandlisten] db version v2.1.1-pre-, pdfile version 4.5
      Tue Apr 17 19:23:28 [initandlisten] git version: 458f5d6e356bb2ff18fbfb58d3fa069d8edaf6c7
      Tue Apr 17 19:23:28 [initandlisten] build info: Linux domU-12-31-39-01-70-B4 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686 BOOST_LIB_VERSION=1_49
      Tue Apr 17 19:23:28 [initandlisten] options: { dbpath: "/home/kfl62/.data/db", directoryperdb: true }
      Tue Apr 17 19:23:28 Invalid operation at address: 0x818d283 from thread: initandlisten
      
      Tue Apr 17 19:23:28 Got signal: 4 (Illegal instruction).
      
      Tue Apr 17 19:23:28 Backtrace:
      0x8640bda 0x8161012 0x816168f 0xb772640c 0x818d283 0x838cb03 0x853b134 0x853d688 0x853d805 0x87e1d79 0x87e3c1d 0x857cf84 0x857f3e6 0x8580fbf 0x83d6693 0x87aba3b 0x8572d97 0x8165264 0x8165ec5 0x8166bc5 
       ./mongod(_ZN5mongo15printStackTraceERSo+0x2a) [0x8640bda]
       ./mongod(_ZN5mongo10abruptQuitEi+0x3a2) [0x8161012]
       ./mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x2af) [0x816168f]
       [0xb772640c]
       ./mongod(_ZNK5mongo7BSONObj4copyEv+0x33) [0x818d283]
       ./mongod(_ZN5mongo16MultiPlanScannerC1EPKcRKNS_7BSONObjES5_RKN5boost10shared_ptrIKNS_11ParsedQueryEEES5_NS_12QueryPlanSet18RecordedPlanPolicyES5_S5_+0x433) [0x838cb03]
       ./mongod(_ZN5mongo15CursorGenerator19setMultiPlanScannerEv+0x114) [0x853b134]
       ./mongod(_ZN5mongo15CursorGenerator8generateEv+0x78) [0x853d688]
       ./mongod(_ZN5mongo25NamespaceDetailsTransient9getCursorEPKcRKNS_7BSONObjES5_RKNS_24QueryPlanSelectionPolicyEPbRKN5boost10shared_ptrIKNS_11ParsedQueryEEEPNS_16QueryPlanSummaryE+0x65) [0x853d805]
       ./mongod(_ZN5mongo23queryWithQueryOptimizerERNS_7MessageEiPKcRKNS_7BSONObjERNS_5CurOpES6_S6_RKN5boost10shared_ptrINS_11ParsedQueryEEES6_RKNS_17ShardChunkVersionES1_+0xd9) [0x87e1d79]
       ./mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x7ad) [0x87e3c1d]
       ./mongod() [0x857cf84]
       ./mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x476) [0x857f3e6]
       ./mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0x7f) [0x8580fbf]
       ./mongod(_ZN5mongo14DBClientCursor4initEv+0xc3) [0x83d6693]
       ./mongod(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xbb) [0x87aba3b]
       ./mongod(_ZN5mongo14DBDirectClient5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x77) [0x8572d97]
       ./mongod(_ZN5mongo19clearTmpCollectionsEv+0x234) [0x8165264]
       ./mongod(_ZN5mongo14_initAndListenEi+0x495) [0x8165ec5]
       ./mongod(_ZN5mongo13initAndListenEi+0x25) [0x8166bc5]
      
      Logstream::get called in uninitialized state
      Tue Apr 17 19:23:28 [initandlisten] ERROR: Client::~Client _context should be null but is not; client:initandlisten
      Logstream::get called in uninitialized state
      Tue Apr 17 19:23:28 [initandlisten] ERROR: Client::shutdown not called: initandlisten
      

          Activity

          Aaron Staple (Inactive)
          added a comment -

          My theory is that this stack trace results from incompatibility of the recently updated AtomicUInt::set implementation with the user's older system and processor.

          I believe the user saw a startup error with --directoryperdb but none without it because all of their data files were laid out in the --directoryperdb directory structure. Without --directoryperdb no data files were found, so clearTmpCollections() did not trigger any queries. The user stated that they were unable to use the database after starting up without --directoryperdb, though they did not provide a server stack trace. I believe the problem is not related to the --directoryperdb configuration.

          I believe the illegal instruction signal may be coming from the AtomicUInt::set() call made by BSONObj::copy(), which is called from BSONObj::getOwned() when the MultiPlanScanner's constructor calls query.getOwned().

          I did a non-definitive investigation of the environment in which the user's binary was compiled. On the 32-bit buildbot machine, g++ -v reports:

          Using built-in specs.
          Target: i386-redhat-linux
          Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --with-cpu=generic --host=i386-redhat-linux
          Thread model: posix
          gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)
          

          Compiling a test program

          #include <iostream>
          using namespace std;
          
          int main() {
            #if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4)
            cout << "have __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4" << endl;
            #else
            cout << "don't have __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4" << endl;
            #endif
            return 0;
          }
          

          results in

          [root@…]# g++ -o test test.cpp && ./test
          don't have __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
          

          This suggests to me that binaries built on this machine will implement AtomicUInt::set() using an mfence instruction. A bit of digging suggests that mfence was added to x86 with SSE2. However, the user reports running an Athlon XP 2500+ processor, which seems to support SSE but not SSE2.
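
The reasoning above turns on whether the CPU advertises SSE2. As a minimal sketch (not MongoDB code), the check could be made at run time by reading CPUID leaf 1, where EDX bit 25 indicates SSE and bit 26 indicates SSE2. The helper names below are hypothetical, and the sketch assumes a GCC or Clang toolchain that ships <cpuid.h>, which the gcc 4.1.2 shown above may not.

```cpp
#include <cstdint>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>  // GCC/Clang helper header (assumed to be available)
#endif

// CPUID leaf 1 reports feature flags in EDX: bit 25 = SSE, bit 26 = SSE2.
constexpr std::uint32_t kSseBit = 1u << 25;
constexpr std::uint32_t kSse2Bit = 1u << 26;

// Hypothetical helpers that decode the EDX feature word.
constexpr bool edxHasSse(std::uint32_t edx) { return (edx & kSseBit) != 0; }
constexpr bool edxHasSse2(std::uint32_t edx) { return (edx & kSse2Bit) != 0; }

// Returns true if the running CPU advertises SSE2 (and therefore mfence).
inline bool cpuHasSse2() {
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;  // CPUID leaf 1 is not available
    }
    return edxHasSse2(edx);
#else
    return false;  // not an x86 CPU, so mfence does not apply
#endif
}
```

On a processor like the one described, edxHasSse() would be expected to report true and edxHasSse2() false for the same EDX value.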

          In addition, the date of the change to the AtomicUInt::set() function is consistent with the behavioral change window reported by the user.

          Note that I have not verified that the g++ I ran from the command line is the same version used by scons, nor checked explicitly that the binary uses the mfence instruction in the specified function, nor that the operation address in the nightly build the user tested corresponds to an "advanced" instruction. (The __sync_synchronize() implementation of AtomicUInt::set() might also cause this user problems, but I haven't investigated that case.)

          Andy Schwerin
          added a comment -

          This appears to be the problem. I had not considered pre-x64 era 32-bit x86 processors. They indeed lack mfence.

          One solution is to use CPUID as the fencing instruction on 32-bit systems. This will make our performance on newer 32-bit x86 machines marginally less than optimal, but the alternatives aren't pretty.
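
A rough sketch of the CPUID idea, assuming GCC-style inline assembly: CPUID is a serializing instruction that exists on x86 processors predating SSE2, so it could stand in for mfence. fullFence() and AtomicUIntSketch are hypothetical names for illustration and are not the actual AtomicUInt code; the order of fence and store is also an assumption.

```cpp
// Hypothetical full memory fence that avoids the mfence instruction.
inline void fullFence() {
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // CPUID serializes execution; the "memory" clobber also stops the
    // compiler from reordering memory accesses across this point.
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
#else
    __sync_synchronize();  // non-x86 fallback using the GCC builtin
#endif
}

// Illustrative stand-in for an atomic unsigned with a fenced set().
struct AtomicUIntSketch {
    volatile unsigned x;

    void set(unsigned newX) {
        fullFence();  // fence first, then store (ordering is an assumption)
        x = newX;
    }
};
```

Older GCC releases may reject the EBX constraint when building 32-bit position-independent code, so a real implementation would likely need to save and restore EBX by hand.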

          Another option is to provide a uniprocessor-only implementation of mongo that doesn't use any fences in its atomics at all but refuses to run on multiprocessor systems. That's a lot of work to get underway, so it's probably not a short-term solution.
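
The "refuses to run on multiprocessor systems" part could be sketched as below. This assumes a platform where sysconf(_SC_NPROCESSORS_ONLN) is available (for example Linux with glibc); both function names are hypothetical.

```cpp
#include <unistd.h>

// Hypothetical policy: a fence-free build may only run when exactly one
// CPU is online. Error values such as -1 are treated as "do not run".
constexpr bool uniprocessorBuildMayRun(long onlineCpus) {
    return onlineCpus == 1;
}

// Number of CPUs currently online, or -1 if the query fails.
inline long onlineCpuCount() {
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

A startup routine would call uniprocessorBuildMayRun(onlineCpuCount()) and exit with an error if it returns false. This only covers the CPU count at startup; CPUs brought online later are outside the scope of this sketch.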

          Andy Schwerin
          added a comment -

          Given the age of the processors involved, we're going to hold off on this.

          John P Masseria
          added a comment -

          Given that this defect on older processors will not be corrected, shouldn't the MongoDB daemon initialization be modified to detect "non-supported" CPUs, display an appropriate error message, and exit gracefully, as opposed to letting the software fail later after executing an illegal instruction?
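
One way such a startup check might look, as a sketch only: it assumes the GCC/Clang builtins __builtin_cpu_init() and __builtin_cpu_supports(), and the function names and message text are hypothetical rather than anything MongoDB ships.

```cpp
#include <cstdio>
#include <cstdlib>

// Returns true if the CPU supports the instructions the binary relies on.
// Here that is assumed to mean SSE2, which provides mfence.
inline bool cpuSupportsRequiredInstructions() {
#if defined(__i386__) || defined(__x86_64__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return true;  // the mfence concern is specific to x86
#endif
}

// Writes a readable error to `out` when the CPU is unsupported and returns
// the exit code the process should use.
inline int checkCpuOrReport(bool supported, std::FILE* out) {
    if (supported) {
        return EXIT_SUCCESS;
    }
    std::fputs("ERROR: this CPU lacks SSE2, which this build requires; "
               "exiting.\n",
               out);
    return EXIT_FAILURE;
}
```

To be useful, the check would have to run before any code path that executes the unsupported instruction, for example at the very start of main(), with a call such as checkCpuOrReport(cpuSupportsRequiredInstructions(), stderr).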

          Ian Whalen
          added a comment -

          John, I've just created a new ticket (SERVER-7251) so that we can triage and track this request separately.


            People

            • Votes: 1
            • Watchers: 7

              Dates

              • Created:
                Updated:
                Resolved:
                Days since reply:
                1 year, 28 weeks ago
                Date of 1st Reply: