Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-18838

DB fails to recover creates and drops after system crash

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Completed:
    • Steps To Reproduce:
      Hide

      This scenario was using the power cycle test script, along with an FSM client. The FSM client, jstests/concurrency/fsm_all.js, provides more DB operations, besides the standard CRUD ones, like

      • create and drop databases
      • create and drop collections
      • create and drop indexes
      • rename collection

      To reproduce this scenario, perform the following operations:

      1. ./powertest.sh -s <ssh server host access> -l 20
        • i.e., ./powertest.sh -s "-i x.pem user@10.4.1.1" -l 20
        • Note user must have sudo privileges on server
      2. ./fsm.sh <server host>
        • i.e., ./fsm.sh 10.4.1.1
      Show
      This scenario was using the power cycle test script, along with an FSM client. The FSM client, jstests/concurrency/fsm_all.js, provides more DB operations, besides the standard CRUD ones, like create and drop databases create and drop collections create and drop indexes rename collection To reproduce this scenario, perform the following operations: ./powertest.sh -s <ssh server host access> -l 20 i.e., ./powertest.sh -s "-i x.pem user@10.4.1.1" -l 20 Note user must have sudo privileges on server ./fsm.sh <server host> i.e., ./fsm.sh 10.4.1.1

      Description

      Issue Status as of Jul 14, 2015

      ISSUE SUMMARY
      If interrupted by an unexpected server termination (e.g. power loss) certain operations, including database, collection, and index creation and deletion (i.e. drop), as well as collection renames may result in an invalid data set, causing MongoDB to crash immediately after starting. These operations include:

      • database creation
      • database dropping
      • collection creation
      • collection dropping
      • index creation
      • index dropping

      USER IMPACT
      Affected MongoDB instances may not be able to start, leading to loss of availability of the affected node.

      WORKAROUNDS
      There are no workarounds for this issue. If a node is affected, users can resync from a healthy member in the replica set.

      Users are urged to consider additional measures to prevent unclean shutdowns.

      AFFECTED VERSIONS
      MongoDB 3.0.0 through 3.0.4.

      FIX VERSION
      The fix is included in the 3.0.5 production release.

      Original description

      This recovery failure was found on both EXT4 & XFS file systems:

      2015-06-05T10:36:26.053-0400 I CONTROL  [thread1] ***** SERVER RESTARTED *****
      2015-06-05T10:36:26.130-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/mongodb-powertest/data/wiredTiger/mongod.lock is not empty.
      2015-06-05T10:36:26.130-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
      2015-06-05T10:36:26.130-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
      2015-06-05T10:36:26.376-0400 E STORAGE  [initandlisten] WiredTiger (2) [1433514986:376579][1528:0x7f5c0368cb80], file:collection-1-8498577547266750689.wt: /home/jonathan/mongodb-powertest/data/wiredTiger/collection-1-8498577547266750689.wt: No such file or directory
      2015-06-05T10:36:26.376-0400 E STORAGE  [initandlisten] WiredTiger (2) [1433514986:376641][1528:0x7f5c0368cb80], file:collection-1-8498577547266750689.wt: Operation failed during recovery: No such file or directory
      2015-06-05T10:36:26.947-0400 E STORAGE  [initandlisten] WiredTiger (2) [1433514986:947150][1528:0x7f5c0368cb80], file:collection-1-8498577547266750689.wt, WT_SESSION.open_cursor: /home/jonathan/mongodb-powertest/data/wiredTiger/collection-1-8498577547266750689.wt: No such file or directory
      2015-06-05T10:36:26.947-0400 E STORAGE  [initandlisten] no cursor for uri: table:collection-1-8498577547266750689
      2015-06-05T10:36:26.947-0400 I -        [initandlisten] Invariant failure c src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 1047
      2015-06-05T10:36:27.149-0400 I CONTROL  [initandlisten]
       0xfdb0c6 0xf8fb48 0xf76826 0xdd5219 0xdd53cc 0xdd60f3 0xdd640f 0xdcf808 0xd3dab6 0xd42bee 0xdce3fc 0xd06ac6 0x80b3aa 0x811c34 0x7f5c02294ec5 0x809719
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"400000","o":"BDB0C6"},{"b":"400000","o":"B8FB48"},{"b":"400000","o":"B76826"},{"b":"400000","o":"9D5219"},{"b":"400000","o":"9D53CC"},{"b":"400000","o":"9D60F3"},{"b":"400000","o":"9D640F"},{"b":"400000","o":"9CF808"},{"b":"400000","o":"93DAB6"},{"b":"400000","o":"942BEE"},{"b":"400000","o":"9CE3FC"},{"b":"400000","o":"906AC6"},{"b":"400000","o":"40B3AA"},{"b":"400000","o":"411C34"},{"b":"7F5C02273000","o":"21EC5"},{"b":"400000","o":"409719"}],"processInfo":{ "mongodbVersion" : "3.1.4-pre-", "gitVersion" : "ce2b1d11d42de93f7b375f7e6c41fb709f66e969", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.16.0-31-generic", "version" : "#41~14.04.1-Ubuntu SMP Wed Feb 11 19:30:13 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF36DEC000", "elfType" : 3 }, { "b" : "7F5C0327A000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3 }, { "b" : "7F5C03076000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3 }, { "b" : "7F5C02D72000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F5C02A6C000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3 }, { "b" : "7F5C02856000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F5C02638000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3 }, { "b" : "7F5C02273000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3 }, { "b" : "7F5C03482000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xfdb0c6]
       mongod(_ZN5mongo10logContextEPKc+0x138) [0xf8fb48]
       mongod(_ZN5mongo15invariantFailedEPKcS1_j+0x96) [0xf76826]
       mongod(_ZN5mongo21WiredTigerRecordStore8Iterator7_locateERKNS_8RecordIdEb+0x1C9) [0xdd5219]
       mongod(_ZN5mongo21WiredTigerRecordStore8IteratorC2ERKS0_PNS_16OperationContextERKNS_8RecordIdERKNS_20CollectionScanParams9DirectionEb+0x9C) [0xdd53cc]
       mongod(_ZNK5mongo21WiredTigerRecordStore11getIteratorEPNS_16OperationContextERKNS_8RecordIdERKNS_20CollectionScanParams9DirectionE+0x43) [0xdd60f3]
       mongod(_ZN5mongo21WiredTigerRecordStoreC1EPNS_16OperationContextENS_10StringDataES3_bllPNS_28CappedDocumentDeleteCallbackEPNS_20WiredTigerSizeStorerE+0x2CF) [0xdd640f]
       mongod(_ZN5mongo18WiredTigerKVEngine14getRecordStoreEPNS_16OperationContextENS_10StringDataES3_RKNS_17CollectionOptionsE+0x158) [0xdcf808]
       mongod(_ZN5mongo22KVDatabaseCatalogEntry14initCollectionEPNS_16OperationContextERKSsb+0x1D6) [0xd3dab6]
       mongod(_ZN5mongo15KVStorageEngineC1EPNS_8KVEngineERKNS_22KVStorageEngineOptionsE+0x61E) [0xd42bee]
       mongod(+0x9CE3FC) [0xdce3fc]
       mongod(_ZN5mongo20ServiceContextMongoD29initializeGlobalStorageEngineEv+0x5E6) [0xd06ac6]
       mongod(_ZN5mongo13initAndListenEi+0x16A) [0x80b3aa]
       mongod(main+0x134) [0x811c34]
       libc.so.6(__libc_start_main+0xF5) [0x7f5c02294ec5]
       mongod(+0x409719) [0x809719]
      -----  END BACKTRACE  -----
      2015-06-05T10:36:27.149-0400 I -        [initandlisten]
       
      ***aborting after invariant() failure
      

      A different failure observed from a repeat of the scenario:

      2015-06-05T10:51:00.593-0400 I CONTROL  [thread1] ***** SERVER RESTARTED *****
      2015-06-05T10:51:00.672-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/pr-default-3/data/wiredTiger/mongod.lock is not empty.
      2015-06-05T10:51:00.672-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
      2015-06-05T10:51:00.672-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
      2015-06-05T10:51:00.962-0400 I -        [initandlisten] Fatal assertion 28548 NoSuchKey Unable to find metadata for table:collection-2--3916750969192802253
      2015-06-05T10:51:00.962-0400 I -        [initandlisten]
       
      ***aborting after fassert() failure
      

      1. fsm.sh
        0.2 kB
        Jonathan Abrahams
      2. mongodb-powertest.tar.gz
        39.37 MB
        Jonathan Abrahams
      3. powertest.sh
        25 kB
        Jonathan Abrahams
      4. pr-default-3.tar.gz
        42.43 MB
        Jonathan Abrahams

        Issue Links

          Activity

          Hide
          jonathan.abrahams Jonathan Abrahams added a comment -

          Note - this problem occurred using the 3.1.4-pre nightly.

          Show
          jonathan.abrahams Jonathan Abrahams added a comment - Note - this problem occurred using the 3.1.4-pre nightly.
          Hide
          michael.cahill Michael Cahill added a comment -

          Mathias Stearn, we talked recently about how MongoDB guarantees create and drop are transactional in spite of the fact that they aren't in WiredTiger.

          In the second failure mode reported here, we're seeing a case where the MongoDB metadata references a table that WiredTiger doesn't know about after restart. That could either be because the collection entry in the catalog was made persistent before the corresponding WiredTiger table was created, OR that the WiredTiger table was dropped before the catalog change was made durable.

          Do you have any suggestions for how to track this one down?

          Show
          michael.cahill Michael Cahill added a comment - Mathias Stearn , we talked recently about how MongoDB guarantees create and drop are transactional in spite of the fact that they aren't in WiredTiger. In the second failure mode reported here, we're seeing a case where the MongoDB metadata references a table that WiredTiger doesn't know about after restart. That could either be because the collection entry in the catalog was made persistent before the corresponding WiredTiger table was created, OR that the WiredTiger table was dropped before the catalog change was made durable. Do you have any suggestions for how to track this one down?
          Hide
          jonathan.abrahams Jonathan Abrahams added a comment -

          Confirmed this is a problem in 3.1.4. Need the fsm.sh client to to stimulate it.

          Show
          jonathan.abrahams Jonathan Abrahams added a comment - Confirmed this is a problem in 3.1.4. Need the fsm.sh client to to stimulate it.
          Hide
          jonathan.abrahams Jonathan Abrahams added a comment -

          On Windows, we see this failure:

          2015-06-16T11:13:43.540-0700 I CONTROL  [thread1] ***** SERVER RESTARTED *****
          2015-06-16T11:13:43.557-0700 I CONTROL  [thread1] Trying to start Windows service 'MongoDB'
          2015-06-16T11:13:43.559-0700 I STORAGE  [thread2] Service running
          2015-06-16T11:13:43.572-0700 W -        [initandlisten] Detected unclean shutdown - C:\\cygwin64\\home\\Administrator\\pt-wt\\data\\wiredTiger\mongod.lock is not empty.
          2015-06-16T11:13:43.577-0700 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
          2015-06-16T11:13:43.577-0700 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
          2015-06-16T11:13:44.429-0700 E STORAGE  [initandlisten] WiredTiger (-28998) [1434478424:428453][2000:140727990555424], file:collection-919-1796960253972883549.wt: C:\\cygwin64\\home\\Administrator\\pt-wt\\data\\wiredTiger\collection-919-1796960253972883549.wt: The system cannot find the file specified.
          2015-06-16T11:13:44.429-0700 E STORAGE  [initandlisten] WiredTiger (-28998) [1434478424:429412][2000:140727990555424], file:collection-919-1796960253972883549.wt: Operation failed during recovery: The system cannot find the file specified.
          2015-06-16T11:13:44.697-0700 I -        [initandlisten] Assertion: 28595:-28998: The system cannot find the file specified.
           
          2015-06-16T11:13:44.751-0700 I STORAGE  [initandlisten] exception in initAndListen: 28595 -28998: The system cannot find the file specified.
          , terminating
          2015-06-16T11:13:44.751-0700 I CONTROL  [serviceStopWorker] dbexit:  rc: 49
          

          Show
          jonathan.abrahams Jonathan Abrahams added a comment - On Windows, we see this failure: 2015-06-16T11:13:43.540-0700 I CONTROL [thread1] ***** SERVER RESTARTED ***** 2015-06-16T11:13:43.557-0700 I CONTROL [thread1] Trying to start Windows service 'MongoDB' 2015-06-16T11:13:43.559-0700 I STORAGE [thread2] Service running 2015-06-16T11:13:43.572-0700 W - [initandlisten] Detected unclean shutdown - C:\\cygwin64\\home\\Administrator\\pt-wt\\data\\wiredTiger\mongod.lock is not empty. 2015-06-16T11:13:43.577-0700 W STORAGE [initandlisten] Recovering data from the last clean checkpoint. 2015-06-16T11:13:43.577-0700 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0), 2015-06-16T11:13:44.429-0700 E STORAGE [initandlisten] WiredTiger (-28998) [1434478424:428453][2000:140727990555424], file:collection-919-1796960253972883549.wt: C:\\cygwin64\\home\\Administrator\\pt-wt\\data\\wiredTiger\collection-919-1796960253972883549.wt: The system cannot find the file specified. 2015-06-16T11:13:44.429-0700 E STORAGE [initandlisten] WiredTiger (-28998) [1434478424:429412][2000:140727990555424], file:collection-919-1796960253972883549.wt: Operation failed during recovery: The system cannot find the file specified. 2015-06-16T11:13:44.697-0700 I - [initandlisten] Assertion: 28595:-28998: The system cannot find the file specified.   2015-06-16T11:13:44.751-0700 I STORAGE [initandlisten] exception in initAndListen: 28595 -28998: The system cannot find the file specified. , terminating 2015-06-16T11:13:44.751-0700 I CONTROL [serviceStopWorker] dbexit: rc: 49
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

          Message: SERVER-18838 flush WT metadata during create/drop so we don't see missing files after a crash.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/5c4878f7cfa3cdf13cd11025803e6c3305fc6c60

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'} Message: SERVER-18838 flush WT metadata during create/drop so we don't see missing files after a crash. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/5c4878f7cfa3cdf13cd11025803e6c3305fc6c60
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

          Message: Revert "SERVER-18838 flush WT metadata during create/drop so we don't see missing files after a crash."

          This reverts commit 5c4878f7cfa3cdf13cd11025803e6c3305fc6c60.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/01e822b1ae029636958e03ba89d2f5594243f3ca

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'} Message: Revert " SERVER-18838 flush WT metadata during create/drop so we don't see missing files after a crash." This reverts commit 5c4878f7cfa3cdf13cd11025803e6c3305fc6c60. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/01e822b1ae029636958e03ba89d2f5594243f3ca
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

          Message: SERVER-18838 During drops, don't remove files until the metadata is stable.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/5e08db8d5c3a05b8f6d66997e3679716f85a23ad

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'} Message: SERVER-18838 During drops, don't remove files until the metadata is stable. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/5e08db8d5c3a05b8f6d66997e3679716f85a23ad
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

          Message: Merge pull request #2029 from wiredtiger/fileops-durability

          SERVER-18838 During drops, don't remove files until the metadata is durable
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/b847cccb204b0d621c9e14a07368dd1287a77b87

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'} Message: Merge pull request #2029 from wiredtiger/fileops-durability SERVER-18838 During drops, don't remove files until the metadata is durable Branch: develop https://github.com/wiredtiger/wiredtiger/commit/b847cccb204b0d621c9e14a07368dd1287a77b87
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

          Message: SERVER-18838 During drops, don't remove files until the metadata is durable.

          (cherry picked from commit b847cccb204b0d621c9e14a07368dd1287a77b87)
          Branch: mongodb-3.0
          https://github.com/wiredtiger/wiredtiger/commit/9f72da7968f828753276aaf92afbb0c82f75a3b4

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'} Message: SERVER-18838 During drops, don't remove files until the metadata is durable. (cherry picked from commit b847cccb204b0d621c9e14a07368dd1287a77b87) Branch: mongodb-3.0 https://github.com/wiredtiger/wiredtiger/commit/9f72da7968f828753276aaf92afbb0c82f75a3b4

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: