[SERVER-18838] DB fails to recover creates and drops after system crash Created: 05/Jun/15  Updated: 07/Dec/15  Resolved: 29/Jun/15

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: 3.1.4
Fix Version/s: 3.0.5, 3.1.5

Type: Bug Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: 32powercycle, 32qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File fsm.sh     File mongodb-powertest.tar.gz     File powertest.sh     File pr-default-3.tar.gz    
Issue Links:
Depends
depends on WT-1960 Metadata crash consistency Closed
is depended on by WT-1973 MongoDB changes for WiredTiger 2.7.0 Closed
Related
is related to SERVER-18316 Database with WT engine fails to reco... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Steps To Reproduce:

This scenario was using the power cycle test script, along with an FSM client. The FSM client, jstests/concurrency/fsm_all.js, provides more DB operations, besides the standard CRUD ones, like

  • create and drop databases
  • create and drop collections
  • create and drop indexes
  • rename collection

To reproduce this scenario, perform the following operations:

  1. ./powertest.sh -s <ssh server host access> -l 20
    • i.e., ./powertest.sh -s "-i x.pem user@10.4.1.1" -l 20
    • Note user must have sudo privileges on server
  2. ./fsm.sh <server host>
    • i.e., ./fsm.sh 10.4.1.1
Participants:

 Description   
Issue Status as of Jul 14, 2015

ISSUE SUMMARY
If interrupted by an unexpected server termination (e.g. power loss) certain operations, including database, collection, and index creation and deletion (i.e. drop), as well as collection renames may result in an invalid data set, causing MongoDB to crash immediately after starting. These operations include:

  • database creation
  • database dropping
  • collection creation
  • collection dropping
  • index creation
  • index dropping

USER IMPACT
Affected MongoDB instances may not be able to start, leading to loss of availability of the affected node.

WORKAROUNDS
There are no workarounds for this issue. If a node is affected, users can resync from a healthy member in the replica set.

Users are urged to consider additional measures to prevent unclean shutdowns.

AFFECTED VERSIONS
MongoDB 3.0.0 through 3.0.4.

FIX VERSION
The fix is included in the 3.0.5 production release.

Original description

This recovery failure was found on both EXT4 & XFS file systems:

2015-06-05T10:36:26.053-0400 I CONTROL  [thread1] ***** SERVER RESTARTED *****
2015-06-05T10:36:26.130-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/mongodb-powertest/data/wiredTiger/mongod.lock is not empty.
2015-06-05T10:36:26.130-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-06-05T10:36:26.130-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-06-05T10:36:26.376-0400 E STORAGE  [initandlisten] WiredTiger (2) [1433514986:376579][1528:0x7f5c0368cb80], file:collection-1-8498577547266750689.wt: /home/jonathan/mongodb-powertest/data/wiredTiger/collection-1-8498577547266750689.wt: No such file or directory
2015-06-05T10:36:26.376-0400 E STORAGE  [initandlisten] WiredTiger (2) [1433514986:376641][1528:0x7f5c0368cb80], file:collection-1-8498577547266750689.wt: Operation failed during recovery: No such file or directory
2015-06-05T10:36:26.947-0400 E STORAGE  [initandlisten] WiredTiger (2) [1433514986:947150][1528:0x7f5c0368cb80], file:collection-1-8498577547266750689.wt, WT_SESSION.open_cursor: /home/jonathan/mongodb-powertest/data/wiredTiger/collection-1-8498577547266750689.wt: No such file or directory
2015-06-05T10:36:26.947-0400 E STORAGE  [initandlisten] no cursor for uri: table:collection-1-8498577547266750689
2015-06-05T10:36:26.947-0400 I -        [initandlisten] Invariant failure c src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 1047
2015-06-05T10:36:27.149-0400 I CONTROL  [initandlisten]
 0xfdb0c6 0xf8fb48 0xf76826 0xdd5219 0xdd53cc 0xdd60f3 0xdd640f 0xdcf808 0xd3dab6 0xd42bee 0xdce3fc 0xd06ac6 0x80b3aa 0x811c34 0x7f5c02294ec5 0x809719
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"BDB0C6"},{"b":"400000","o":"B8FB48"},{"b":"400000","o":"B76826"},{"b":"400000","o":"9D5219"},{"b":"400000","o":"9D53CC"},{"b":"400000","o":"9D60F3"},{"b":"400000","o":"9D640F"},{"b":"400000","o":"9CF808"},{"b":"400000","o":"93DAB6"},{"b":"400000","o":"942BEE"},{"b":"400000","o":"9CE3FC"},{"b":"400000","o":"906AC6"},{"b":"400000","o":"40B3AA"},{"b":"400000","o":"411C34"},{"b":"7F5C02273000","o":"21EC5"},{"b":"400000","o":"409719"}],"processInfo":{ "mongodbVersion" : "3.1.4-pre-", "gitVersion" : "ce2b1d11d42de93f7b375f7e6c41fb709f66e969", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.16.0-31-generic", "version" : "#41~14.04.1-Ubuntu SMP Wed Feb 11 19:30:13 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF36DEC000", "elfType" : 3 }, { "b" : "7F5C0327A000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3 }, { "b" : "7F5C03076000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3 }, { "b" : "7F5C02D72000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F5C02A6C000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3 }, { "b" : "7F5C02856000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F5C02638000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3 }, { "b" : "7F5C02273000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3 }, { "b" : "7F5C03482000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xfdb0c6]
 mongod(_ZN5mongo10logContextEPKc+0x138) [0xf8fb48]
 mongod(_ZN5mongo15invariantFailedEPKcS1_j+0x96) [0xf76826]
 mongod(_ZN5mongo21WiredTigerRecordStore8Iterator7_locateERKNS_8RecordIdEb+0x1C9) [0xdd5219]
 mongod(_ZN5mongo21WiredTigerRecordStore8IteratorC2ERKS0_PNS_16OperationContextERKNS_8RecordIdERKNS_20CollectionScanParams9DirectionEb+0x9C) [0xdd53cc]
 mongod(_ZNK5mongo21WiredTigerRecordStore11getIteratorEPNS_16OperationContextERKNS_8RecordIdERKNS_20CollectionScanParams9DirectionE+0x43) [0xdd60f3]
 mongod(_ZN5mongo21WiredTigerRecordStoreC1EPNS_16OperationContextENS_10StringDataES3_bllPNS_28CappedDocumentDeleteCallbackEPNS_20WiredTigerSizeStorerE+0x2CF) [0xdd640f]
 mongod(_ZN5mongo18WiredTigerKVEngine14getRecordStoreEPNS_16OperationContextENS_10StringDataES3_RKNS_17CollectionOptionsE+0x158) [0xdcf808]
 mongod(_ZN5mongo22KVDatabaseCatalogEntry14initCollectionEPNS_16OperationContextERKSsb+0x1D6) [0xd3dab6]
 mongod(_ZN5mongo15KVStorageEngineC1EPNS_8KVEngineERKNS_22KVStorageEngineOptionsE+0x61E) [0xd42bee]
 mongod(+0x9CE3FC) [0xdce3fc]
 mongod(_ZN5mongo20ServiceContextMongoD29initializeGlobalStorageEngineEv+0x5E6) [0xd06ac6]
 mongod(_ZN5mongo13initAndListenEi+0x16A) [0x80b3aa]
 mongod(main+0x134) [0x811c34]
 libc.so.6(__libc_start_main+0xF5) [0x7f5c02294ec5]
 mongod(+0x409719) [0x809719]
-----  END BACKTRACE  -----
2015-06-05T10:36:27.149-0400 I -        [initandlisten]
 
***aborting after invariant() failure

A different failure observed from a repeat of the scenario:

2015-06-05T10:51:00.593-0400 I CONTROL  [thread1] ***** SERVER RESTARTED *****
2015-06-05T10:51:00.672-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/pr-default-3/data/wiredTiger/mongod.lock is not empty.
2015-06-05T10:51:00.672-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-06-05T10:51:00.672-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-06-05T10:51:00.962-0400 I -        [initandlisten] Fatal assertion 28548 NoSuchKey Unable to find metadata for table:collection-2--3916750969192802253
2015-06-05T10:51:00.962-0400 I -        [initandlisten]
 
***aborting after fassert() failure



 Comments   
Comment by Githook User [ 29/Jun/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

Message: SERVER-18838 During drops, don't remove files until the metadata is durable.

(cherry picked from commit b847cccb204b0d621c9e14a07368dd1287a77b87)
Branch: mongodb-3.0
https://github.com/wiredtiger/wiredtiger/commit/9f72da7968f828753276aaf92afbb0c82f75a3b4

Comment by Githook User [ 25/Jun/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

Message: Merge pull request #2029 from wiredtiger/fileops-durability

SERVER-18838 During drops, don't remove files until the metadata is durable
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/b847cccb204b0d621c9e14a07368dd1287a77b87

Comment by Githook User [ 25/Jun/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-18838 During drops, don't remove files until the metadata is stable.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/5e08db8d5c3a05b8f6d66997e3679716f85a23ad

Comment by Githook User [ 25/Jun/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: Revert "SERVER-18838 flush WT metadata during create/drop so we don't see missing files after a crash."

This reverts commit 5c4878f7cfa3cdf13cd11025803e6c3305fc6c60.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/01e822b1ae029636958e03ba89d2f5594243f3ca

Comment by Githook User [ 25/Jun/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-18838 flush WT metadata during create/drop so we don't see missing files after a crash.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/5c4878f7cfa3cdf13cd11025803e6c3305fc6c60

Comment by Jonathan Abrahams [ 18/Jun/15 ]

On Windows, we see this failure:

2015-06-16T11:13:43.540-0700 I CONTROL  [thread1] ***** SERVER RESTARTED *****
2015-06-16T11:13:43.557-0700 I CONTROL  [thread1] Trying to start Windows service 'MongoDB'
2015-06-16T11:13:43.559-0700 I STORAGE  [thread2] Service running
2015-06-16T11:13:43.572-0700 W -        [initandlisten] Detected unclean shutdown - C:\\cygwin64\\home\\Administrator\\pt-wt\\data\\wiredTiger\mongod.lock is not empty.
2015-06-16T11:13:43.577-0700 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-06-16T11:13:43.577-0700 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-06-16T11:13:44.429-0700 E STORAGE  [initandlisten] WiredTiger (-28998) [1434478424:428453][2000:140727990555424], file:collection-919-1796960253972883549.wt: C:\\cygwin64\\home\\Administrator\\pt-wt\\data\\wiredTiger\collection-919-1796960253972883549.wt: The system cannot find the file specified.
2015-06-16T11:13:44.429-0700 E STORAGE  [initandlisten] WiredTiger (-28998) [1434478424:429412][2000:140727990555424], file:collection-919-1796960253972883549.wt: Operation failed during recovery: The system cannot find the file specified.
2015-06-16T11:13:44.697-0700 I -        [initandlisten] Assertion: 28595:-28998: The system cannot find the file specified.
 
2015-06-16T11:13:44.751-0700 I STORAGE  [initandlisten] exception in initAndListen: 28595 -28998: The system cannot find the file specified.
, terminating
2015-06-16T11:13:44.751-0700 I CONTROL  [serviceStopWorker] dbexit:  rc: 49

Comment by Jonathan Abrahams [ 10/Jun/15 ]

Confirmed this is a problem in 3.1.4. Need the fsm.sh client to to stimulate it.

Comment by Michael Cahill (Inactive) [ 05/Jun/15 ]

redbeard0531, we talked recently about how MongoDB guarantees create and drop are transactional in spite of the fact that they aren't in WiredTiger.

In the second failure mode reported here, we're seeing a case where the MongoDB metadata references a table that WiredTiger doesn't know about after restart. That could either be because the collection entry in the catalog was made persistent before the corresponding WiredTiger table was created, OR that the WiredTiger table was dropped before the catalog change was made durable.

Do you have any suggestions for how to track this one down?

Comment by Jonathan Abrahams [ 05/Jun/15 ]

Note - this problem occurred using the 3.1.4-pre nightly.

Generated at Thu Feb 08 03:48:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.