[SERVER-20061] Entire data loss after connectivity issues with the storage Created: 20/Aug/15  Updated: 20/Aug/15  Resolved: 20/Aug/15

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: 3.0.2
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Rami stern Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

We've a MongoDB 3.0.2 replica running on Windows.
We had 2 nodes in the replica with security enabled, but had to remove one for maintenance, so the issue happened while the replica had only one node.

We than see the following Mongo log lines:

2015-08-08T06:37:19.115-0400 E STORAGE  [conn37692] WiredTiger (-28936) [1439030239:114410][4392:2004759424], session.commit_transaction: journal/WiredTigerLog.0000000714 write error: failed to write 22528 bytes at offset 23841280: The specified network name is no longer available.
2015-08-08T06:37:19.115-0400 I -        [conn37692] Invariant failure: s->commit_transaction(s, NULL) resulted in status UnknownError -28936: The specified network name is no longer available.
  at src\mongo\db\storage\wiredtiger\wiredtiger_recovery_unit.cpp 290
2015-08-08T06:37:19.381-0400 E STORAGE  [conn37617] WiredTiger (-28936) [1439030239:380436][4392:2004759424], session.commit_transaction: journal/WiredTigerLog.0000000714 write error: failed to write 4352 bytes at offset 23863808: The specified network name is no longer available.
2015-08-08T06:37:19.382-0400 I -        [conn37617] Invariant failure: s->commit_transaction(s, NULL) resulted in status UnknownError -28936: The specified network name is no longer available.
  at src\mongo\db\storage\wiredtiger\wiredtiger_recovery_unit.cpp 290
2015-08-08T06:37:19.488-0400 E STORAGE  [WiredTigerRecordStoreThread for local.oplog.rs] WiredTiger (-28936) [1439030239:488447][4392:2004759424], session.commit_transaction: journal/WiredTigerLog.0000000714 write error: failed to write 128 bytes at offset 23868160: The specified network name is no longer available.
2015-08-08T06:37:19.488-0400 I -        [WiredTigerRecordStoreThread for local.oplog.rs] Invariant failure: s->commit_transaction(s, NULL) resulted in status UnknownError -28936: The specified network name is no longer available.
  at src\mongo\db\storage\wiredtiger\wiredtiger_recovery_unit.cpp 290
2015-08-08T06:37:20.951-0400 I CONTROL  [conn37692] mongod.exe      index_collator_extension+0x141cf3
2015-08-08T06:37:20.951-0400 I CONTROL  [conn37692] mongod.exe      index_collator_extension+0xfaf17
2015-08-08T06:37:20.951-0400 I CONTROL  [conn37692] mongod.exe      index_collator_extension+0xedd21
2015-08-08T06:37:20.951-0400 I CONTROL  [conn37692] mongod.exe      index_collator_extension+0xb4ab
2015-08-08T06:37:20.951-0400 I CONTROL  [conn37692] mongod.exe      index_collator_extension+0xb16b
2015-08-08T06:37:20.951-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x1b5ffe
2015-08-08T06:37:20.951-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x1b9ad3
2015-08-08T06:37:20.951-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x2efe5d
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x2efa9b
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x14c05b
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x1495e2
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x148230
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x149c3f
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x14f251
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x15ef1a
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x1614b3
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x15f380
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x2c8ce3
2015-08-08T06:37:20.952-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x2c9103
2015-08-08T06:37:20.953-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x215e1b
2015-08-08T06:37:20.953-0400 I CONTROL  [conn37692] mongod.exe      mongo::parseNumberFromStringWithBase<unsigned __int64>+0x2109f2
2015-08-08T06:37:20.953-0400 I CONTROL  [conn37692] mongod.exe      index_collator_extension+0x43e568
2015-08-08T06:37:20.953-0400 I CONTROL  [conn37692] mongod.exe      index_collator_extension+0x1071c3
2015-08-08T06:37:20.953-0400 I CONTROL  [conn37692] mongod.exe      index_collator_extension+0x161a91
2015-08-08T06:37:20.953-0400 I CONTROL  [conn37692] MSVCR120.dll    beginthreadex+0x107
2015-08-08T06:37:20.953-0400 I CONTROL  [conn37692] MSVCR120.dll    endthreadex+0x192
2015-08-08T06:37:20.953-0400 I CONTROL  [conn37692] kernel32.dll    BaseThreadInitThunk+0xd
2015-08-08T06:37:20.957-0400 I CONTROL  [conn37692] 
2015-08-08T06:37:20.957-0400 I -        [conn37692] 
 
***aborting after invariant() failure
 
 
2015-08-08T06:38:47.685-0400 I CONTROL  ***** SERVER RESTARTED *****
2015-08-08T06:38:47.795-0400 I CONTROL  Hotfix KB2731284 or later update is not installed, will zero-out data files
2015-08-08T06:38:48.159-0400 I CONTROL  Trying to start Windows service 'MongoDB'
2015-08-08T06:38:48.160-0400 I STORAGE  Service running
2015-08-08T06:38:48.324-0400 W -        [initandlisten] Detected unclean shutdown - \\STORAGE\FOLDER\mongod.lock is not empty.
2015-08-08T06:38:48.326-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-08-08T06:38:48.329-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=4G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-08-08T06:40:05.322-0400 E STORAGE  [initandlisten] WiredTiger (-28998) [1439030405:322774][4460:2004759424], connection: \\STORAGE\FOLDER\journal/WiredTigerLog.0000000715: The system cannot find the file specified.
  a\journal
2015-08-08T06:40:05.331-0400 E STORAGE  [initandlisten] WiredTiger (-28994) [1439030405:331775][4460:2004759424], file:WiredTiger.wt, sweep-server: WiredTiger.wt SetEndOfFile error: The handle is invalid.
 
  he file specified.
 
  a\journal
2015-08-08T06:40:05.332-0400 E STORAGE  [initandlisten] WiredTiger (-28994) [1439030405:332775][4460:2004759424], file:WiredTiger.wt, sweep-server: Final close of file:WiredTiger.wt failed: The handle is invalid.
 
  he file specified.
 
  a\journal
2015-08-08T06:40:05.338-0400 I -        [initandlisten] Assertion: 28595:-28998: The system cannot find the file specified.
2015-08-08T06:40:05.366-0400 I STORAGE  [initandlisten] exception in initAndListen: 28595 -28998: The system cannot find the file specified.
 
, terminating
2015-08-08T06:40:05.366-0400 I CONTROL  [initandlisten] dbexit:  rc: 49
2015-08-08T06:40:23.877-0400 I CONTROL  ***** SERVER RESTARTED *****
2015-08-08T06:40:23.877-0400 I CONTROL  Hotfix KB2731284 or later update is not installed, will zero-out data files
2015-08-08T06:40:24.279-0400 I CONTROL  Trying to start Windows service 'MongoDB'
2015-08-08T06:40:24.281-0400 I STORAGE  Service running
2015-08-08T06:40:24.510-0400 W -        [initandlisten] Detected unclean shutdown - \\STORAGE\FOLDER\mongod.lock is not empty.
2015-08-08T06:40:24.513-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-08-08T06:40:24.514-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=4G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-08-08T06:40:34.965-0400 I CONTROL  [initandlisten] MongoDB starting : pid=1728 port=27017 dbpath=\STORAGE\FOLDER 64-bit host=miragemgt
2015-08-08T06:40:34.965-0400 I CONTROL  [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2015-08-08T06:40:34.965-0400 I CONTROL  [initandlisten] db version v3.0.2
2015-08-08T06:40:34.966-0400 I CONTROL  [initandlisten] git version: 6201872043ecbbc0a4cc169b5482dcf385fc464f
2015-08-08T06:40:34.966-0400 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1m-fips 19 Mar 2015
2015-08-08T06:40:34.966-0400 I CONTROL  [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
2015-08-08T06:40:34.966-0400 I CONTROL  [initandlisten] allocator: system
2015-08-08T06:40:34.966-0400 I CONTROL  [initandlisten] options: { config: "C:\mongod1.cfg", net: { port: 27017 }, operationProfiling: { slowOpThresholdMs: 5000 }, replication: { oplogSizeMB: 2048, replSetName: "REPLICA-NAME" }, security: { authorization: "enabled", keyFile: "C:\securityFile.txt" }, service: true, storage: { dbPath: "\\STORAGE\FOLDER", engine: "wiredTiger", journal: { enabled: true }, wiredTiger: { engineConfig: { cacheSizeGB: 4 } } }, systemLog: { destination: "file", logAppend: true, path: "C:\logs\mongod1.log", quiet: true } }
2015-08-08T06:40:35.012-0400 I REPL     [initandlisten] Did not find local replica set configuration document at startup;  NoMatchingDocument Did not find replica set configuration document in local.system.replset

After which we identify the replica configuration disappeared.
We recall rs.initiate() which succeeds, but than find the entire database data is gone, we've only one database (other than the default admin) with only one collection, and we see it contains no data.

Why did this happen? what should've we done differently?



 Comments   
Comment by Ramon Fernandez Marina [ 20/Aug/15 ]

Thanks for your report sternr-windowslive, and sorry you run into this. From the log snippets above it seems that the following sequence of events took place:

  • mongod aborted normal operation when the storage disappeared under it
  • mongod was restarted, detected the unclean shutdown, and tried to recover from the last checkpoint, but could not find the necessary journal files
  • mongod then tried to open the WiredTiger.wt file, but failed. Chances are this file was corrupted due to an incorrect filesystem shutdown. Unable to find this critical file, mongod stopped.
  • The last restart in the logs appears to have found a clean environment, without database files or replica set configuration

Without knowing more details about what exactly happened during these events is very difficult to piece the puzzle together, but it seems clear that mongod was not able to cope with the corruption created by the incident with the storage layer.

As for future recommendations, the MongoDB Operations Best Practices white paper is a good place to start.

Since the SERVER project is for reporting bugs or feature suggestions for the MongoDB server I'm going to close this ticket. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Generated at Thu Feb 08 03:53:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.