[SERVER-30836] The storage.bson file may become empty after startup Created: 25/Aug/17  Updated: 30/Oct/23  Resolved: 15/Sep/17

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.5.12
Fix Version/s: 3.6.0-rc0

Type: Bug Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Daniel Gottlieb (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File pc_val_WT-3.tar     File server30386.diff    
Issue Links:
Related
related to SERVER-60014 provide single utility library for fs... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Storage 2017-10-02
Participants:

 Description   

During the powercycle testing the following fatal assertion occurred:

081-0400 I CONTROL  [main] ***** SERVER RESTARTED *****
2017-08-25T13:39:12.286-0400 I CONTROL  [initandlisten] MongoDB starting : pid=1594 port=37017 dbpath=/home/jonathan/pc_val_WT-3/data/db 64-bit host=CAP-JA-Ubuntu14
2017-08-25T13:39:12.286-0400 I CONTROL  [initandlisten] db version v3.5.11-251-g3128427f77
2017-08-25T13:39:12.286-0400 I CONTROL  [initandlisten] git version: 3128427f77397916967e3eb684dfed845bb1638a
2017-08-25T13:39:12.286-0400 I CONTROL  [initandlisten] allocator: tcmalloc
2017-08-25T13:39:12.286-0400 I CONTROL  [initandlisten] modules: none
2017-08-25T13:39:12.286-0400 I CONTROL  [initandlisten] build environment:
2017-08-25T13:39:12.286-0400 I CONTROL  [initandlisten]     distarch: x86_64
2017-08-25T13:39:12.286-0400 I CONTROL  [initandlisten]     target_arch: x86_64
2017-08-25T13:39:12.286-0400 I CONTROL  [initandlisten] options: { net: { bindIp: "0.0.0.0", port: 37017 }, processManagement: { fork: true }, setParameter: { logLevel: "3" }, storage: { dbPath: "/home/jonathan/pc_val_WT-3/data/db", engine: "wiredTiger" }, systemLog: { destination: "file", logAppend: true, path: "/home/jonathan/pc_val_WT-3/log/mongod.log" } }
2017-08-25T13:39:12.305-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/pc_val_WT-3/data/db/mongod.lock is not empty.
2017-08-25T13:39:12.306-0400 E STORAGE  [initandlisten] Unable to read the storage engine metadata file: InvalidPath: Metadata file /home/jonathan/pc_val_WT-3/data/db/storage.bson cannot be empty.
2017-08-25T13:39:12.306-0400 F -        [initandlisten] Fatal Assertion 28661 at src/mongo/db/storage/storage_engine_metadata.cpp 76
2017-08-25T13:39:12.306-0400 F -        [initandlisten] 
 
***aborting after fassert() failure
 
 
2017-08-25T13:39:12.769-0400 F -        [initandlisten] Got signal: 6 (Aborted).
 
 0x7f1bdcbf2f11 0x7f1bdcbf2129 0x7f1bdcbf260d 0x7f1bd9e00330 0x7f1bd9a61c37 0x7f1bd9a65028 0x7f1bdb42bfef 0x7f1bdb9895d0 0x7f1bdb989758 0x7f1bdb69cfb1 0x7f1bdb3c619b 0x7f1bdb4987bc 0x7f1bdb42dfa9 0x7f1bd9a4cf45 0x7f1bdb488581
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"7F1BDAB5C000","o":"2096F11","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7F1BDAB5C000","o":"2096129"},{"b":"7F1BDAB5C000","o":"209660D"},{"b":"7F1BD9DF0000","o":"10330"},{"b":"7F1BD9A2B000","o":"36C37","s":"gsignal"},{"b":"7F1BD9A2B000","o":"3A028","s":"abort"},{"b":"7F1BDAB5C000","o":"8CFFEF","s":"_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj"},{"b":"7F1BDAB5C000","o":"E2D5D0","s":"_ZN5mongo21StorageEngineMetadata7forPathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"},{"b":"7F1BDAB5C000","o":"E2D758","s":"_ZN5mongo21StorageEngineMetadata23getStorageEngineForPathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"},{"b":"7F1BDAB5C000","o":"B40FB1","s":"_ZN5mongo20ServiceContextMongoD29initializeGlobalStorageEngineEv"},{"b":"7F1BDAB5C000","o":"86A19B"},{"b":"7F1BDAB5C000","o":"93C7BC","s":"_ZN5mongo11mongoDbMainEiPPcS1_"},{"b":"7F1BDAB5C000","o":"8D1FA9","s":"main"},{"b":"7F1BD9A2B000","o":"21F45","s":"__libc_start_main"},{"b":"7F1BDAB5C000","o":"92C581"}],"processInfo":{ "mongodbVersion" : "3.5.11-251-g3128427f77", "gitVersion" : "3128427f77397916967e3eb684dfed845bb1638a", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.16.0-31-generic", "version" : "#43~14.04.1-Ubuntu SMP Tue Mar 10 20:13:38 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "b" : "7F1BDAB5C000", "elfType" : 3, "buildId" : "B10A43F8CCC8E0D4E054EDBA11F9129CFF7FF704" }, { "b" : "7FFF99AF1000", "elfType" : 3, "buildId" : "5552B9335DDE9349419BA10896C1E75C9432A946" }, { "b" : "7F1BDA72F000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "E2A6DD5048A0A051FD61043BDB69D8CC68192AB7" }, { "b" : "7F1BDA52B000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "DA9B8C234D0FE9FD8CAAC8970A7EC1B6C8F6623F" }, { "b" : "7F1BDA225000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "D144258E614900B255A31F3FD2283A878670D5BC" }, { "b" : "7F1BDA00E000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "E4AFDDDF7C56BCC9B8DD5EBD99DA89B02C17F7FD" }, { "b" : "7F1BD9DF0000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "31E9F21AE8C10396171F1E13DA15780986FA696C" }, { "b" : "7F1BD9A2B000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "CF699A15CAAE64F50311FC4655B86DC39A479789" }, { "b" : "7F1BDA937000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "D0F537904076D73F29E4A37341F8A449E2EF6CD0" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7f1bdcbf2f11]
 mongod(+0x2096129) [0x7f1bdcbf2129]
 mongod(+0x209660D) [0x7f1bdcbf260d]
 libpthread.so.0(+0x10330) [0x7f1bd9e00330]
 libc.so.6(gsignal+0x37) [0x7f1bd9a61c37]
 libc.so.6(abort+0x148) [0x7f1bd9a65028]
 mongod(_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj+0x0) [0x7f1bdb42bfef]
 mongod(_ZN5mongo21StorageEngineMetadata7forPathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x2E0) [0x7f1bdb9895d0]
 mongod(_ZN5mongo21StorageEngineMetadata23getStorageEngineForPathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x38) [0x7f1bdb989758]
 mongod(_ZN5mongo20ServiceContextMongoD29initializeGlobalStorageEngineEv+0x91) [0x7f1bdb69cfb1]
 mongod(+0x86A19B) [0x7f1bdb3c619b]
 mongod(_ZN5mongo11mongoDbMainEiPPcS1_+0x86C) [0x7f1bdb4987bc]
 mongod(main+0x9) [0x7f1bdb42dfa9]
 libc.so.6(__libc_start_main+0xF5) [0x7f1bd9a4cf45]
 mongod(+0x92C581) [0x7f1bdb488581]
-----  END BACKTRACE  -----



 Comments   
Comment by Ramon Fernandez Marina [ 15/Sep/17 ]

Author:

{'username': u'dgottlieb', 'name': u'Daniel Gottlieb', 'email': u'daniel.gottlieb@mongodb.com'}

Message:SERVER-30836: Dont rely on boost's file rename functionality for necessary fsyncing on POSIX
Branch:master
https://github.com/mongodb/mongo/commit/33908bf59707e36b43ac58b50da4bde3e9cbae84

Comment by Jonathan Abrahams [ 13/Sep/17 ]

The provided patch does protect the storage,bson from becoming empty during system crashes.

Comment by Jonathan Abrahams [ 12/Sep/17 ]

It seems this ticket has morphed, as the original issue of asserting when the storage.bson is empty is "works as designed".

The secondary issue of the file becoming empty is the issue of interest and the ticket is being retitled.

Comment by Daniel Gottlieb (Inactive) [ 11/Sep/17 ]

Thanks for the logs. It seems there are two issues going on that I want to discuss separately.
1) Starting up mongod with an empty storage.bson file is a crash. It seems reasonable for this case to have the same behavior as the storage.bson file not existing. At the very least it shouldn't crash but result in an appropriate error message. With regards to recreating the storage.bson file when it does not exist: I'm not sure how that does the right thing (I'm skeptical that's by accident from the upgrade process when storage.bson files were introduced), so I think any work that has the empty storage.bson behave similarly should evaluate that the code.

2) The issue I'm more concerned with is a mongod that can come up and after a process/node crash and the storage.bson file not being intact. From our conversation of when the crash occurs, the crash is only issued after the server starts up and a client can connect. It's my expectation that the storage.bson file is safely persisted on disk at this time. I've found that our creation of the storage.bson file, which is done via a rename, doesn't do all of the "tricks" POSIX renames do[1][2]. I've attached a patch that better conforms to what I've seen other implementations do. Can you try running the patched version by hand and see if the problem becomes less frequent (or hopefully just goes away)?

[1]https://github.com/wiredtiger/wiredtiger/blob/95d911ab246e444192f34dc395652dba2653dd3c/src/os_posix/os_fs.c#L214-L223
[2]https://github.com/HewlettPackard/sparkle/blob/b0ae7cc87158ba5941616ecd16aec2c45aba9157/alps/src/common/os.hh#L104-L119

Comment by Jonathan Abrahams [ 08/Sep/17 ]

Note, if the storage.bson file does not exist then mongod will create one.

Comment by Jonathan Abrahams [ 08/Sep/17 ]

Steps to reproduce:

  • Start mongod
  • Stop mongod
  • Create empty storage.bson:
    • rm storage.bson
    • touch storage.bson
  • Start mongod
Comment by Jonathan Abrahams [ 07/Sep/17 ]

Attached tar file has data and logs.

Comment by Eric Milkie [ 07/Sep/17 ]

jonathan.abrahams I'm tabling work on this until it reproduces again and we can get log and data files to help diagnose further.

Comment by Daniel Gottlieb (Inactive) [ 01/Sep/17 ]

jonathan.abrahams Can you attach the entire dbpath after one of these failures? And the mongod log files as that seems to be in a separate path.

Comment by Ian Whalen (Inactive) [ 01/Sep/17 ]

Assigning to Dan to investigate a bit further.

Generated at Thu Feb 08 04:25:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.