[SERVER-26924] Cannot start or --repair mongod because of unclean shutdown (due to running out of disk space) Created: 07/Nov/16  Updated: 14/Nov/16  Resolved: 14/Nov/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ali Punjani Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File WiredTiger     File WiredTiger.lock     File WiredTiger.turtle     File WiredTiger.wt     File WiredTigerLAS.wt     File mongorepair.py     File sizeStorer.wt    
Issue Links:
Duplicate
duplicates WT-2897 Checkpoints can become corrupted on f... Closed
Related
related to SERVER-19815 Improved mongod --repair option for W... Closed
Operating System: ALL
Participants:

 Description   

I have a small 1.7GB database that was running fine, but the drive it was on got filled up and so mongod crashed. Subsequently it's impossible to start mongod or to --repair, with the following error:

$ mongod --dbpath db --repair
2016-11-06T23:50:43.427-0500 I CONTROL  [initandlisten] MongoDB starting : pid=15289 port=27017 dbpath=db 64-bit host=cryoem
2016-11-06T23:50:43.432-0500 I CONTROL  [initandlisten] db version v3.2.9
2016-11-06T23:50:43.432-0500 I CONTROL  [initandlisten] git version: 22ec9e93b40c85fc7cae7d56e7d6a02fd811088c
2016-11-06T23:50:43.432-0500 I CONTROL  [initandlisten] allocator: tcmalloc
2016-11-06T23:50:43.432-0500 I CONTROL  [initandlisten] modules: none
2016-11-06T23:50:43.432-0500 I CONTROL  [initandlisten] build environment:
2016-11-06T23:50:43.432-0500 I CONTROL  [initandlisten]     distarch: x86_64
2016-11-06T23:50:43.432-0500 I CONTROL  [initandlisten]     target_arch: x86_64
2016-11-06T23:50:43.432-0500 I CONTROL  [initandlisten] options: { repair: true, storage: { dbPath: "db" } }
2016-11-06T23:50:43.518-0500 I -        [initandlisten] Detected data files in db created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2016-11-06T23:50:43.518-0500 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=37G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),,log=(enabled=false),
2016-11-06T23:50:43.535-0500 E STORAGE  [initandlisten] WiredTiger (0) [1478494243:535100][15289:0x7f652cdcccc0], file:WiredTiger.wt, connection: read checksum error for 4096B block at offset 24576: block header checksum of 1970325808 doesn't match expected checksum of 3338308183
2016-11-06T23:50:43.535-0500 E STORAGE  [initandlisten] WiredTiger (0) [1478494243:535188][15289:0x7f652cdcccc0], file:WiredTiger.wt, connection: WiredTiger.wt: encountered an illegal file format or internal value
2016-11-06T23:50:43.535-0500 E STORAGE  [initandlisten] WiredTiger (-31804) [1478494243:535214][15289:0x7f652cdcccc0], file:WiredTiger.wt, connection: the process must exit and restart: WT_PANIC: WiredTiger library panic
2016-11-06T23:50:43.535-0500 I -        [initandlisten] Fatal Assertion 28558
2016-11-06T23:50:43.535-0500 I -        [initandlisten]
 
***aborting after fassert() failure
 
 
2016-11-06T23:50:43.573-0500 F -        [initandlisten] Got signal: 6 (Aborted).
 
 0x1307192 0x13060b9 0x13068c2 0x7f652c097330 0x7f652bcf8c37 0x7f652bcfc028 0x1299422 0x108a733 0x1a7e13c 0x1a7e5bd 0x1a7e9a4 0x19a9c3c 0x19aa193 0x19a6e65 0x19ab1f9 0x19c888c 0x1a00700 0x1a7d087 0x1a7d5c9 0x1a7d6eb 0x1a0f2e8 0x1a7a225 0x1a43f0f 0x1a4400e 0x19fc4e3 0x1071baf 0x106dcf3 0xf91de8 0x95f6fe 0x961c90 0x964ead 0x7f652bce3f45 0x95c0b9
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"F07192","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"F060B9"},{"b":"400000","o":"F068C2"},{"b":"7F652C087000","o":"10330"},{"b":"7F652BCC2000","o":"36C37","s":"gsignal"},{"b":"7F652BCC2000","o":"3A028","s":"abort"},{"b":"400000","o":"E99422","s":"_ZN5mongo13fassertFailedEi"},{"b":"400000","o":"C8A733"},{"b":"400000","o":"167E13C","s":"__wt_eventv"},{"b":"400000","o":"167E5BD","s":"__wt_err"},{"b":"400000","o":"167E9A4","s":"__wt_panic"},{"b":"400000","o":"15A9C3C","s":"__wt_block_extlist_read"},{"b":"400000","o":"15AA193","s":"__wt_block_extlist_read_avail"},{"b":"400000","o":"15A6E65","s":"__wt_block_checkpoint_load"},{"b":"400000","o":"15AB1F9"},{"b":"400000","o":"15C888C","s":"__wt_btree_open"},{"b":"400000","o":"1600700","s":"__wt_conn_btree_open"},{"b":"400000","o":"167D087","s":"__wt_session_get_btree"},{"b":"400000","o":"167D5C9","s":"__wt_session_get_btree"},{"b":"400000","o":"167D6EB","s":"__wt_session_get_btree_ckpt"},{"b":"400000","o":"160F2E8","s":"__wt_curfile_open"},{"b":"400000","o":"167A225"},{"b":"400000","o":"1643F0F","s":"__wt_metadata_cursor_open"},{"b":"400000","o":"164400E","s":"__wt_metadata_cursor"},{"b":"400000","o":"15FC4E3","s":"wiredtiger_open"},{"b":"400000","o":"C71BAF","s":"_ZN5mongo18WiredTigerKVEngineC2ERKSsS2_S2_mbbb"},{"b":"400000","o":"C6DCF3"},{"b":"400000","o":"B91DE8","s":"_ZN5mongo20ServiceContextMongoD29initializeGlobalStorageEngineEv"},{"b":"400000","o":"55F6FE"},{"b":"400000","o":"561C90","s":"_ZN5mongo13initAndListenEi"},{"b":"400000","o":"564EAD","s":"main"},{"b":"7F652BCC2000","o":"21F45","s":"__libc_start_main"},{"b":"400000","o":"55C0B9"}],"processInfo":{ "mongodbVersion" : "3.2.9", "gitVersion" : "22ec9e93b40c85fc7cae7d56e7d6a02fd811088c", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.16.0-76-generic", "version" : "#98~14.04.1-Ubuntu SMP Fri Jun 24 17:04:54 UTC 2016", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFFB59F4000", "elfType" : 3 }, { "b" : "7F652C9C5000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3 }, { "b" : "7F652C7C1000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3 }, { "b" : "7F652C4BB000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3 }, { "b" : "7F652C2A5000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F652C087000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3 }, { "b" : "7F652BCC2000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3 }, { "b" : "7F652CBCD000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x1307192]
 mongod(+0xF060B9) [0x13060b9]
 mongod(+0xF068C2) [0x13068c2]
 libpthread.so.0(+0x10330) [0x7f652c097330]
 libc.so.6(gsignal+0x37) [0x7f652bcf8c37]
 libc.so.6(abort+0x148) [0x7f652bcfc028]
 mongod(_ZN5mongo13fassertFailedEi+0x82) [0x1299422]
 mongod(+0xC8A733) [0x108a733]
 mongod(__wt_eventv+0x42C) [0x1a7e13c]
 mongod(__wt_err+0x8D) [0x1a7e5bd]
 mongod(__wt_panic+0x24) [0x1a7e9a4]
 mongod(__wt_block_extlist_read+0x6C) [0x19a9c3c]
 mongod(__wt_block_extlist_read_avail+0x33) [0x19aa193]
 mongod(__wt_block_checkpoint_load+0x3C5) [0x19a6e65]
 mongod(+0x15AB1F9) [0x19ab1f9]
 mongod(__wt_btree_open+0xC7C) [0x19c888c]
 mongod(__wt_conn_btree_open+0x140) [0x1a00700]
 mongod(__wt_session_get_btree+0xE7) [0x1a7d087]
 mongod(__wt_session_get_btree+0x629) [0x1a7d5c9]
 mongod(__wt_session_get_btree_ckpt+0xAB) [0x1a7d6eb]
 mongod(__wt_curfile_open+0x218) [0x1a0f2e8]
 mongod(+0x167A225) [0x1a7a225]
 mongod(__wt_metadata_cursor_open+0x5F) [0x1a43f0f]
 mongod(__wt_metadata_cursor+0x7E) [0x1a4400e]
 mongod(wiredtiger_open+0x1433) [0x19fc4e3]
 mongod(_ZN5mongo18WiredTigerKVEngineC2ERKSsS2_S2_mbbb+0x77F) [0x1071baf]
 mongod(+0xC6DCF3) [0x106dcf3]
 mongod(_ZN5mongo20ServiceContextMongoD29initializeGlobalStorageEngineEv+0x598) [0xf91de8]
 mongod(+0x55F6FE) [0x95f6fe]
 mongod(_ZN5mongo13initAndListenEi+0x10) [0x961c90]
 mongod(main+0x15D) [0x964ead]
 libc.so.6(__libc_start_main+0xF5) [0x7f652bce3f45]
 mongod(+0x55C0B9) [0x95c0b9]
-----  END BACKTRACE  -----
Aborted (core dumped)

I believe this means the WiredTiger.wt file is somehow corrupt, seems like there are a lot of other tickets like this and some kind of manual repair needs to be done. I'm attaching all the relavant non-collection/index files from the db, hopefully someone can help!
Thanks in advance.



 Comments   
Comment by Kelsey Schubert [ 08/Nov/16 ]

Hi alipunjani,

The issue you describe is likely caused by WT-2897, and I would recommend that you consider upgrading to MongoDB 3.2.10.

Thank you for taking the time to detail the steps you took to recover your data. I'm glad this resolved the issue for you, however please be aware that the steps you describe may lead to undefined states. We have an open ticket, SERVER-19815, to make the repair process with WiredTiger more robust. Please feel free to vote for it and watch it for updates.

Kind regards,
Thomas

Comment by Ali Punjani [ 08/Nov/16 ]

I managed to fix this issue myself through a lot of manual labour. For those interested, here's what I did. Based on:
http://www.alexbevi.com/blog/2016/02/10/recovering-a-wiredtiger-collection-from-a-corrupt-mongodb-installation/
https://jira.mongodb.org/browse/SERVER-23346
My database is part of a Meteor.js app, YMMV.

1) Instantiate or find a separate db that has the same collections as the corrupt db (call this the working db)
2) Instantiate a new empty mongodb called 'recoverdb' that will be used to store all the recovered collections
3) Use the python script mongorepair.py I wrote (attached) to copy over the WiredTiger.* files from the working db to the corrupt one, create a collection in the recoverdb for each collection.wt file in the corrupt db, use the wt command line tool to salvage, dump, and reload the collections from the corrupt db into the recoverdb's collection files
4) Now the recoverdb has all the collections salvaged, but they are named random names and need to be sorted by hand to figure out which was which from the original db. I use the renameCollection db.runCommand in mongo, after looking at Object.keys(db[collection-X-XXXXX.wt'].findOne()) to tell what the documents and thus collection is.
5) Now all the recovered data is back in mongo and usable.
6) profit

Comment by Ali Punjani [ 08/Nov/16 ]

Enjoy!

Comment by Ali Punjani [ 07/Nov/16 ]

Also, for further reference, here is the log from right before the crash showing what happened:

2016-11-05T20:07:10.159-0700 E STORAGE  [thread1] WiredTiger (28) [1478401630:159647][18726:0x7f19baac5700], file:collection-3--522493423219057413.wt, WT_SESSION.checkpoint: /home/cryosparc/cryosparc/cryosparc/run/db/collection-3--522493423219057413.wt: handle-write: pwrite: failed to write 61440 bytes at offset 1699704832: No space left on device
2016-11-05T20:07:10.159-0700 E STORAGE  [thread1] WiredTiger (28) [1478401630:159766][18726:0x7f19baac5700], checkpoint-server: checkpoint server error: No space left on device
2016-11-05T20:07:10.159-0700 E STORAGE  [thread1] WiredTiger (-31804) [1478401630:159785][18726:0x7f19baac5700], checkpoint-server: the process must exit and restart: WT_PANIC: WiredTiger library panic
2016-11-05T20:07:10.159-0700 I -        [thread1] Fatal Assertion 28558
2016-11-05T20:07:10.159-0700 I -        [thread1]

Generated at Thu Feb 08 04:13:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.