[SERVER-5393] Killing mongod during a repair can leave it unable to start up
Created: 24/Mar/12  Updated: 11/Jul/16  Resolved: 01/Jun/12

Status: Closed
Project: Core Server
Component/s: Stability, Storage
Affects Version/s: None
Fix Version/s: 2.1.2

Type: Bug
Priority: Major - P3
Reporter: Spencer Brody (Inactive)
Assignee: Mathias Stearn
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Operating System: ALL
Participants:

 Description   

Hard-killing a mongod process at the wrong moment during a repair can leave it unable to start up.

Steps to repro:

  • Create a database with lots of large documents (in my test I inserted ~15,000 1MB documents)
  • Run a repair, and after it has been running for some time kill the process with kill -9
  • Try to start up mongod normally

When I do, I see the following output:

mongod --port 12345
Sat Mar 24 13:12:38 [initandlisten] MongoDB starting : pid=16292 port=12345 dbpath=/data/db/ 64-bit host=Spencer-MacBook.local
Sat Mar 24 13:12:38 [initandlisten] db version v2.0.2, pdfile version 4.5
Sat Mar 24 13:12:38 [initandlisten] git version: 514b122d308928517f5841888ceaa4246a7f18e3
Sat Mar 24 13:12:38 [initandlisten] build info: Darwin Spencer-MacBook.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:32:41 PDT 2011; root:xnu-1504.15.3~1/RELEASE_X86_64 x86_64 BOOST_LIB_VERSION=1_47
Sat Mar 24 13:12:38 [initandlisten] options: { port: 12345 }
Sat Mar 24 13:12:38 [initandlisten] journal dir=/data/db/journal
Sat Mar 24 13:12:38 [initandlisten] recover begin
Sat Mar 24 13:12:38 [initandlisten] recover lsn: 0
Sat Mar 24 13:12:38 [initandlisten] recover /data/db/journal/j._3
Sat Mar 24 13:12:38 [initandlisten] exception during recovery
Sat Mar 24 13:12:38 [initandlisten] exception in initAndListen std::exception: boost::filesystem::file_size: No such file or directory: "/data/db/$tmp_repairDatabase_1/test.ns", terminating
Sat Mar 24 13:12:38 dbexit:
Sat Mar 24 13:12:38 [initandlisten] shutdown: going to close listening sockets...
Sat Mar 24 13:12:38 [initandlisten] shutdown: going to flush diaglog...
Sat Mar 24 13:12:38 [initandlisten] shutdown: going to close sockets...
Sat Mar 24 13:12:38 [initandlisten] shutdown: waiting for fs preallocator...
Sat Mar 24 13:12:38 [initandlisten] shutdown: lock for final commit...
Sat Mar 24 13:12:38 [initandlisten] shutdown: final commit...
Sat Mar 24 13:12:38 [initandlisten] shutdown: closing all files...
Sat Mar 24 13:12:38 [initandlisten] closeAllFiles() finished
Sat Mar 24 13:12:38 [initandlisten] shutdown: removing fs lock...
Sat Mar 24 13:12:38 dbexit: really exiting now

Note that I wasn't able to reproduce this consistently; I had to run the repair and kill it several times before it happened.
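
Because the failure is intermittent, the repro steps above lend themselves to a retry loop. The following is only a rough sketch, not the script used for this report. It assumes the repair is started with mongod's --repair startup option (the report does not say how the repair was run), that the dbpath already holds the large test database, and that any early exit of mongod counts as a startup failure. The function names, the attempt count, and the 60-second delay are hypothetical.

```python
import subprocess
import time


def repair_cmd(dbpath, port):
    # Assumption: the repair is started via mongod's --repair startup option.
    return ["mongod", "--repair", "--dbpath", dbpath, "--port", str(port)]


def start_cmd(dbpath, port):
    # A normal startup, as in the report: mongod --port 12345
    return ["mongod", "--dbpath", dbpath, "--port", str(port)]


def kill_during_repair(dbpath, port, delay_s):
    """Start a repair, let it run for delay_s seconds, then hard-kill it."""
    proc = subprocess.Popen(repair_cmd(dbpath, port))
    time.sleep(delay_s)
    proc.kill()  # sends SIGKILL on POSIX, like `kill -9`
    proc.wait()


def startup_fails(dbpath, port, wait_s=30):
    """Return True if mongod exits on its own within wait_s seconds."""
    proc = subprocess.Popen(start_cmd(dbpath, port))
    try:
        proc.wait(timeout=wait_s)
        return True
    except subprocess.TimeoutExpired:
        proc.terminate()  # still running, so startup succeeded; shut it down
        proc.wait()
        return False


def repro(dbpath="/data/db/", port=12345, attempts=10, delay_s=60):
    """Repeat kill-during-repair until a startup failure is observed."""
    for attempt in range(1, attempts + 1):
        kill_during_repair(dbpath, port, delay_s)
        if startup_fails(dbpath, port):
            return attempt  # number of tries it took to hit the bug
    return None  # not reproduced within the given attempts
```

Calling repro() returns the attempt number on which startup first failed, or None if every startup succeeded. The kill delay probably needs tuning so that the kill lands late in the repair.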



 Comments   
Comment by auto [ 01/Jun/12 ]

Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>

Message: Move call to syncDataAndTruncateJournal higher in repair SERVER-5393

Previously we could have exited with the journal referring to files that
had been deleted. By truncating the journal earlier we can avoid that.
The call to flushAll was also moved to make intent clearer to future
maintainers.
Branch: master
https://github.com/mongodb/mongo/commit/ca02fab3d15f1075235cf80d271e8a2c77bf1217
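
The ordering argument in the commit message can be illustrated with a toy model. This is a sketch under simplifying assumptions, not MongoDB's actual code: the journal is modeled as a list of file paths, truncation as clearing that list, and recovery as failing when a journaled path no longer exists. All function names here are hypothetical.

```python
def recover(journal, files):
    """Replay the journal; fail if an entry names a file that is gone."""
    for path in journal:
        if path not in files:
            raise FileNotFoundError(path)


def finish_repair(journal, files, tmp_files, truncate_first, crash_between):
    """Run the two cleanup steps in the chosen order, optionally crashing between them."""
    def truncate_journal():
        journal.clear()

    def delete_tmp_files():
        files.difference_update(tmp_files)

    steps = [truncate_journal, delete_tmp_files]
    if not truncate_first:
        steps.reverse()  # old order: delete files, then truncate
    steps[0]()
    if crash_between:
        return  # simulated kill -9 between the two steps
    steps[1]()


def startup_ok_after_crash(truncate_first):
    """Crash between the two steps, then report whether recovery succeeds."""
    tmp = "$tmp_repairDatabase_1/test.ns"
    journal = [tmp]          # journal still references the temporary repair file
    files = {"test.ns", tmp}
    finish_repair(journal, files, {tmp}, truncate_first, crash_between=True)
    try:
        recover(journal, files)
        return True
    except FileNotFoundError:
        return False
```

In this model, startup_ok_after_crash(False) returns False, because the files are deleted while the journal still names them. startup_ok_after_crash(True) returns True, because the journal is already empty when the crash happens.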

Comment by Spencer Brody (Inactive) [ 24/Mar/12 ]

I uploaded the data files for a db that won't start up after being killed during a repair.

scp -P 722 corruptedRepair.tar.gz spencerCorruptedRepair@www.10gen.com:
