[SERVER-6387] Check before --repair touches data to verify the disk has enough space Created: 10/Jul/12  Updated: 06/Dec/22  Resolved: 14/Sep/18

Status: Closed
Project: Core Server
Component/s: Admin, MMAPv1, Stability
Affects Version/s: 2.0.4
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Adam Flynn Assignee: Backlog - Storage Execution Team
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 11.10, EC2


Issue Links:
Related
is related to SERVER-3759 filesystem ops may cause termination ... Closed
Assigned Teams:
Storage Execution
Participants:

 Description   

I recently ran a --repair on one of my secondaries and got the crash below when the oplog was done being rebuilt. My hunch is that the cause was insufficient disk space on the volume that holds the oplog (a 50 GB partition for a 30 GB oplog, so just under 2x the space). Still, it would have been nice if this case were handled gracefully (the oplog wasn't damaged to begin with) instead of failing and leaving the DB in a corrupt state.

Tue Jul 10 14:31:10 [initandlisten] clone local.oplog.rs 31507327
Tue Jul 10 14:31:17 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:21 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:25 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:29 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:33 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:35 [initandlisten] 32861623 objects cloned so far from collection local.oplog.rs
Tue Jul 10 14:31:42 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:32:10 [initandlisten] clone local.oplog.rs 33154047
Tue Jul 10 14:32:21 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:32:23 [initandlisten] old journal file will be removed: /data/journal/j._15
Tue Jul 10 14:32:31 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:32:36 [initandlisten] 33745416 objects cloned so far from collection local.oplog.rs
Tue Jul 10 14:33:03 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:33:10 [initandlisten] clone local.oplog.rs 33975423
Tue Jul 10 14:33:24 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:33:37 [initandlisten] 34269926 objects cloned so far from collection local.oplog.rs
Tue Jul 10 14:33:39 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:33:43 [initandlisten] build index local.me { _id: 1 }
Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0.023 secs
Tue Jul 10 14:33:44 [initandlisten] build index local.replset.minvalid { _id: 1 }
Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0 secs
Tue Jul 10 14:33:44 [initandlisten] build index local.slaves { _id: 1 }
Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0.001 secs
Tue Jul 10 14:33:49 [initandlisten] Assertion: 14043:delete data files with a directoryperdb caught boost exception: boost::filesystem::remove: Device or resource busy: "/data/local"
0x584102 0x8bbb38 0x8bcb49 0xa91ae8 0xa92d9a 0xa94874 0xa951ad 0xa9eeda 0x7f20448c630d 0x501409 
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x584102]
 /usr/bin/mongod(_ZN5mongo16_deleteDataFilesEPKc+0x298) [0x8bbb38]
 /usr/bin/mongod(_ZN5mongo14repairDatabaseESsRSsbb+0x7b9) [0x8bcb49]
 /usr/bin/mongod(_ZN5mongo11doDBUpgradeERKSsSsPNS_14DataFileHeaderE+0x68) [0xa91ae8]
 /usr/bin/mongod() [0xa92d9a]
 /usr/bin/mongod(_ZN5mongo14_initAndListenEi+0x404) [0xa94874]
 /usr/bin/mongod(_ZN5mongo13initAndListenEi+0x1d) [0xa951ad]
 /usr/bin/mongod(main+0x9aaa) [0xa9eeda]
 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f20448c630d]
 /usr/bin/mongod(__gxx_personality_v0+0x3f1) [0x501409]
Tue Jul 10 14:33:49 [initandlisten] exception in initAndListen: 14043 delete data files with a directoryperdb caught boost exception: boost::filesystem::remove: Device or resource busy: "/data/local", terminating

After restarting the server from that error, I got the following. I'm guessing it's just a byproduct of the problems caused by the failed repair, but I figured it could be helpful. After a third restart, the server successfully connected to the replica set primary and began a resync.

Tue Jul 10 14:39:33 [initandlisten] connection accepted from 10.170.90.116:50480 #101
Tue Jul 10 14:39:34 [rsStart] trying to contact ec2-50-18-168-160.us-west-1.compute.amazonaws.com:27017
Tue Jul 10 14:39:34 [rsStart] trying to contact ec2-50-18-235-86.us-west-1.compute.amazonaws.com:27017
Tue Jul 10 14:39:34 Invalid access at address: 0xfffffff7
 
Tue Jul 10 14:39:34 Got signal: 11 (Segmentation fault).
 
Tue Jul 10 14:39:34 Backtrace:
0xa90999 0xa90f70 0x7fb6904a4060 0x7fb6902314fb 0x7c80cd 0x7c9a47 0x7c9f3d 0xaab3e0 0x7fb69049befc 0x7fb68fa3589d 
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa90999]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa90f70]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x10060) [0x7fb6904a4060]
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSsC1ERKSs+0xb) [0x7fb6902314fb]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10loadConfigEv+0x32d) [0x7c80cd]
 /usr/bin/mongod(_ZN5mongo11ReplSetImplC2ERNS_14ReplSetCmdlineE+0x3f7) [0x7c9a47]
 /usr/bin/mongod(_ZN5mongo13startReplSetsEPNS_14ReplSetCmdlineE+0x5d) [0x7c9f3d]
 /usr/bin/mongod(thread_proxy+0x80) [0xaab3e0]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7fb69049befc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb68fa3589d]
 
Logstream::get called in uninitialized state
Tue Jul 10 14:39:34 ERROR: Client::shutdown not called: rsStart



 Comments   
Comment by Adam Flynn [ 17/Sep/12 ]

Yeah, the problem was that the volume ran out of disk space part-way through the repair. It's a fairly minor issue, and partly my fault for not double-checking that I had 2x space available on all my volumes (I was fine on the data volume, but not on the oplog volume). Still, the failure turned a valid DB that I was merely trying to compact into a corrupt one, and I had to do a full resync.

It would have been nice if there were a quick check, before repair touches any data, to verify that the disk has enough space, but it's not that big a deal.
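A minimal sketch of what such a pre-flight check could look like, written as a standalone Python helper rather than actual mongod code. It assumes --repair needs roughly as much free space on the volume as the data it is rebuilding (repair writes new copies of the data files into a temporary directory before swapping them in); `has_repair_headroom` and the `factor` parameter are hypothetical names for illustration:

```python
# Hypothetical pre-flight check (not actual mongod code): verify the volume
# holding dbpath has roughly as much free space as the data under dbpath.
import os
import shutil

def has_repair_headroom(dbpath, factor=1.0):
    """Return True if free space on dbpath's volume is at least
    `factor` times the current size of the data under dbpath."""
    used = 0
    for root, _dirs, files in os.walk(dbpath):
        for name in files:
            try:
                used += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished mid-walk; skip it
    free = shutil.disk_usage(dbpath).free
    return free >= factor * used
```

Running a check like this for every volume involved (including a separately mounted oplog directory, as in this report) before any files are moved would turn this crash into a clean startup error.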

Comment by Eliot Horowitz (Inactive) [ 16/Sep/12 ]

I know it's been a while, but it looks like a hardware issue.
Can you check?

Generated at Thu Feb 08 03:11:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.