Type: Improvement
Resolution: Won't Fix
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.0.4
Component/s: Admin, MMAPv1, Stability
Labels: None
Environment: Ubuntu 11.10, EC2
Assigned Teams: Storage Execution

I recently ran a --repair on one of my secondaries, and it crashed once the oplog had finished rebuilding. My hunch is that the volume holding the oplog ran out of disk space: it is a 50 GB partition for a 30 GB oplog, so just under 2x the oplog size. Even so, it would have been nice if this case had been handled gracefully instead of failing and leaving the DB in a corrupt state, especially since the oplog wasn't damaged to begin with.

Tue Jul 10 14:31:10 [initandlisten] clone local.oplog.rs 31507327
Tue Jul 10 14:31:17 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:21 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:25 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:29 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:33 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:35 [initandlisten] 32861623 objects cloned so far from collection local.oplog.rs
Tue Jul 10 14:31:42 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:32:10 [initandlisten] clone local.oplog.rs 33154047
Tue Jul 10 14:32:21 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:32:23 [initandlisten] old journal file will be removed: /data/journal/j._15
Tue Jul 10 14:32:31 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:32:36 [initandlisten] 33745416 objects cloned so far from collection local.oplog.rs
Tue Jul 10 14:33:03 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:33:10 [initandlisten] clone local.oplog.rs 33975423
Tue Jul 10 14:33:24 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:33:37 [initandlisten] 34269926 objects cloned so far from collection local.oplog.rs
Tue Jul 10 14:33:39 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:33:43 [initandlisten] build index local.me { _id: 1 }
Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0.023 secs
Tue Jul 10 14:33:44 [initandlisten] build index local.replset.minvalid { _id: 1 }
Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0 secs
Tue Jul 10 14:33:44 [initandlisten] build index local.slaves { _id: 1 }
Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0.001 secs
Tue Jul 10 14:33:49 [initandlisten] Assertion: 14043:delete data files with a directoryperdb caught boost exception: boost::filesystem::remove: Device or resource busy: "/data/local"
0x584102 0x8bbb38 0x8bcb49 0xa91ae8 0xa92d9a 0xa94874 0xa951ad 0xa9eeda 0x7f20448c630d 0x501409 
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x584102]
 /usr/bin/mongod(_ZN5mongo16_deleteDataFilesEPKc+0x298) [0x8bbb38]
 /usr/bin/mongod(_ZN5mongo14repairDatabaseESsRSsbb+0x7b9) [0x8bcb49]
 /usr/bin/mongod(_ZN5mongo11doDBUpgradeERKSsSsPNS_14DataFileHeaderE+0x68) [0xa91ae8]
 /usr/bin/mongod() [0xa92d9a]
 /usr/bin/mongod(_ZN5mongo14_initAndListenEi+0x404) [0xa94874]
 /usr/bin/mongod(_ZN5mongo13initAndListenEi+0x1d) [0xa951ad]
 /usr/bin/mongod(main+0x9aaa) [0xa9eeda]
 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f20448c630d]
 /usr/bin/mongod(__gxx_personality_v0+0x3f1) [0x501409]
Tue Jul 10 14:33:49 [initandlisten] exception in initAndListen: 14043 delete data files with a directoryperdb caught boost exception: boost::filesystem::remove: Device or resource busy: "/data/local", terminating
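
The disk-space hunch in the description can be checked with simple arithmetic before starting a repair. The sketch below is illustrative only: the function names `repair_fits` and `free_space_ok` are hypothetical, and the 2x factor is the reporter's rule of thumb rather than a confirmed requirement. Note also that the log above reports "Device or resource busy", not an out-of-space error, so the hunch itself is unconfirmed.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def repair_fits(volume_bytes: int, data_bytes: int, factor: float = 2.0) -> bool:
    """Return True if a volume can hold `factor` times the data size.

    The default factor of 2.0 is an assumption taken from the report
    (room for the original files plus one rebuilt copy).
    """
    return volume_bytes >= data_bytes * factor


def free_space_ok(path: str, data_bytes: int) -> bool:
    """Return True if the volume containing `path` currently has enough
    free space for one extra copy of the data (hypothetical helper)."""
    return shutil.disk_usage(path).free >= data_bytes


if __name__ == "__main__":
    # Figures from the report: 50 GB partition, 30 GB oplog.
    print(repair_fits(50 * GIB, 30 * GIB))  # False: 50 < 2 * 30
```

With the reported sizes the check fails, which is consistent with the "just less than 2x space" observation. `free_space_ok` would need to be pointed at the actual data directory on the affected host.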

After restarting the server following that error, I got the segmentation fault below. I'm guessing it's just a byproduct of the failed repair, but figured it could be helpful. After a third restart, the server successfully connected to the replica set primary and began a resync.

Tue Jul 10 14:39:33 [initandlisten] connection accepted from 10.170.90.116:50480 #101
Tue Jul 10 14:39:34 [rsStart] trying to contact ec2-50-18-168-160.us-west-1.compute.amazonaws.com:27017
Tue Jul 10 14:39:34 [rsStart] trying to contact ec2-50-18-235-86.us-west-1.compute.amazonaws.com:27017
Tue Jul 10 14:39:34 Invalid access at address: 0xfffffff7

Tue Jul 10 14:39:34 Got signal: 11 (Segmentation fault).

Tue Jul 10 14:39:34 Backtrace:
0xa90999 0xa90f70 0x7fb6904a4060 0x7fb6902314fb 0x7c80cd 0x7c9a47 0x7c9f3d 0xaab3e0 0x7fb69049befc 0x7fb68fa3589d 
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa90999]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa90f70]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x10060) [0x7fb6904a4060]
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSsC1ERKSs+0xb) [0x7fb6902314fb]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10loadConfigEv+0x32d) [0x7c80cd]
 /usr/bin/mongod(_ZN5mongo11ReplSetImplC2ERNS_14ReplSetCmdlineE+0x3f7) [0x7c9a47]
 /usr/bin/mongod(_ZN5mongo13startReplSetsEPNS_14ReplSetCmdlineE+0x5d) [0x7c9f3d]
 /usr/bin/mongod(thread_proxy+0x80) [0xaab3e0]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7fb69049befc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb68fa3589d]

Logstream::get called in uninitialized state
Tue Jul 10 14:39:34 ERROR: Client::shutdown not called: rsStart

Is related to: SERVER-3759 filesystem ops may cause termination when no space left on device (Closed)

Assignee: [DO NOT USE] Backlog - Storage Execution Team
Reporter: Adam Flynn
Participants: [DO NOT USE] Backlog - Storage Execution Team, Adam Flynn, Eliot Horowitz
Votes: 0
Watchers: 7
Created: Jul 10 2012 09:46:37 PM UTC
Updated: Dec 06 2022 05:31:39 AM UTC
Resolved: Sep 14 2018 08:14:01 PM UTC
