Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-6387

Check before --repair touches data to verify the disk has enough space

    • Type: Icon: Improvement Improvement
    • Resolution: Won't Fix
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 2.0.4
    • Component/s: Admin, MMAPv1, Stability
    • Labels:
      None
    • Environment:
      Ubuntu 11.10, EC2
    • Storage Execution

      I recently ran a --repair on one of my secondaries and got this crash when the oplog was done building. My hunch is that the cause was not enough disk space on the volume that holds the oplog (50 GB partition for a 30 GB oplog... so just less than 2x space). Still, would have been nice if the case was handled gracefully (in this case, the oplog wasn't damaged to begin with) instead of failing & leaving the DB in a corrupt state.

      Tue Jul 10 14:31:10 [initandlisten] clone local.oplog.rs 31507327
      Tue Jul 10 14:31:17 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:31:21 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:31:25 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:31:29 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:31:33 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:31:35 [initandlisten] 32861623 objects cloned so far from collection local.oplog.rs
      Tue Jul 10 14:31:42 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:32:10 [initandlisten] clone local.oplog.rs 33154047
      Tue Jul 10 14:32:21 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:32:23 [initandlisten] old journal file will be removed: /data/journal/j._15
      Tue Jul 10 14:32:31 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:32:36 [initandlisten] 33745416 objects cloned so far from collection local.oplog.rs
      Tue Jul 10 14:33:03 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:33:10 [initandlisten] clone local.oplog.rs 33975423
      Tue Jul 10 14:33:24 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:33:37 [initandlisten] 34269926 objects cloned so far from collection local.oplog.rs
      Tue Jul 10 14:33:39 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
      Tue Jul 10 14:33:43 [initandlisten] build index local.me { _id: 1 }
      Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0.023 secs
      Tue Jul 10 14:33:44 [initandlisten] build index local.replset.minvalid { _id: 1 }
      Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0 secs
      Tue Jul 10 14:33:44 [initandlisten] build index local.slaves { _id: 1 }
      Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0.001 secs
      Tue Jul 10 14:33:49 [initandlisten] Assertion: 14043:delete data files with a directoryperdb caught boost exception: boost::filesystem::remove: Device or resource busy: "/data/local"
      0x584102 0x8bbb38 0x8bcb49 0xa91ae8 0xa92d9a 0xa94874 0xa951ad 0xa9eeda 0x7f20448c630d 0x501409 
       /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x584102]
       /usr/bin/mongod(_ZN5mongo16_deleteDataFilesEPKc+0x298) [0x8bbb38]
       /usr/bin/mongod(_ZN5mongo14repairDatabaseESsRSsbb+0x7b9) [0x8bcb49]
       /usr/bin/mongod(_ZN5mongo11doDBUpgradeERKSsSsPNS_14DataFileHeaderE+0x68) [0xa91ae8]
       /usr/bin/mongod() [0xa92d9a]
       /usr/bin/mongod(_ZN5mongo14_initAndListenEi+0x404) [0xa94874]
       /usr/bin/mongod(_ZN5mongo13initAndListenEi+0x1d) [0xa951ad]
       /usr/bin/mongod(main+0x9aaa) [0xa9eeda]
       /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f20448c630d]
       /usr/bin/mongod(__gxx_personality_v0+0x3f1) [0x501409]
      Tue Jul 10 14:33:49 [initandlisten] exception in initAndListen: 14043 delete data files with a directoryperdb caught boost exception: boost::filesystem::remove: Device or resource busy: "/data/local", terminating
      

      After restarting the server from that error, I got this. I'm guessing that's just a byproduct of the problems caused by the repair failing, but figured it could be helpful. After a third restart, it successfully connected to the replica set primary and began a resync.

      Tue Jul 10 14:39:33 [initandlisten] connection accepted from 10.170.90.116:50480 #101
      Tue Jul 10 14:39:34 [rsStart] trying to contact ec2-50-18-168-160.us-west-1.compute.amazonaws.com:27017
      Tue Jul 10 14:39:34 [rsStart] trying to contact ec2-50-18-235-86.us-west-1.compute.amazonaws.com:27017
      Tue Jul 10 14:39:34 Invalid access at address: 0xfffffff7
      
      Tue Jul 10 14:39:34 Got signal: 11 (Segmentation fault).
      
      Tue Jul 10 14:39:34 Backtrace:
      0xa90999 0xa90f70 0x7fb6904a4060 0x7fb6902314fb 0x7c80cd 0x7c9a47 0x7c9f3d 0xaab3e0 0x7fb69049befc 0x7fb68fa3589d 
       /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa90999]
       /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa90f70]
       /lib/x86_64-linux-gnu/libpthread.so.0(+0x10060) [0x7fb6904a4060]
       /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSsC1ERKSs+0xb) [0x7fb6902314fb]
       /usr/bin/mongod(_ZN5mongo11ReplSetImpl10loadConfigEv+0x32d) [0x7c80cd]
       /usr/bin/mongod(_ZN5mongo11ReplSetImplC2ERNS_14ReplSetCmdlineE+0x3f7) [0x7c9a47]
       /usr/bin/mongod(_ZN5mongo13startReplSetsEPNS_14ReplSetCmdlineE+0x5d) [0x7c9f3d]
       /usr/bin/mongod(thread_proxy+0x80) [0xaab3e0]
       /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7fb69049befc]
       /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb68fa3589d]
      
      Logstream::get called in uninitialized state
      Tue Jul 10 14:39:34 ERROR: Client::shutdown not called: rsStart
      

            Assignee:
            backlog-server-execution [DO NOT USE] Backlog - Storage Execution Team
            Reporter:
            adamaflynn Adam Flynn
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: