[SERVER-3751] mongodb crashing on repairDatabase Created: 02/Sep/11  Updated: 29/Feb/12  Resolved: 31/Jan/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.0-rc1
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Pete Brumm Assignee: Mathias Stearn
Resolution: Cannot Reproduce Votes: 0
Labels: crash, repair, replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

windows 64bit 24cpu 48gb ram san drive


Attachments: Zip Archive mongodb_logs_20110906.4-crash.zip     Zip Archive mongodb_monru02.zip     Text File mongodb_repairdatabase.log    
Issue Links:
Duplicate
Related
is related to SERVER-3891 crash on slave replication Closed
Operating System: Windows
Participants:

 Description   

we have a replica set of 3.
we can run repairdatabase on master 1 and it works fine
we stepdown 1
2 becomes master
when we attempt to repairdatabase on 2
it restarts

PRIMARY> db.repairDatabase();
{
"errmsg" : "exception: nextSafe():

{ $err: \"not master and slaveok=false\", code: 13435 }

",
"code" : 13106,
"ok" : 0
}
SECONDARY>

so it looks like it crashed and failed over to 1 again

looked at the logs and we see

Fri Sep 02 12:55:45 [conn14] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: pr_blue_spruce.sessions top: { opid: 63, active: true, lockType: "write", waitingForLock: false, secs_running: 68, op: "query", ns: "pr_blue_spruce", query:

{ repairDatabase: 1.0 }

, client: "127.0.0.1:53667", desc: "conn", msg: "index: (3/3) btree-middle", numYields: 0 }
Fri Sep 02 12:55:45 [conn14] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: pr_blue_spruce.sessions top: { opid: 63, active: true, lockType: "write", waitingForLock: false, secs_running: 68, op: "query", ns: "pr_blue_spruce", query:

{ repairDatabase: 1.0 }

, client: "127.0.0.1:53667", desc: "conn", msg: "index: (3/3) btree-middle", numYields: 0 }
Fri Sep 02 12:55:45 [conn14] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: pr_blue_spruce.sessions top: { opid: 63, active: true, lockType: "write", waitingForLock: false, secs_running: 68, op: "query", ns: "pr_blue_spruce", query:

{ repairDatabase: 1.0 }

, client: "127.0.0.1:53667", desc: "conn", msg: "index: (3/3) btree-middle", numYields: 0 }

then it keeps failing on startup and we see this exception

Fri Sep 02 12:55:57 [websvr] User Assertion: 13142:timeout getting readlock
Fri Sep 02 12:55:57 [websvr] Socket http response send() errno:0 The operation completed successfully. 192.168.16.35:36451
Fri Sep 02 12:55:57 unhandled windows exception
Fri Sep 02 12:55:57 ec=0xe06d7363
Fri Sep 02 12:55:57 [conn14] external sort used : 4 files in 11 secs
Fri Sep 02 12:55:57 [conn14] New namespace: pr_blue_spruce.sessions.$id
Fri Sep 02 12:55:57 [conn14] allocating new extent for pr_blue_spruce.sessions.$id padding:1 lenWHdr: 8192
Fri Sep 02 12:55:57 [conn14] allocating new extent for pr_blue_spruce.sessions.$id padding:1 lenWHdr: 8192
Fri Sep 02 12:55:57 [conn14] allocating new extent for pr_blue_spruce.sessions.$id padding:1 lenWHdr: 8192
Fri Sep 02 12:55:57 [conn14] allocating new extent for pr_blue_spruce.sessions.$id padding:1 lenWHdr: 8192
Fri Sep 02 12:55:57 [conn14] allocating new extent for pr_blue_spruce.sessions.$id padding:1 lenWHdr: 8192
Fri Sep 02 12:55:58 [conn14] allocating new extent for pr_blue_spruce.sessions.$id padding:1 lenWHdr: 8192
Fri Sep 02 12:55:58 [conn16] run command admin.$cmd

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru02.colo.rrgroup.com:27017" }

Fri Sep 02 12:55:58 [conn16] command admin.$cmd command:

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru02.colo.rrgroup.com:27017" }

ntoreturn:1 reslen:125 0ms
Fri Sep 02 12:55:58 [conn14] allocating new extent for pr_blue_spruce.sessions.$id padding:1 lenWHdr: 8192

we have tried wiping the db folder for 2 and having it resync a few times but the error doesn't go away.



 Comments   
Comment by Pete Brumm [ 30/Jan/12 ]

I no longer have an environment to test/replicate this in.

Comment by Mathias Stearn [ 30/Jan/12 ]

Have you been able to replicate this with 2.0-final or one of the dot releases?

Comment by Pete Brumm [ 16/Sep/11 ]

also in the log. is a crash in replication that I created another ticket for.

SERVER-3891

Comment by Pete Brumm [ 16/Sep/11 ]

we made 2 master and sent repairDatabase command

It took 7 seconds in a 25gb db

Fri Sep 16 11:55:33 is when the command was sent.

this is the log

mongodb_monru02.zip

Comment by Eliot Horowitz (Inactive) [ 14/Sep/11 ]

Can you update the logs?
There were a number of changes.

Comment by Pete Brumm [ 14/Sep/11 ]

we upgraded to latest 2.0 release. and when we do a repairDatabase . the results return in seconds instead of the many minutes.

does mongodb doing something differently on repair?

Comment by Pete Brumm [ 09/Sep/11 ]

we will give it a shot. We won't be able to kick it off until monday though.

Comment by Eliot Horowitz (Inactive) [ 09/Sep/11 ]

Can you run with -vvvvvvv

Comment by Pete Brumm [ 09/Sep/11 ]

we can reproduce this pretty easily

if you have a debug build we could try it and report back

Comment by Pete Brumm [ 07/Sep/11 ]

disk space was not an issue

here is the crash from the windows system log

Faulting application name: mongod.exe, version: 0.0.0.0, time stamp: 0x4e607d02
Faulting module name: mongod.exe, version: 0.0.0.0, time stamp: 0x4e607d02
Exception code: 0x40000015
Fault offset: 0x0000000000287635
Faulting process id: 0xe2c
Faulting application start time: 0x01cc6cb7e2182d95
Faulting application path: c:\mongodb\bin\mongod.exe
Faulting module path: c:\mongodb\bin\mongod.exe
Report Id: 0b08ce59-d8ac-11e0-b471-180373f1d279

Comment by Eliot Horowitz (Inactive) [ 06/Sep/11 ]

Was there an error in the system log or anywhere else when it crashed?
In the log you send, it just seems to go away.
Also, can you verify there is enough disk space

Comment by Pete Brumm [ 06/Sep/11 ]

here are the logs with the crash.

mongodb_logs_20110906.4-crash.zip

it starts at 12:10 on 03.

the other logs are from other replica set boxes and start a little earlier.

we made 03 master and issued the repairDatabase command. it took a few minutes then crashed. and because of a missing file in the repair it continually restarts after first crash.

we also are using jumbo frames on our switches and run on a vlan

Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ]

I'm not seeing any errors in that log, are things working right now?

Comment by Pete Brumm [ 02/Sep/11 ]

this is a log of starting up 2 to sync from 1 and issuing to 1 a stepdown.

then we opened a console to 2 and issued the repairdatabase

I need to look at the logs. this may not show the restart. just the recursive lock. will reproduce more on tuesday

Comment by Eliot Horowitz (Inactive) [ 02/Sep/11 ]

Can you wipe the database again, and send the entire log from when it starts to resync till it hits an error.

Comment by Pete Brumm [ 02/Sep/11 ]

even after stopping 1 and copying files to 2 and restarting

after making 2 master

and running repairdatabase it returns this

          • SERVER RESTARTED *****

Fri Sep 02 13:45:56 BackgroundJob starting: DataFileSync
Fri Sep 02 13:45:56 [initandlisten] MongoDB starting : pid=6972 port=27017 dbpath=M:\mongodb\Data\db 64-bit host=MONRU02
Fri Sep 02 13:45:56 [initandlisten] db version v2.0.0-rc1, pdfile version 4.5
Fri Sep 02 13:45:56 [initandlisten] git version: b5dd192afa9ad59e45b660f7193e0d80c59dc6d5
Fri Sep 02 13:45:56 [initandlisten] build info: windows (6, 1, 7601, 2, 'Service Pack 1') BOOST_LIB_VERSION=1_42
Fri Sep 02 13:45:56 [initandlisten] options:

{ config: "c:\mongodb\bin\mongodb.conf", dbpath: "M:\mongodb\Data\db", journal: "true", logappend: "true", logpath: "M:\mongodb\logs\mongodb.log", oplogSize: 16384, pidfilepath: "C:\mongodb\bin\mongodb.pid", replSet: "prod_rudy", rest: "true", service: true, verbose: "true" }

Fri Sep 02 13:45:56 [initandlisten] journal dir=M:/mongodb/Data/db/journal
Fri Sep 02 13:45:56 [initandlisten] recover begin
Fri Sep 02 13:45:56 [initandlisten] recover lsn: 192033
Fri Sep 02 13:45:56 [initandlisten] recover M:/mongodb/Data/db/journal/j._2
Fri Sep 02 13:45:56 [initandlisten] recover skipping application of section seq:115311 < lsn:192033
Fri Sep 02 13:45:56 [initandlisten] recover skipping application of section seq:153781 < lsn:192033
Fri Sep 02 13:45:56 [initandlisten] recover M:/mongodb/Data/db/journal/j._3
Fri Sep 02 13:45:56 [initandlisten] exception during recovery
Fri Sep 02 13:45:56 [initandlisten] exception in initAndListen std::exception: boost::filesystem::file_size: The system cannot find the path specified: "M:\mongodb\Data\db\$tmp_repairDatabase_0\pr_blue_spruce.ns", terminating
Fri Sep 02 13:45:56 dbexit:
Fri Sep 02 13:45:56 [initandlisten] shutdown: going to close listening sockets...
Fri Sep 02 13:45:56 [initandlisten] shutdown: going to flush diaglog...
Fri Sep 02 13:45:56 [initandlisten] shutdown: going to close sockets...
Fri Sep 02 13:45:56 [initandlisten] shutdown: waiting for fs preallocator...
Fri Sep 02 13:45:56 [initandlisten] shutdown: lock for final commit...
Fri Sep 02 13:45:56 [initandlisten] shutdown: final commit...
Fri Sep 02 13:45:56 [initandlisten] shutdown: closing all files...
Fri Sep 02 13:45:56 [initandlisten] closeAllFiles() finished
Fri Sep 02 13:45:56 [initandlisten] shutdown: removing fs lock...
Fri Sep 02 13:45:56 dbexit: really exiting now

Generated at Thu Feb 08 03:03:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.