[SERVER-3751] mongodb crashing on repairDatabase Created: 02/Sep/11 Updated: 29/Feb/12 Resolved: 31/Jan/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.0-rc1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Pete Brumm | Assignee: | Mathias Stearn |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | crash, repair, replication | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
windows 64bit 24cpu 48gb ram san drive |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Operating System: | Windows | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
we have a replica set of 3. PRIMARY> db.repairDatabase(); ", so it looks like it crashed and failed over to 1 again looked at the logs and we see Fri Sep 02 12:55:45 [conn14] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: pr_blue_spruce.sessions top: { opid: 63, active: true, lockType: "write", waitingForLock: false, secs_running: 68, op: "query", ns: "pr_blue_spruce", query: { repairDatabase: 1.0 }, client: "127.0.0.1:53667", desc: "conn", msg: "index: (3/3) btree-middle", numYields: 0 } , client: "127.0.0.1:53667", desc: "conn", msg: "index: (3/3) btree-middle", numYields: 0 } , client: "127.0.0.1:53667", desc: "conn", msg: "index: (3/3) btree-middle", numYields: 0 } then it keeps failing on startup and we see this exception Fri Sep 02 12:55:57 [websvr] User Assertion: 13142:timeout getting readlock Fri Sep 02 12:55:58 [conn16] command admin.$cmd command: { replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru02.colo.rrgroup.com:27017" } ntoreturn:1 reslen:125 0ms we have tried wiping the db folder for 2 and having it resync a few times but the error doesn't go away. |
| Comments |
| Comment by Pete Brumm [ 30/Jan/12 ] |
|
I no longer have an environment to test/replicate this in. |
| Comment by Mathias Stearn [ 30/Jan/12 ] |
|
Have you been able to replicate this with 2.0-final or one of the dot releases? |
| Comment by Pete Brumm [ 16/Sep/11 ] |
|
also in the log. is a crash in replication that I created another ticket for. |
| Comment by Pete Brumm [ 16/Sep/11 ] |
|
we made 2 master and sent repairDatabase command It took 7 seconds in a 25gb db Fri Sep 16 11:55:33 is when the command was sent. this is the log mongodb_monru02.zip |
| Comment by Eliot Horowitz (Inactive) [ 14/Sep/11 ] |
|
Can you update the logs? |
| Comment by Pete Brumm [ 14/Sep/11 ] |
|
we upgraded to latest 2.0 release. and when we do a repairDatabase . the results return in seconds instead of the many minutes. does mongodb doing something differently on repair? |
| Comment by Pete Brumm [ 09/Sep/11 ] |
|
we will give it a shot. We won't be able to kick it off until monday though. |
| Comment by Eliot Horowitz (Inactive) [ 09/Sep/11 ] |
|
Can you run with -vvvvvvv |
| Comment by Pete Brumm [ 09/Sep/11 ] |
|
we can reproduce this pretty easily if you have a debug build we could try it and report back |
| Comment by Pete Brumm [ 07/Sep/11 ] |
|
disk space was not an issue here is the crash from the windows system log Faulting application name: mongod.exe, version: 0.0.0.0, time stamp: 0x4e607d02 |
| Comment by Eliot Horowitz (Inactive) [ 06/Sep/11 ] |
|
Was there an error in the system log or anywhere else when it crashed? |
| Comment by Pete Brumm [ 06/Sep/11 ] |
|
here are the logs with the crash. mongodb_logs_20110906.4-crash.zip it starts at 12:10 on 03. the other logs are from other replica set boxes and start a little earlier. we made 03 master and issued the repairDatabase command. it took a few minutes then crashed. and because of a missing file in the repair it continually restarts after first crash. we also are using jumbo frames on our switches and run on a vlan |
| Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ] |
|
I'm not seeing any errors in that log, are things working right now? |
| Comment by Pete Brumm [ 02/Sep/11 ] |
|
this is a log of starting up 2 to sync from 1 and issuing to 1 a stepdown. then we opened a console to 2 and issued the repairdatabase I need to look at the logs. this may not show the restart. just the recursive lock. will reproduce more on tuesday |
| Comment by Eliot Horowitz (Inactive) [ 02/Sep/11 ] |
|
Can you wipe the database again, and send the entire log from when it starts to resync till it hits an error. |
| Comment by Pete Brumm [ 02/Sep/11 ] |
|
even after stopping 1 and copying files to 2 and restarting after making 2 master and running repairdatabase it returns this
Fri Sep 02 13:45:56 BackgroundJob starting: DataFileSync Fri Sep 02 13:45:56 [initandlisten] journal dir=M:/mongodb/Data/db/journal |