[SERVER-3891] crash on slave replication Created: 16/Sep/11 Updated: 30/Mar/12 Resolved: 01/Nov/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Pete Brumm | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | replicaset, replication, stale | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
windows 64bit san jumbo frames 48gb ram |
||
| Operating System: | Windows |
| Description |
|
We have a replica set of three members: 1, 2, and 3. Member 1 was primary; 2 and 3 had gone stale. We shut down 2 and 3, and they started syncing. Member 3 finished fine in 15 minutes. Member 2 took longer: it crashed, started over, and then finished. It took a couple of hours to complete. Here is the log for member 2, with the crash at Fri Sep 16 10:30:59:

Fri Sep 16 10:30:59 [conn15] command admin.$cmd command: { replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru01.colo.rrgroup.com:27017" } ntoreturn:1 reslen:125 0ms
Fri Sep 16 10:30:59 [conn14] command admin.$cmd command: { replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru03.colo.rrgroup.com:27017" } ntoreturn:1 reslen:125 0ms
Fri Sep 16 10:31:01 [conn15] command admin.$cmd command: { replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru01.colo.rrgroup.com:27017" } ntoreturn:1 reslen:125 0ms
Fri Sep 16 10:31:01 [conn14] command admin.$cmd command: { replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru03.colo.rrgroup.com:27017" } ntoreturn:1 reslen:125 0ms
Fri Sep 16 10:31:06 [conn15] run command admin.$cmd { replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru01.colo.rrgroup.com:27017" }
Fri Sep 16 10:31:06 [conn14] command admin.$cmd command: { replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru03.colo.rrgroup.com:27017" } ntoreturn:1 reslen:125 0ms |
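One way to narrow down where a stall or restart happened in a log like the one above is to look for unusually long silences between consecutive timestamps. The following is a minimal sketch, not part of the original report: the function name `find_gaps`, the 3-second threshold, and the `year` parameter are all hypothetical choices for illustration (these log lines carry no year, so one has to be supplied). A gap does not prove a crash; it only points at a time window worth comparing against the Windows event log.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp prefix seen in the excerpt, e.g. "Fri Sep 16 10:31:06".
# Captures month, day, and time; the weekday is matched but not used.
STAMP = re.compile(
    r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) ([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}:\d{2}:\d{2})"
)


def find_gaps(log_text, threshold_seconds=3, year=2011):
    """Return (previous, current) timestamp pairs more than threshold_seconds apart.

    threshold_seconds and year are assumptions: the heartbeats in the excerpt
    arrive roughly every 2 seconds, and the log lines do not include a year.
    """
    stamps = [
        datetime.strptime(f"{year} {mon} {day} {clock}", "%Y %b %d %H:%M:%S")
        for mon, day, clock in STAMP.findall(log_text)
    ]
    limit = timedelta(seconds=threshold_seconds)
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > limit]


# Timestamps and connection tags taken from the excerpt; message bodies elided.
SAMPLE = """\
Fri Sep 16 10:30:59 [conn15] ...
Fri Sep 16 10:30:59 [conn14] ...
Fri Sep 16 10:31:01 [conn15] ...
Fri Sep 16 10:31:01 [conn14] ...
Fri Sep 16 10:31:06 [conn15] ...
Fri Sep 16 10:31:06 [conn14] ...
"""

if __name__ == "__main__":
    for before, after in find_gaps(SAMPLE):
        seconds = int((after - before).total_seconds())
        print(f"{before} -> {after} ({seconds}s of silence)")
```

Run against the excerpt's timestamps, this flags only the interval between 10:31:01 and 10:31:06, which is the window around the reported crash.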
| Comments |
| Comment by Spencer Brody (Inactive) [ 01/Nov/11 ] |
|
That certainly could be the problem. If you haven't seen it already, you should consider setting up MMS: http://www.10gen.com/mongodb-monitoring-service. If MMS had been set up, we could have confirmed whether the cause was a memory leak, because we would have had visibility into the amount of memory consumed by MongoDB over time. Since it seems that switching over to Linux has made this go away, I'm going to resolve this issue. Feel free to reopen if the problem comes back. |
| Comment by Pete Brumm [ 01/Nov/11 ] |
|
The server didn't crash once we upgraded to 2.0 for 3751, but it also completed in a suspiciously short time, so I don't think it was resolved, just different. For this one, our environment has switched back to Linux, so we don't have hardware available to reproduce the issue. The issue that could be the cause of this one is 3911 (memory leak). |
| Comment by Spencer Brody (Inactive) [ 31/Oct/11 ] |
|
You mentioned in |
| Comment by Pete Brumm [ 20/Sep/11 ] |
|
With 3751 the server didn't crash on the repair; it just didn't seem to do anything. |
| Comment by Spencer Brody (Inactive) [ 19/Sep/11 ] |
|
It looks like the root problem is probably the same as in |
| Comment by Pete Brumm [ 19/Sep/11 ] |
|
@ 10:31AM Faulting application name: mongod.exe, version: 0.0.0.0, time stamp: 0x4e6cc48f
@ 11:04AM Faulting application name: mongod.exe, version: 0.0.0.0, time stamp: 0x4e6cc48f |
| Comment by Eliot Horowitz (Inactive) [ 17/Sep/11 ] |
|
Is there anything in the Windows log? |