[SERVER-3891] crash on slave replication Created: 16/Sep/11  Updated: 30/Mar/12  Resolved: 01/Nov/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.0
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Pete Brumm Assignee: Spencer Brody (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: replicaset, replication, stale
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows 64-bit, SAN, jumbo frames, 48 GB RAM


Attachments: Zip Archive mongodb_monru02.zip    
Issue Links:
Related
related to SERVER-3751 mongodb crashing on repairDatabase Closed
Operating System: Windows
Participants:

 Description   

We have a replica set of three members: 1, 2, and 3.

Member 1 was primary.

Members 2 and 3 had gone stale.

We shut down 2 and 3, deleted their database contents, and started them back up.

They started syncing. Member 3 finished fine in 15 minutes.

Member 2 took longer: it crashed, started over, and then finished.

It took a couple of hours to complete.

Here is the log for member 2, with the crash at Fri Sep 16 10:30:59:

Fri Sep 16 10:30:59 [conn15] command admin.$cmd command:

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru01.colo.rrgroup.com:27017" }

ntoreturn:1 reslen:125 0ms
Fri Sep 16 10:30:59 [conn14] run command admin.$cmd

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru03.colo.rrgroup.com:27017" }

Fri Sep 16 10:30:59 [conn14] command admin.$cmd command:

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru03.colo.rrgroup.com:27017" }

ntoreturn:1 reslen:125 0ms
Fri Sep 16 10:31:00 [websvr] User Assertion: 13142:timeout getting readlock
Fri Sep 16 10:31:00 [websvr] Socket http response send() errno:0 The operation completed successfully. 192.168.16.35:6254
Fri Sep 16 10:31:00 unhandled windows exception
Fri Sep 16 10:31:00 ec=0xe06d7363
Fri Sep 16 10:31:01 [conn15] run command admin.$cmd

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru01.colo.rrgroup.com:27017" }

Fri Sep 16 10:31:01 [conn15] command admin.$cmd command:

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru01.colo.rrgroup.com:27017" }

ntoreturn:1 reslen:125 0ms
Fri Sep 16 10:31:01 [conn14] run command admin.$cmd

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru03.colo.rrgroup.com:27017" }

Fri Sep 16 10:31:01 [conn14] command admin.$cmd command:

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru03.colo.rrgroup.com:27017" }

ntoreturn:1 reslen:125 0ms
Fri Sep 16 10:31:06 [initandlisten] connection accepted from 10.99.130.82:61792 #16
Fri Sep 16 10:31:06 [conn14] run command admin.$cmd

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru03.colo.rrgroup.com:27017" }

Fri Sep 16 10:31:06 [conn15] run command admin.$cmd

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru01.colo.rrgroup.com:27017" }

Fri Sep 16 10:31:06 [conn14] command admin.$cmd command:

{ replSetHeartbeat: "prod_rudy", v: 4, pv: 1, checkEmpty: false, from: "monru03.colo.rrgroup.com:27017" }

ntoreturn:1 reslen:125 0ms
Fri Sep 16 10:31:06 [conn15] command admin.$cmd command: { replSetHeartb
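
For anyone triaging a similar log, the crash point can be located mechanically instead of by eye. The following is a minimal sketch, assuming the marker text `unhandled windows exception` and the `ec=0x...` line appear as they do in the excerpt above; the function name `find_crash` and the `context` parameter are illustrative and not part of MongoDB.

```python
import re

# Marker strings copied from the log excerpt; the function name is illustrative.
CRASH_MARKER = "unhandled windows exception"
EC_PATTERN = re.compile(r"\bec=(0x[0-9a-fA-F]+)")

def find_crash(lines, context=2):
    """Return (index, exception_code, preceding_lines) for the first crash marker, or None."""
    for i, line in enumerate(lines):
        if CRASH_MARKER in line:
            code = None
            # In the excerpt, the exception code is logged on the line after the marker.
            for follow in lines[i + 1 : i + 4]:
                match = EC_PATTERN.search(follow)
                if match:
                    code = int(match.group(1), 16)
                    break
            return i, code, lines[max(0, i - context) : i]
    return None

sample = [
    "Fri Sep 16 10:31:00 [websvr] User Assertion: 13142:timeout getting readlock",
    "Fri Sep 16 10:31:00 [websvr] Socket http response send() errno:0 The operation completed successfully. 192.168.16.35:6254",
    "Fri Sep 16 10:31:00 unhandled windows exception",
    "Fri Sep 16 10:31:00 ec=0xe06d7363",
]
result = find_crash(sample)
```

Run against the excerpt, this points at the `[websvr]` read-lock timeout as the last activity logged before the exception, which may or may not be related to the crash.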



 Comments   
Comment by Spencer Brody (Inactive) [ 01/Nov/11 ]

That certainly could be the problem. If you haven't seen it already, you should consider setting up MMS: http://www.10gen.com/mongodb-monitoring-service. If MMS had been set up, we would be able to confirm whether the cause was a memory leak, because we would have visibility into the amount of memory consumed by MongoDB over time.

Since it seems that switching over to Linux has made this go away, I'm going to resolve this issue. Feel free to reopen if the problem comes back.

Comment by Pete Brumm [ 01/Nov/11 ]

For 3751, the server didn't crash once we upgraded to 2.0, but it also completed in a suspiciously short time, so I don't think it was resolved, just different.

For this one, our environment has switched back to Linux, so we don't have hardware available to reproduce the issue.

The issue that could be the cause of this one is 3911 (memory leak).

Comment by Spencer Brody (Inactive) [ 31/Oct/11 ]

You mentioned in SERVER-3751 that upgrading to 2.0.0 fixed the bug you were seeing on repair database. Can you confirm the status of this ticket as well? Are you still seeing this occur, or has it gone away since upgrading to 2.0?

Comment by Pete Brumm [ 20/Sep/11 ]

With 3751, the server didn't crash on the repair; it just didn't seem to do anything.

Comment by Spencer Brody (Inactive) [ 19/Sep/11 ]

It looks like the root problem is probably the same as in SERVER-3751, as both show the same Windows exception.

Comment by Pete Brumm [ 19/Sep/11 ]

@ 10:31AM

Faulting application name: mongod.exe, version: 0.0.0.0, time stamp: 0x4e6cc48f
Faulting module name: mongod.exe, version: 0.0.0.0, time stamp: 0x4e6cc48f
Exception code: 0x40000015
Fault offset: 0x0000000000287765
Faulting process id: 0x1230
Faulting application start time: 0x01cc7485a9e3e081
Faulting application path: c:\mongodb\bin\mongod.exe
Faulting module path: c:\mongodb\bin\mongod.exe
Report Id: 9aefa344-e07d-11e0-8256-14feb5dc56e8

@ 11:04AM

Faulting application name: mongod.exe, version: 0.0.0.0, time stamp: 0x4e6cc48f
Faulting module name: mongod.exe, version: 0.0.0.0, time stamp: 0x4e6cc48f
Exception code: 0x40000015
Fault offset: 0x0000000000287765
Faulting process id: 0x1230
Faulting application start time: 0x01cc7485a9e3e081
Faulting application path: c:\mongodb\bin\mongod.exe
Faulting module path: c:\mongodb\bin\mongod.exe
Report Id: 9aefa344-e07d-11e0-8256-14feb5dc56e8
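
The two codes reported in this ticket (`ec=0xe06d7363` in the mongod log and `0x40000015` in the Windows event entries) can be split into the standard NTSTATUS-style bit fields. The sketch below does only that; the function names are hypothetical. For reference, `0xe06d7363` is the code the Microsoft Visual C++ runtime raises for a thrown C++ exception (its low three bytes spell "msc"), and `0x40000015` is `STATUS_FATAL_APP_EXIT`. Neither code identifies the root cause on its own.

```python
def decode_status(code):
    """Split a 32-bit Windows status/exception code into its bit fields."""
    return {
        "severity": (code >> 30) & 0x3,    # 0=success, 1=informational, 2=warning, 3=error
        "customer": (code >> 29) & 0x1,    # 1 = application-defined rather than Microsoft-defined
        "facility": (code >> 16) & 0xFFF,  # 12-bit facility field
        "code": code & 0xFFFF,             # low 16 bits
    }

def low_bytes_ascii(code):
    """Render the low three bytes as ASCII, which exposes the 'msc' tag."""
    return (code & 0xFFFFFF).to_bytes(3, "big").decode("ascii", errors="replace")

cpp_exception = decode_status(0xE06D7363)   # code from the mongod log
fatal_app_exit = decode_status(0x40000015)  # code from the Windows event entries
tag = low_bytes_ascii(0xE06D7363)
```

Here `cpp_exception` has the customer bit set and error severity, while `fatal_app_exit` decodes to informational severity with facility 0 and code 0x15.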

Comment by Eliot Horowitz (Inactive) [ 17/Sep/11 ]

Is there anything in the Windows log?
