[SERVER-5880] Server replication "hung" Created: 21/May/12  Updated: 08/Mar/13  Resolved: 29/Aug/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.5
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Aristarkh Zagorodnikov
Assignee: Randolph Tan
Resolution: Incomplete
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64


Attachments: secondary.log.gz
Operating System: Linux
Participants:

 Description   

Recently, after a series of server reboots, one of our servers stopped replicating data from the primary. Everything in the logs looks OK, except that the server stalled in the "replset syncing to ..." phase after startup. Restarting either the primary or the secondary didn't help. I think the simplest solution would be to resync from scratch, but the database is large enough and the write volume intensive enough that a full resync would take about 16-20 hours, which we would prefer not to spend right now.
MMS links follow.
Secondary in question: https://mms.10gen.com/host/detail/cfa77b8ec3d9ab5870a3a6892c43ba8f
Its primary: https://mms.10gen.com/host/detail/c070029ad468a0141a72471130b108df



 Comments   
Comment by Randolph Tan [ 29/Aug/12 ]

Found nothing that stood out in the secondary logs. The MMS links are not working, and in any case the event is old enough that I doubt MMS still has the data. I suspect that the primary got too busy and starved the secondary, preventing it from replicating.

Comment by Aristarkh Zagorodnikov [ 21/May/12 ]

I've attached the log from the secondary. Tell me if you need the one from the primary.

Comment by Aristarkh Zagorodnikov [ 21/May/12 ]

Log from the secondary (older content was truncated so that the log starts at the moment the server became a secondary after a full resync).

Comment by Eliot Horowitz (Inactive) [ 21/May/12 ]

Can you send the full log?

Comment by Aristarkh Zagorodnikov [ 21/May/12 ]

The server successfully resumed replication after approximately one hour of waiting. While I consider the immediate problem resolved, I still think there is something unhealthy about replication making zero progress for such long periods of time.
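
For reference, a stall like this can be spotted by comparing each secondary's last applied optime with the primary's. The following is a minimal Python sketch and is not taken from this ticket: it assumes a dict shaped like the output of the replSetGetStatus command (members[].name, stateStr, optimeDate). The host names, the sample data, the helper names replication_lag and stalled_members, and the 10-minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def replication_lag(status):
    """Return {member name: lag} for every SECONDARY, measured against the PRIMARY.

    `status` is assumed to look like replSetGetStatus output; only the
    `members[].name`, `stateStr` and `optimeDate` fields are used.
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def stalled_members(status, threshold=timedelta(minutes=10)):
    """Names of secondaries whose lag exceeds `threshold` (hypothetical default)."""
    return [name for name, lag in replication_lag(status).items() if lag > threshold]


# Hypothetical sample data, for illustration only.
sample_status = {
    "members": [
        {"name": "db1.example.net:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2012, 5, 21, 12, 0, 0)},
        {"name": "db2.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2012, 5, 21, 11, 0, 0)},
        {"name": "db3.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2012, 5, 21, 11, 59, 58)},
    ]
}

for name in stalled_members(sample_status):
    print("possibly stalled:", name)
```

A large lag alone does not distinguish a hung secondary from one that is merely slow; sampling the status twice and checking whether the secondary's optimeDate advanced between samples would be one way to tell the two apart.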

Generated at Thu Feb 08 03:10:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.