[SERVER-5880] Server replication "hung" Created: 21/May/12 Updated: 08/Mar/13 Resolved: 29/Aug/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Aristarkh Zagorodnikov | Assignee: | Randolph Tan |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Linux 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64 |
||
| Attachments: |
|
| Operating System: | Linux |
| Participants: |
| Description |
|
Recently, after a series of server reboots, one of our servers stopped replicating data from the primary. Everything in the logs looks OK, except that it stalled in the "replset syncing to ..." phase after startup. Restarting either the primary or the secondary didn't help. I think the simplest solution would be to resync from scratch, but the database is large enough and the write volume intensive enough that full replication would take about 16-20 hours, which we would prefer not to spend right now. |
| Comments |
| Comment by Randolph Tan [ 29/Aug/12 ] |
|
Found nothing that stood out in the secondary logs. The links to MMS are not working, and in any case the event is old enough that I doubt MMS still has the data. I suspect that the primary got too busy and starved the secondary, preventing it from replicating. |
| Comment by Aristarkh Zagorodnikov [ 21/May/12 ] |
|
I've attached the log from the secondary. Tell me if you need the one from the primary. |
| Comment by Aristarkh Zagorodnikov [ 21/May/12 ] |
|
Log from the secondary (old content truncated to keep the log from the moment it became secondary after a full resync). |
| Comment by Eliot Horowitz (Inactive) [ 21/May/12 ] |
|
Can you send the full log? |
| Comment by Aristarkh Zagorodnikov [ 21/May/12 ] |
|
Server successfully continued replication after approximately one hour of waiting. While I consider the immediate problem resolved, I still think there is something unhealthy about replication making zero progress for such long periods of time. |