[SERVER-9509] Replica Set SECONDARY Fails to Come Online Following Full Re-Sync Created: 30/Apr/13  Updated: 10/Dec/14  Resolved: 09/May/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.2, 2.4.3
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Adam Kirkton Assignee: Thomas Rueckstiess
Resolution: Cannot Reproduce Votes: 0
Labels: replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

All three servers: Windows 2008 R2 64-bit, Dual Quad-Core Intel Xeon X3450, 4 GB RAM, 183 GB Hard Drive


Attachments: Zip Archive documents-export-2013-04-29.zip    
Issue Links:
Depends
Operating System: Windows
Participants:

 Description   

I have a simple replica set with a primary, a secondary, and an arbiter. A few days ago I experienced a hardware failure that required provisioning a brand-new machine to replace one of the primary/secondary servers.

I set up the server with the same IPs as the previous server and started mongod with an empty data directory to allow it to perform a full re-sync.
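For reference, a minimal sketch of that procedure on Windows (the service name, paths, port, and replica-set name below are illustrative assumptions, not taken from this ticket):

:: stop the failed member, wipe its data directory, then restart it so it performs a full initial sync
net stop MongoDB
rmdir /S /Q "C:\data\db"
mkdir "C:\data\db"
mongod --replSet rs0 --dbpath "C:\data\db" --logpath "C:\data\mongod.log" --port 27017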

Each time I attempted the full re-sync, it successfully synced all of the data, but it failed at what I believe was the point where it applies the oplog before actually coming up as a SECONDARY. This happened on every attempt until I downgraded the secondary to 2.2.4, at which point (without having to fully re-sync again) everything came up as expected.

I have attached PRIMARY and SECONDARY logs covering the relevant timeframe, from the start of the sync through the failure. I also included the SECONDARY's log from after I downgraded to 2.2.4, showing what it did afterward.

Please let me know if I need to provide other information.



 Comments   
Comment by Adam Kirkton [ 08/May/13 ]

Thanks for the info, Thomas. After the initial sync completed and I re-upgraded, everything has been working fine. I was afraid there probably wasn't enough information to tell much of anything. I will definitely upgrade to 2.4.4 when it comes out.

Comment by Thomas Rueckstiess [ 08/May/13 ]

Hi Adam,

Thanks for reporting this issue. I've looked at the provided log files but couldn't find any conclusive reason why the initial sync failed on 2.4.3. It appears that after the initial cloning of documents completed, the primary repeatedly reported the secondary node as DOWN for roughly 30 seconds at a time:

Mon Apr 29 20:56:12.443 [rsHealthPoll] replSet member lfdb01.localnet:27017 is now in state DOWN
Mon Apr 29 20:56:30.430 [rsHealthPoll] replSet member lfdb01.localnet:27017 is now in state STARTUP2
Mon Apr 29 20:56:42.972 [rsHealthPoll] replSet member lfdb01.localnet:27017 is now in state DOWN
Mon Apr 29 20:57:00.959 [rsHealthPoll] replSet member lfdb01.localnet:27017 is now in state STARTUP2
Mon Apr 29 20:57:13.502 [rsHealthPoll] replSet member lfdb01.localnet:27017 is now in state DOWN
Mon Apr 29 20:57:31.488 [rsHealthPoll] replSet member lfdb01.localnet:27017 is now in state STARTUP2
...

This could be related to a bug affecting the Windows platform, SERVER-9242, which will be fixed in the upcoming version 2.4.4. If you look at the issues marked as duplicates of that ticket, you will see performance problems related to initial sync, mongodump, and finds. It's possible that this is a related symptom: after initially cloning the documents, the secondary took much longer to respond to heartbeats and was therefore repeatedly declared DOWN.
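A quick way to watch the member states and heartbeat times the primary is reporting while the sync runs (the host placeholder is illustrative; the fields come from the standard replSetGetStatus output):

:: run from any machine with the mongo shell; connect to the primary and print each member's state
mongo --host <primary-host> --port 27017 --eval "rs.status().members.forEach(function(m){ print(m.name + '  ' + m.stateStr + '  lastHeartbeat: ' + m.lastHeartbeat); })"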

Looking at MMS, it appears you are now running both nodes on 2.4.3 again and they are both healthy. Is this the case? Are you experiencing any problems at the moment?
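If it helps to double-check, the binary version each member is actually running can be read directly (the second hostname here is an illustrative placeholder):

mongo --host lfdb01.localnet --port 27017 --eval "print(db.version())"
mongo --host <other-member-host> --port 27017 --eval "print(db.version())"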

As your current version is still affected by SERVER-9242, I recommend that you upgrade to 2.4.4 once it is released.

Regards,
Thomas
