[SERVER-9509] Replica Set SECONDARY Fails to Come Online Following Full Re-Sync Created: 30/Apr/13 Updated: 10/Dec/14 Resolved: 09/May/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.2, 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Adam Kirkton | Assignee: | Thomas Rueckstiess |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | replication | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
All three servers: Windows 2008 R2 64-bit, Dual Quad-Core Intel Xeon X3450, 4 GB, 183 GB Hard Drive |
||
| Operating System: | Windows | ||||
| Description |
|
I have a simple replica set with a primary, a secondary, and an arbiter. A hardware failure a few days ago required provisioning a brand new machine to replace one of the primary/secondary servers. I set up the new server with the same IPs as the previous server and started Mongo with an empty data directory so that it would perform a full re-sync. Each attempt successfully synced all the data but then failed, I believe at the point where it applies the oplog before actually coming up as a secondary. This happened every time until I downgraded the secondary to 2.2.4; at that point everything came up as expected, without another full re-sync. I have attached PRIMARY and SECONDARY logs covering the period from the start of the sync through the failure, along with the SECONDARY log from after the downgrade to 2.2.4 showing what it did next. Please let me know if I need to provide other information. |
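For anyone trying to reproduce this, it can help to record which members the set considers unhealthy while the re-sync is running. The following is only a minimal sketch: it assumes a status document shaped like the output of `replSetGetStatus` (the `members`, `name`, `health`, and `stateStr` fields) has already been fetched, for example with `rs.status()`. The helper name `unhealthy_members` and the host names in the sample are hypothetical and are not taken from the attached logs.

```python
# Hypothetical helper: the field names follow the replSetGetStatus output,
# but the function itself is an illustration, not part of MongoDB.
STEADY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def unhealthy_members(status):
    """Return (name, stateStr) for members that are down or not in a steady state."""
    problems = []
    for member in status.get("members", []):
        state = member.get("stateStr", "UNKNOWN")
        is_up = member.get("health", 0) == 1
        if not is_up or state not in STEADY_STATES:
            problems.append((member.get("name", "?"), state))
    return problems


# Made-up sample status document (host names are placeholders).
sample_status = {
    "set": "rs0",
    "members": [
        {"name": "host-a:27017", "health": 1, "stateStr": "PRIMARY"},
        {"name": "host-b:27017", "health": 0, "stateStr": "(not reachable/healthy)"},
        {"name": "host-c:27017", "health": 1, "stateStr": "ARBITER"},
    ],
}

print(unhealthy_members(sample_status))
```

A member that is still cloning or applying the oplog would normally be reported in a state such as STARTUP2 or RECOVERING, so it would also be listed here until it reaches SECONDARY.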
| Comments |
| Comment by Adam Kirkton [ 08/May/13 ] | |||||||
|
Thanks for the info, Thomas. After the initial sync completed and I re-upgraded, everything has been working fine. I was afraid that there probably wasn't enough info to tell much of anything. I will definitely upgrade to 2.4.4 when that comes out. | |||||||
| Comment by Thomas Rueckstiess [ 08/May/13 ] | |||||||
|
Hi Adam, Thanks for reporting this issue. I've looked at the provided log files but couldn't find any conclusive reason why the initial sync failed on 2.4.3. It seems that after the cloning of the initial documents completed, the primary repeatedly reported the secondary node as DOWN for roughly 30 seconds at a time:
This could be related to a bug affecting the Windows platform. Looking at MMS, it appears you are now running both nodes on 2.4.3 again and they are both healthy. Is this the case? Are you experiencing any problems at the moment? As your current version is still affected by Regards, |