[SERVER-23773] Replica set primary unexpectedly steps down for lower priority secondary
Created: 17/Apr/16  Updated: 16/May/16  Resolved: 16/May/16

| Status:             | Closed          |
| Project:            | Core Server     |
| Component/s:        | Replication     |
| Affects Version/s:  | 3.2.1           |
| Fix Version/s:      | None            |
| Type:               | Bug             |
| Priority:           | Critical - P2   |
| Reporter:           | Linar Savion    |
| Assignee:           | Kelsey Schubert |
| Resolution:         | Done            |
| Votes:              | 0               |
| Labels:             | None            |
| Remaining Estimate: | Not Specified   |
| Time Spent:         | Not Specified   |
| Original Estimate:  | Not Specified   |
| Operating System:   | ALL             |
| Description |

The replica set's lower-priority secondary, which had been having network issues, suddenly elects itself primary (even though its oplog is ~10 min old), causing the primary to step down for no apparent reason. This triggered a rollback once the main node regained the primary role, causing data loss at a critical moment.

The setup is two replicas (one with priority 1 and the other 0.5) and an arbiter node. It appears that the arbiter voted for the stale secondary even though the arbiter is on the same local network as the higher-priority node. The arbiter shows this line in the log:

build information:
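For context, a minimal shell sketch of the topology described above, assuming placeholder hostnames (node1, node2, arbiter are illustrative, not the hosts from this report):

    // Sketch only: initiate a set matching the reported topology --
    // two data-bearing members (priority 1 and 0.5) plus an arbiter.
    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "node1.example.net:27017", priority: 1 },
        { _id: 1, host: "node2.example.net:27017", priority: 0.5 },
        { _id: 2, host: "arbiter.example.net:27017", arbiterOnly: true }
      ]
    });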
| Comments |

| Comment by Kelsey Schubert [ 16/May/16 ] |

Hi linar-jether,

Thank you for your patience. I have carefully examined the logs and determined that this is expected behavior. In my response I'll be referring to the nodes by their aliases in the table below.

I'll describe the election that occurs at 2016-04-14T14:18:52, since it has the most surrounding context; the other elections appear to have the same root cause. Due to network issues, Node2 successfully connects to the Arbiter, but not to Node1. Consequently, when the Arbiter has to drop its pooled connection to Node1, Node2 starts an election, since it has not seen a primary (Node1) for over 10 seconds. Since neither Node2 nor the Arbiter is currently connected to Node1, Node2 wins the election. Eventually, Node1 is able to connect to Node2 and schedules a priority takeover.

All members of a replica set must be able to connect to every other member of the set to support replication. Since this condition was not met, the failover occurred. Unfortunately, since MongoDB cannot predict these network conditions, it must always follow its replication protocol.

For MongoDB-related support discussion, please post on the mongodb-users group or Stack Overflow with the mongodb tag. Questions about troubleshooting network connectivity issues, which would involve more discussion, are best posted on the mongodb-users group.

Kind regards,
| Comment by Linar Savion [ 17/Apr/16 ] |

May be related to the comments at the end of this issue: