[SERVER-15387] Replica-set fails to elect primary after asymmetric network split blips
Created: 24/Sep/14  Updated: 18/Feb/15  Resolved: 18/Feb/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.3
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Anil Kumar
Assignee: Matt Dannenberg
Resolution: Cannot Reproduce
Votes: 0
Labels: elections
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Operating System: ALL
Participants:

 Description   

In certain scenarios, a replica set with member priorities can end up in a state where no PRIMARY can be elected until some election-causing event occurs.

For example, with data nodes A, B, and C and arbiters D and E (B and D being on the same machine), the following sequence of events caused this to occur:

Priority = 1 for A, C
Priority = 0.5 for B
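
For reference, the topology above can be sketched as a replica-set configuration document, written here as a Python dict. The replica-set name and host names are hypothetical placeholders; only the roles, the priorities, and the fact that B and D share a machine come from this report. The helper functions are illustrative and are not part of any MongoDB driver.

```python
# Hypothetical replica-set name and hosts; roles and priorities are from the report.
replica_set_config = {
    "_id": "rs0",  # assumed name
    "members": [
        {"_id": 0, "host": "host-a:27017", "priority": 1},        # A
        {"_id": 1, "host": "host-b:27017", "priority": 0.5},      # B
        {"_id": 2, "host": "host-c:27017", "priority": 1},        # C
        {"_id": 3, "host": "host-b:27018", "arbiterOnly": True},  # D, same machine as B
        {"_id": 4, "host": "host-e:27017", "arbiterOnly": True},  # E
    ],
}


def majority(config):
    """Votes needed to elect a primary, assuming every member has one vote."""
    return len(config["members"]) // 2 + 1


def data_members(config):
    """Members that hold data (everything that is not an arbiter)."""
    return [m for m in config["members"] if not m.get("arbiterOnly", False)]


print(majority(replica_set_config))           # votes required
print(len(data_members(replica_set_config)))  # data-bearing nodes
```

With five voting members a majority is three votes, so B together with arbiters D and E can form a majority without A or C, which is consistent with B being elected while A and C were unreachable.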

Timeline | A-State | B-State | C-State | Comments
T + 0 | Primary | Secondary | Secondary |
T + 1 | Not Reachable | Primary | Not Reachable | 'A' and 'C' not reachable from 'E'-Arb, 'B' elected Primary
T + 2 | Recovering | Secondary | Secondary |
T + 3 | Primary | Secondary | Secondary | 'B' stepped down because of lower priority, 'E'-Arb not able to see any primary
T + 4 | Not Reachable | Secondary | Not Reachable | 'A', 'C' not reachable from 'B' and 'D'-Arb
T + 5 | Not Reachable | Primary | Not Reachable | 'B' elected Primary, since it was not reachable from 'B', 'D'-Arb, 'E'-Arb
T + 6 | Secondary | Primary | Secondary | 'A' relinquished Primary since 'B' was more recently elected Primary, 'A' syncing to 'C'
T + 7 | Rollback | Primary | Secondary |
T + 8 | Recovering | Secondary | Secondary | 'A', while still Recovering, steps down 'B', which has lower priority and is only 4 seconds ahead of 'C'
T + 9 | Secondary | Secondary | Secondary | 'A' and 'C' do not stand for election because they are not the freshest, which implies that 'B' very likely has the latest optime (no rollback was seen when it was stepped down by 'A'). 'B' does not elect itself, reporting that 'E'-Arb will veto it for lower priority. 'B' is ahead of 'C' by a few seconds
T + XX | Down | Secondary | Primary | Shutting down 'A' causes a sync target change for 'C' / rollback, followed by a fresh election in the replica-set
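
The stuck state at T + 9 can be illustrated with a small model. This is a deliberately simplified sketch of the two conditions described in the timeline (a node does not stand unless it is the freshest, and a lower-priority node expects a veto); it is not MongoDB's election code. The Node class, the can_stand function, and the optime values are all hypothetical. The only figure taken from the report is that 'B' is about 4 seconds ahead of 'C'; 'A' is assumed to be level with 'C'.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    priority: float
    optime: int  # seconds; illustrative values only


def can_stand(candidate, nodes):
    """Apply the two simplified rules from the timeline; return (ok, reason)."""
    freshest = max(n.optime for n in nodes)
    if candidate.optime < freshest:
        return False, "not freshest"
    top_priority = max(n.priority for n in nodes)
    if candidate.priority < top_priority:
        return False, "would be vetoed for lower priority"
    return True, "eligible"


# State at T + 9: all three data nodes are Secondary, 'B' is 4 seconds ahead.
nodes = [Node("A", 1.0, 100), Node("B", 0.5, 104), Node("C", 1.0, 100)]
results = {n.name: can_stand(n, nodes) for n in nodes}

for name, (ok, reason) in results.items():
    print(name, ok, reason)
```

Under these assumptions no node is eligible: the freshest node has the lowest priority, and the higher-priority nodes are behind it. Removing 'A' or letting 'C' catch up to 'B' changes the inputs, which matches the observation that an external event was needed before a new election succeeded.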


 Comments   
Comment by Matt Dannenberg [ 18/Feb/15 ]

Without the missing pieces of information, we are having difficulty drawing any conclusions or reproducing this.

Feel free to reopen the ticket if this occurs again or the missing information is discovered.
