[SERVER-21744] Clients may fail to discover new primaries when clock skew between nodes is greater than electionTimeout Created: 02/Dec/15 Updated: 25/Jan/17 Resolved: 13/Jan/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.0-rc5 |
| Fix Version/s: | 3.2.3, 3.3.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matt Dannenberg | Assignee: | Siyuan Zhou |
| Resolution: | Done | Votes: | 0 |
| Labels: | code-and-test |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Minor Change |
| Operating System: | ALL |
| Backport Completed: | |
| Sprint: | Repl D (12/11/15), Repl E (01/08/16), Repl F (01/29/16) |
| Participants: | |
| Description |
|
Assume a replica set contains two nodes, A and B, and that node A's clock is X seconds ahead of node B's clock. If node A is elected and node B is then elected within X seconds of that election, node B's electionId will be less than node A's electionId, since by the clocks it happened "earlier." A client that ranks primaries by electionId can then treat node B as stale and fail to discover it as the new primary. |
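The scenario can be illustrated with a minimal sketch. It assumes an OID-shaped electionId whose first four bytes are big-endian wall-clock seconds; the helper `make_election_id`, the skew value, and the timestamps are hypothetical and are not taken from the server code.

```python
import struct

def make_election_id(wall_clock_seconds: int, counter: int = 0) -> bytes:
    """Hypothetical 12-byte id: 4-byte big-endian seconds, then 8 filler bytes."""
    return struct.pack(">IQ", wall_clock_seconds, counter)

SKEW = 30                  # X: node A's clock runs 30 s ahead of node B's (assumed)
true_time = 1_449_000_000  # arbitrary "real" time in seconds

# Node A is elected first; its clock reads true_time + SKEW.
id_a = make_election_id(true_time + SKEW)

# Node B is elected 10 s later (within X seconds); its clock shows the real time.
id_b = make_election_id(true_time + 10)

# Byte-wise comparison, as a client comparing the two ids would do in this sketch.
newer_looks_older = id_b < id_a
print(newer_looks_older)
```

In this sketch the later election (node B) produces the smaller id, which is the ordering problem the description refers to.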
| Comments |
| Comment by Githook User [ 12/Jan/16 ] |
|
Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>
Message: Reset election id on PV upgrade and downgrade. (cherry picked from commit 1c28e37982441275cc127853985b30f2c6e74ff5) |
| Comment by Githook User [ 11/Jan/16 ] |
|
Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>
Message: Reset election id on PV upgrade and downgrade. |
| Comment by Eric Milkie [ 04/Dec/15 ] |
|
A refinement to my idea: we can update the electionId twice, to avoid the increase in failover time. Immediately after being elected, a node can set the electionId time to be the time of the last committed op it currently has. Then, when it succeeds in committing its first op written, it can update the electionId time again. |
| Comment by Matt Dannenberg [ 04/Dec/15 ] |
|
david.golden A potential flaw with your proposed solution:
|
| Comment by Eric Milkie [ 04/Dec/15 ] |
|
Jeff, I believe you are correct; that's indeed a flaw. |
| Comment by Jeffrey Yemin [ 04/Dec/15 ] |
Using the global optime generator, could there be split-brain situations which result in a client seeing election ids go back in time? For instance, a primary could broadcast an election id based on its view of the oplog, and then a new primary is elected which has no knowledge of that optime and broadcasts an older election id. A client that can reach both the old and the new primary could witness both election ids and therefore fail to recognize the new primary. |
| Comment by Eric Milkie [ 03/Dec/15 ] |
|
I'm sorry if I missed it, but what was the argument against my original solution? |
| Comment by Andy Schwerin [ 03/Dec/15 ] |
|
I think we should take milkie's suggestion of setting the high bits for pv1 election ids to 0xFFFF. We can then make a minor modification to the SDAM spec that says if you see two primaries, one with the high bits set to 0xFFFF and one without, you choose the one with the higher value in the ismaster setVersion field. Drivers and mongoses that don't adopt this change won't be able to do a rolling downgrade of protocol version from 1 to 0, but ones that do will be fine, and everyone can do a rolling upgrade from pv0 to pv1. |
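The selection rule in this comment might look roughly like the sketch below. Everything here is an assumption made for illustration: the `PrimaryReport` type, reading "high bits" as the first two bytes of a 12-byte id, falling back to a plain electionId comparison when both ids are of the same kind, and keeping the first candidate on a tie. It is not the SDAM spec text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PrimaryReport:
    """Hypothetical view of what a client learns from a primary's ismaster reply."""
    election_id: bytes  # 12 bytes, OID-shaped
    set_version: int

def pv1_election_id(term: int) -> bytes:
    # Assumed layout: 0xFFFF marker in the two high bytes, term in the remaining ten.
    return b"\xff\xff" + term.to_bytes(10, "big")

def has_pv1_marker(election_id: bytes) -> bool:
    return election_id[:2] == b"\xff\xff"

def prefer(a: PrimaryReport, b: PrimaryReport) -> PrimaryReport:
    """Pick which of two claimed primaries to trust."""
    if has_pv1_marker(a.election_id) != has_pv1_marker(b.election_id):
        # Mixed protocol versions: use setVersion, as the comment proposes.
        return a if a.set_version >= b.set_version else b
    # Same kind of id: assumed fallback to comparing the ids directly.
    return a if a.election_id >= b.election_id else b

# Rolling downgrade: the pv0 primary appears after a reconfig that bumped setVersion.
pv1_primary = PrimaryReport(pv1_election_id(5), set_version=2)
pv0_primary = PrimaryReport((1_449_000_000).to_bytes(4, "big") + bytes(8), set_version=3)
chosen = prefer(pv1_primary, pv0_primary)
```

Here `chosen` is the pv0 primary even though its electionId compares lower byte-wise, because the mixed-version branch looks only at `set_version`.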
| Comment by Andy Schwerin [ 02/Dec/15 ] |
|
If we make downgrade the tricky direction, maybe we can also train the drivers to recover from it. |
| Comment by Eric Milkie [ 02/Dec/15 ] |
|
We can make the electionId simply the term and give it a BSON type of OID, but it means the first election you have after upgrading to pv1 will not trigger drivers to use the new primary. |
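A small sketch of why the first pv1 election would be missed. It assumes a pv0 id carries wall-clock seconds in its first four bytes and that the pv1 id is the bare term written as a 12-byte big-endian integer; both layouts and both helper names are hypothetical.

```python
import struct

def pv0_election_id(wall_clock_seconds: int) -> bytes:
    # Assumed pv0 layout for this sketch: seconds in the first four bytes.
    return struct.pack(">I", wall_clock_seconds) + bytes(8)

def pv1_election_id(term: int) -> bytes:
    # Hypothetical encoding: the term as a 12-byte big-endian integer.
    return term.to_bytes(12, "big")

old_primary = pv0_election_id(1_449_000_000)
new_primary = pv1_election_id(1)  # first election after upgrading to pv1

# A driver that only accepts a primary with a greater electionId ignores the new one.
driver_accepts_new_primary = new_primary > old_primary
print(driver_accepts_new_primary)
```

Later terms still order correctly against each other under this encoding; only the transition from timestamp-based ids to term-based ids is affected.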
| Comment by Andy Schwerin [ 02/Dec/15 ] |
|
For the new replication protocol (PV1), the election id should just be a clever encoding of the term number that is "oid shaped", no? That is, it simply should not include wall clock time at all. |
| Comment by Eric Milkie [ 02/Dec/15 ] |
|
Proposed solution: use the time from the global optime generator, instead of the system time, for the first four bytes of the OID. This should result in an ever-increasing electionId, since optimes are guaranteed to be ever-increasing.
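As a rough illustration of the property this proposal relies on, the sketch below uses a hypothetical `MonotonicSeconds` class as a stand-in for an optime generator. It models a single shared sequence only and is not the server's generator; it says nothing about nodes that have not seen each other's optimes, which is the case questioned elsewhere in this ticket.

```python
import struct

class MonotonicSeconds:
    """Hypothetical stand-in for an optime generator: values never decrease."""

    def __init__(self) -> None:
        self._last = 0

    def next(self, wall_clock_seconds: int) -> int:
        # Use the wall clock unless that would move backwards; then step past the last value.
        self._last = max(wall_clock_seconds, self._last + 1)
        return self._last

def election_id_from(seconds: int) -> bytes:
    # Assumed layout: generator seconds in the first four bytes, eight filler bytes.
    return struct.pack(">I", seconds) + bytes(8)

gen = MonotonicSeconds()
first = election_id_from(gen.next(1_449_000_030))
second = election_id_from(gen.next(1_449_000_010))  # wall clock reads 20 s earlier
print(second > first)
```

Because the id is derived from the generator rather than the raw clock, the second id still compares greater in this sketch even though the wall-clock reading went backwards.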