[SERVER-21744] Clients may fail to discover new primaries when clock skew between nodes is greater than electionTimeout
Created: 02/Dec/15  Updated: 25/Jan/17  Resolved: 13/Jan/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.0-rc5
Fix Version/s: 3.2.3, 3.3.1

Type: Bug
Priority: Major - P3
Reporter: Matt Dannenberg
Assignee: Siyuan Zhou
Resolution: Done
Votes: 0
Labels: code-and-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
is documented by DOCS-8940 Clients may fail to discover new prim... Closed
Related
related to SERVER-21789 mongos replica set monitor should cho... Closed
related to DRIVERS-279 Use setVersion and electionId to dete... Closed
related to SERVER-18717 compose electionId in OID format from... Closed
is related to DRIVERS-228 Use electionId to detect stale primar... Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Backport Completed:
Sprint: Repl D (12/11/15), Repl E (01/08/16), Repl F (01/29/16)
Participants:

 Description   

Assume a set contains two nodes, A and B, and that node A's clock is X seconds ahead of node B's clock. If node A is elected and node B is then elected within X seconds of node A's election, node B's electionId will be less than node A's electionId, because by node B's clock its election happened "earlier." A client that compares electionIds to detect stale primaries may then fail to recognize node B as the new primary.
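
A minimal sketch of this scenario, assuming the electionId is an OID-shaped 12-byte value whose first four bytes hold the electing node's wall-clock seconds. The helper name, the filler bytes, and the concrete numbers are illustrative assumptions, not taken from the server code.

```python
import struct


def make_election_id(wall_clock_seconds: int) -> bytes:
    """Hypothetical OID-shaped id: 4-byte big-endian time plus 8 filler bytes."""
    return struct.pack(">I", wall_clock_seconds) + bytes(8)


TRUE_TIME = 1_449_000_000  # arbitrary reference instant, in seconds
SKEW_X = 30                # node A's clock runs 30 s ahead of node B's

# Node A is elected at TRUE_TIME; its own clock reads TRUE_TIME + SKEW_X.
id_a = make_election_id(TRUE_TIME + SKEW_X)

# Node B is elected 10 s later (within X seconds); its clock is not skewed.
id_b = make_election_id(TRUE_TIME + 10)

# Big-endian bytes compare in the same order as the embedded timestamps,
# so the later election carries the smaller id.
newer_election_looks_older = id_b < id_a
print(newer_election_looks_older)
```

With these numbers the later election (node B) produces the smaller id, which is the ordering the description warns about.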



 Comments   
Comment by Githook User [ 12/Jan/16 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-21744 ElectionID always increases under PV0 and PV1.

Reset election id on PV upgrade and downgrade.

(cherry picked from commit 1c28e37982441275cc127853985b30f2c6e74ff5)
Branch: v3.2
https://github.com/mongodb/mongo/commit/21a507148d36d9adabcf105ac87a34f4f2007821

Comment by Githook User [ 11/Jan/16 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-21744 ElectionID always increases under PV0 and PV1.

Reset election id on PV upgrade and downgrade.
Branch: master
https://github.com/mongodb/mongo/commit/1c28e37982441275cc127853985b30f2c6e74ff5

Comment by Eric Milkie [ 04/Dec/15 ]

A refinement to my idea: we can update the electionId twice, to avoid the increase in failover time. Immediately after being elected, a node can set the electionId time to be the time of the last committed op it currently has. Then, when it succeeds in committing its first op written, it can update the electionId time again.

Comment by Matt Dannenberg [ 04/Dec/15 ]

david.golden A potential flaw with your proposed solution:

  • Node A becomes PRIMARY with setVersion 2, which has protocolVersion 1 and receives an 0xFFFF electionId.
  • Reconfig changes setVersion to 3 and protocolVersion to 0.
  • Driver comes online and sees the 0xFFFF electionId with setVersion 3.
  • Node B is elected with setVersion 3 and new electionId based on the time.
  • Driver never acknowledges node B as primary.
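
The steps above can be sketched as follows, assuming the driver keeps the highest (setVersion, electionId) pair it has seen and treats any smaller pair as a stale primary. That comparison rule, the `PrimaryTracker` class, and the byte layouts of both ids are assumptions made for illustration.

```python
from typing import Optional, Tuple


class PrimaryTracker:
    """Hypothetical driver-side state: remembers the highest pair seen so far."""

    def __init__(self) -> None:
        self.max_seen: Optional[Tuple[int, bytes]] = None

    def accept(self, set_version: int, election_id: bytes) -> bool:
        candidate = (set_version, election_id)
        if self.max_seen is not None and candidate < self.max_seen:
            return False  # treated as a stale primary
        self.max_seen = candidate
        return True


# Assumed layouts: a pv1 id starts with 0xFFFF, a pv0 id starts with clock seconds.
pv1_id = b"\xff\xff" + (1).to_bytes(10, "big")
pv0_id = (1_449_000_000).to_bytes(4, "big") + bytes(8)

driver = PrimaryTracker()
a_accepted = driver.accept(3, pv1_id)  # node A, seen after the reconfig to setVersion 3
b_accepted = driver.accept(3, pv0_id)  # node B, elected under pv0 with a time-based id
print(a_accepted, b_accepted)
```

Because both reports carry setVersion 3, the tie falls through to the electionId, and the time-based id can never exceed one that begins with 0xFFFF.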

Comment by Eric Milkie [ 04/Dec/15 ]

Jeff, I believe you are correct; that's indeed a flaw.
If the new primary waited until its first logged entry was committed before setting and broadcasting the electionId, that would solve this. It would increase failover time significantly, but only for clients that were not doing w:majority writes. For w:majority writes, the failover time would be increased by a small amount (the time it takes to do the write on the primary, without waiting for replication).

Comment by Jeffrey Yemin [ 04/Dec/15 ]

"Proposed solution: use the time from the global optime generator instead of the system time to put in the OID first four bytes. This should result in an ever-increasing electionId, since optimes are guaranteed to be ever-increasing."

Using the global optime generator, could there be split-brain situations which result in a client seeing election ids go back in time? For instance, a primary could broadcast an election id based on its view of the oplog, and then a new primary is elected which has no knowledge of that optime and broadcasts an older election id. A client that can reach both the old and the new primary could witness both election ids and therefore fail to recognize the new primary.
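
A small sketch of that split-brain concern, assuming the electionId's leading four bytes are taken from the last optime in the electing node's own oplog. The oplog contents and the helper name are made up for illustration.

```python
def election_id_from_optime(optime_seconds: int) -> bytes:
    """Hypothetical OID-shaped id whose first four bytes are an optime."""
    return optime_seconds.to_bytes(4, "big") + bytes(8)


# Old primary A holds an entry (optime 105) that never reached node B.
oplog_a = [100, 101, 105]
oplog_b = [100, 101]

id_a = election_id_from_optime(oplog_a[-1])

# Node B is elected afterwards, knowing nothing about optime 105.
id_b = election_id_from_optime(oplog_b[-1])

# A client that has already seen id_a would regard id_b as older.
client_rejects_b = id_b < id_a
print(client_rejects_b)
```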

Comment by Eric Milkie [ 03/Dec/15 ]

I'm sorry if I missed it, but what was the argument against my original solution?

Comment by Andy Schwerin [ 03/Dec/15 ]

I think we should take milkie's suggestion of setting the high bits for pv1 election ids to 0xFFFF. We can then make a minor modification to the SDAM spec that says if you see two primaries, one with the high bits set to 0xFFFF and one without, you choose the one with the higher value in the ismaster setVersion field. Drivers and mongoses that don't adopt this change won't be able to do a rolling downgrade of protocol version from 1 to 0, but ones that do will be fine, and everyone can do a rolling upgrade from pv0 to pv1.
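
A rough sketch of the selection rule described in this comment, assuming "high bits" means the first two bytes of the electionId. The `PrimaryInfo` type and function names are hypothetical, and the equal-setVersion case is not addressed in the comment, so the sketch arbitrarily keeps the first argument there.

```python
from typing import NamedTuple


class PrimaryInfo(NamedTuple):
    name: str
    set_version: int
    election_id: bytes


def has_pv1_marker(election_id: bytes) -> bool:
    """True when the leading two bytes are 0xFFFF (assumed pv1 marker)."""
    return election_id[:2] == b"\xff\xff"


def choose_primary(a: PrimaryInfo, b: PrimaryInfo) -> PrimaryInfo:
    if has_pv1_marker(a.election_id) != has_pv1_marker(b.election_id):
        # Mixed protocol versions: the higher setVersion wins.
        # Ties are unspecified in the comment; this sketch keeps `a`.
        return b if b.set_version > a.set_version else a
    # Same protocol version: fall back to comparing electionIds.
    return b if b.election_id > a.election_id else a


pv1_primary = PrimaryInfo("A", 2, b"\xff\xff" + (7).to_bytes(10, "big"))
pv0_primary = PrimaryInfo("B", 3, (1_449_000_000).to_bytes(4, "big") + bytes(8))

chosen = choose_primary(pv1_primary, pv0_primary)
print(chosen.name)
```

Here the pv0 primary wins after a downgrade because its setVersion is higher, even though its electionId compares lower.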

Comment by Andy Schwerin [ 02/Dec/15 ]

If we make downgrade the tricky direction, maybe we can also train the drivers to recover from it.

Comment by Eric Milkie [ 02/Dec/15 ]

We can make the electionId simply the term and give it a BSON type of OID, but it means the first election you have after upgrading to pv1 will not trigger drivers to use the new primary.
Or, we could fill in the time bytes of the OID with FFFFFF and leave the rest of the bytes as the term, but this just means the first election after a downgrade to pv0 will not trigger drivers to use the new primary.

Comment by Andy Schwerin [ 02/Dec/15 ]

For the new replication protocol (PV1), the election id should just be a clever encoding of the term number that is "oid shaped", no? That is, it simply should not include wall clock time at all.
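
One possible "oid shaped" encoding, sketched under the assumption that the term fits in 64 bits and occupies the last eight of the twelve bytes with the first four left as zero. This layout and the function names are illustrative and not necessarily what the fix implemented.

```python
def term_to_election_id(term: int) -> bytes:
    """Encode a term as 12 bytes: 4 zero bytes followed by the big-endian term."""
    if not 0 <= term < 2**64:
        raise ValueError("term must fit in an unsigned 64-bit integer")
    return bytes(4) + term.to_bytes(8, "big")


def election_id_to_term(election_id: bytes) -> int:
    """Recover the term from the last eight bytes."""
    return int.from_bytes(election_id[4:], "big")


ids = [term_to_election_id(t) for t in (1, 2, 300)]
ids_are_increasing = ids == sorted(ids)
print(ids_are_increasing, election_id_to_term(ids[-1]))
```

Since no wall-clock time is involved, the ordering of the ids depends only on the term.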

Comment by Eric Milkie [ 02/Dec/15 ]

Proposed solution: use the time from the global optime generator instead of the system time to put in the OID first four bytes. This should result in an ever-increasing electionId, since optimes are guaranteed to be ever-increasing.
