[SERVER-50318] Only restart scheduled heartbeats Created: 14/Aug/20  Updated: 29/Oct/23  Resolved: 10/Nov/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.8.0, 4.9.0, 4.4.4

Type: Bug Priority: Major - P3
Reporter: Xuerui Fa Assignee: Xuerui Fa
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-51513 Restart heartbeats for catchup should... Closed
is duplicated by SERVER-48793 Remove "targetIndex" from Replication... Closed
Problem/Incident
causes SERVER-51513 Restart heartbeats for catchup should... Closed
Related
is related to SERVER-29030 Announce new primary via heartbeat re... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Repl 2020-09-07, Repl 2020-09-21, Repl 2020-10-05, Repl 2020-10-19, Repl 2020-11-02, Repl 2020-11-16
Participants:
Linked BF Score: 19

 Description   

After SERVER-29030, we cancel our own heartbeat requests if we receive a heartbeat request that announces a new primary. Since we don't update our knowledge of the primary when we receive a heartbeat request, it seems possible for a node to keep scheduling and cancelling its heartbeat requests indefinitely. As a result, a node in initial sync may be unable to find a sync source because it never successfully receives 2N heartbeat responses from the other nodes, and eventually the node will shut down.
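To make the failure mode concrete, here is a minimal C++ sketch under simplifying assumptions. `HeartbeatResponse`, `countUsableResponses` and `enoughHeartbeatsForSyncSource` are hypothetical names, and the "2N" rule is modeled as needing at least two usable heartbeat responses per other member. Responses to heartbeats that were already cancelled (the "Received response to heartbeat, but the heartbeat was cancelled" log lines) are not counted, so a node whose requests are always cancelled never reaches the threshold.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record of one heartbeat response; `cancelled` mirrors the
// "Received response to heartbeat, but the heartbeat was cancelled" log line.
struct HeartbeatResponse {
    int targetId;
    bool cancelled;
};

// Only responses to heartbeats that were not cancelled are processed.
std::size_t countUsableResponses(const std::vector<HeartbeatResponse>& responses) {
    std::size_t usable = 0;
    for (const auto& r : responses) {
        if (!r.cancelled) {
            ++usable;
        }
    }
    return usable;
}

// Simplified "2N" rule (assumption): at least two usable responses per other
// member before a sync source may be chosen.
bool enoughHeartbeatsForSyncSource(std::size_t usableResponses, std::size_t numOtherMembers) {
    return usableResponses >= 2 * numOtherMembers;
}
```

For example, a round of 11 responses that all arrive after their heartbeats were cancelled (as in the d20031 log excerpt below) contributes nothing toward the 22 usable responses this model requires for a 12-node set.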



 Comments   
Comment by Githook User [ 19/Jan/21 ]

Author:

XueruiFa <xuerui.fa@mongodb.com>

Message: SERVER-50318: Only cancel scheduled heartbeats

(cherry picked from commit 23ae68b0fecde9f0484dc276f376697d91fcc344)
Branch: v4.4
https://github.com/mongodb/mongo/commit/bddffe1ebc6a14da54d47ebd3a1ba80bb2efaff8

Comment by Githook User [ 10/Nov/20 ]

Author:

XueruiFa <xuerui.fa@mongodb.com>

Message: SERVER-50318: Only cancel scheduled heartbeats
Branch: master
https://github.com/mongodb/mongo/commit/23ae68b0fecde9f0484dc276f376697d91fcc344

Comment by Githook User [ 13/Oct/20 ]

Author:

XueruiFa <xuerui.fa@mongodb.com>

Message: Revert "SERVER-50318: Only cancel scheduled heartbeats"

This reverts commit 379c0116b694d8d88ec096170e703fe3d0119e55.
Branch: master
https://github.com/mongodb/mongo/commit/8dbb92e85ff1480697baeef0cc56f6fb84f856a9

Comment by Githook User [ 29/Sep/20 ]

Author:

XueruiFa <xuerui.fa@mongodb.com>

Message: SERVER-50318: Only cancel scheduled heartbeats
Branch: master
https://github.com/mongodb/mongo/commit/379c0116b694d8d88ec096170e703fe3d0119e55

Comment by Xuerui Fa [ 22/Sep/20 ]

As an update, we decided that the correct approach is to keep track of the state of each heartbeat request. Restarting heartbeats will only restart heartbeats that are still scheduled; heartbeat requests that have already been sent out will not be cancelled. In addition to resolving the BF, this also optimizes heartbeat cancellation.
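A minimal sketch of the state-tracking idea, assuming a simple two-state model (`kScheduled`, `kSent`). The types and `restartScheduledHeartbeats` are hypothetical stand-ins for the coordinator's real heartbeat handles, and rescheduling is only indicated by a comment:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical two-state model of a heartbeat request.
enum class HeartbeatState { kScheduled, kSent };

struct HeartbeatHandle {
    int target;             // member the heartbeat is addressed to
    HeartbeatState state;   // still waiting to fire, or already on the wire
    bool cancelled = false;
};

// Restart only heartbeats that have not been sent yet. Requests that are
// already in flight are left alone so their responses can still be processed.
// Returns how many handles were cancelled.
std::size_t restartScheduledHeartbeats(std::vector<HeartbeatHandle>& handles) {
    std::size_t cancelled = 0;
    for (auto& h : handles) {
        if (h.state == HeartbeatState::kScheduled && !h.cancelled) {
            h.cancelled = true;  // a fresh heartbeat to h.target would be scheduled for "now" here
            ++cancelled;
        }
    }
    return cancelled;
}
```

With one request already sent and two still scheduled, only the two scheduled handles are cancelled; the sent request is left to complete.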

Comment by Xuerui Fa [ 10/Sep/20 ]

Thanks siyuan.zhou for finding why the interval was low: it was being set as part of replSetTest.initiate(). It seems like this error could occur for any situation where the initial syncing node's heartbeat interval is approximately equal to the primary's heartbeat interval, so we will implement a fix for this.

Our current proposal: when the node receives a heartbeat request from the primary announcing a primary different from the one we know about, we will restart only our heartbeat request to that primary. We can do this by changing the vector of heartbeat handles into a map from MemberId to heartbeat handle, so that we can cancel a specific heartbeat using the primary's MemberId.
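The map-based proposal could look roughly like this. `MemberId`, `HeartbeatHandle` and `HeartbeatTable` are hypothetical stand-ins (handles are plain integers here rather than executor callback handles), and the actual cancellation is only indicated by a comment:

```cpp
#include <map>

using MemberId = int;         // hypothetical stand-in for the member id type
using HeartbeatHandle = int;  // hypothetical stand-in for an executor callback handle

struct HeartbeatTable {
    std::map<MemberId, HeartbeatHandle> handles;
    HeartbeatHandle nextHandle = 1;

    // Restart the heartbeat to one member (e.g. the newly announced primary),
    // leaving heartbeats to every other member untouched. Returns false if no
    // heartbeat to that member is being tracked.
    bool restartHeartbeatTo(MemberId member) {
        auto it = handles.find(member);
        if (it == handles.end()) {
            return false;
        }
        // A real implementation would cancel it->second through the executor here.
        it->second = nextHandle++;  // handle of the freshly scheduled heartbeat
        return true;
    }
};
```

Keying by member makes the targeted restart a single lookup instead of a pass over every handle.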

One open question: will this proposal work for a 2-node replica set? I think the bug may still occur there. Say we have a primary P and a secondary S with approximately equal heartbeat intervals:

1. S is added to the repl set and enters initial sync
2. S receives a heartbeat request from P and updates its config
3. S sends a heartbeat request to P
4. Before S's heartbeat request can complete, it receives another heartbeat request from P and cancels its own heartbeat request.
5. Although S only needs two heartbeat responses from P, it won't be able to receive them, and so S will not be able to choose P as its sync source, and eventually initial sync will fail.
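The race in the steps above can be checked with a small discrete-event sketch. Everything here is an assumption for illustration: P's heartbeat requests arrive every `pIntervalMs`, S's own heartbeat takes `latencyMs` to get a response, and S waits `sIntervalMs` after a response before sending its next one. `Policy::kCancelAll` models the behavior described in this ticket, and `Policy::kCancelScheduledOnly` models restarting only heartbeats that have not been sent yet.

```cpp
#include <algorithm>
#include <limits>

enum class Policy { kCancelAll, kCancelScheduledOnly };

// Counts heartbeat responses that secondary S processes from primary P in
// [0, horizonMs). S sends its first heartbeat at t = 0.
int countResponses(Policy policy, int pIntervalMs, int latencyMs, int sIntervalMs, int horizonMs) {
    const int kNever = std::numeric_limits<int>::max();
    int responses = 0;
    bool inFlight = true;              // S's heartbeat to P has been sent
    int sentAt = 0;                    // when the in-flight heartbeat was sent
    int nextSendAt = kNever;           // when the scheduled heartbeat fires (if not in flight)
    int nextRequestAt = pIntervalMs;   // next heartbeat request from P

    while (true) {
        const int responseAt = inFlight ? sentAt + latencyMs : kNever;
        const int sendAt = inFlight ? kNever : nextSendAt;
        const int t = std::min({responseAt, sendAt, nextRequestAt});
        if (t >= horizonMs) {
            break;
        }
        if (t == responseAt) {
            // Response processed; schedule the next heartbeat normally.
            ++responses;
            inFlight = false;
            nextSendAt = t + sIntervalMs;
        } else if (t == sendAt) {
            inFlight = true;
            sentAt = t;
        } else {
            // P's request announces a primary S doesn't know about, so S restarts heartbeats.
            nextRequestAt += pIntervalMs;
            if (!inFlight || policy == Policy::kCancelAll) {
                // Cancel the scheduled (or, under kCancelAll, in-flight) heartbeat and send a fresh one now.
                inFlight = true;
                sentAt = t;
            }
            // Under kCancelScheduledOnly an in-flight request is left to complete.
        }
    }
    return responses;
}
```

With P's requests about 210ms apart (roughly what the logs show) and a 300ms round trip, `countResponses(Policy::kCancelAll, 210, 300, 2000, 10000)` returns 0, while the same call with `Policy::kCancelScheduledOnly` returns 24.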

Comment by Xuerui Fa [ 09/Sep/20 ]

It seems like the election timeout and heartbeat intervals are the defaults, according to the primary's heartbeat response. From the logs above:
"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000

I think we can confirm that the primary sends heartbeats too often, although I'm not sure yet why that is the case.

Comment by Siyuan Zhou [ 09/Sep/20 ]

What were the election timeout and heartbeat interval at that time? Can we confirm our theory that the primary sends heartbeats to the problematic node too often, so that the node cancels any of its own in-flight heartbeats that take longer than 200ms?

Comment by Xuerui Fa [ 08/Sep/20 ]

Looking at cases of successful heartbeats from other nodes, it seems they also schedule heartbeats earlier than expected. Node d20022 receives a heartbeat response from d20020 at 2020-06-25T09:24:10.111, yet it schedules its next heartbeat request for 2020-06-25T09:24:10.311, only 200ms later.

[js_test:replsettest_control_12_nodes] 2020-06-25T09:24:10.111+0000 d20022| {"t":{"$date":"2020-06-25T09:24:10.110+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615620, "ctx":"ReplCoord-3","msg":"Received response to heartbeat","attr":{"requestId":28,"target":"ip-10-122-18-31:20020","response":{"ok":1.0,"electionTime":{"$date":{"$numberLong":"6842213799693385730"}},"config":{"_id":"replsettest_control_12_nodes","version":4,"term":1,"members":[{"_id":0,"host":"ip-10-122-18-31:20020","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1},{"_id":1,"host":"ip-10-122-18-31:20021","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1},{"_id":2,"host":"ip-10-122-18-31:20022","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1},{"_id":3,"host":"ip-10-122-18-31:20023","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1}],"protocolVersion":1,"writeConcernMajorityJournalDefault":true,"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"catchUpTakeoverDelayMillis":30000,"getLastErrorModes":{},"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":{"$oid":"5ef46d33558ff97d78e28cef"}}},"state":1,"v":4,"configTerm":1,"set":"replsettest_control_12_nodes","term":1,"primaryId":0,"durableOpTime":{"ts":{"$timestamp":{"t":1593077049,"i":1}},"t":1},"durableWallTime":{"$date":"2020-06-25T09:24:09.798Z"},"opTime":{"ts":{"$timestamp":{"t":1593077049,"i":1}},"t":1},"wallTime":{"$date":"2020-06-25T09:24:09.798Z"},"$replData":{"term":1,"lastOpCommitted":{"ts":{"$timestamp":{"t":1593077048,"i":1}},"t":1},"lastCommittedWall":{"$date":"2020-06-25T09:24:08.465Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1593077048,"i":1}},"t":1},"configVersion":4,"configTerm":1,"replicaSetId":{"$oid":"5ef46d33558ff97d78e28cef"},"syncSourceIndex":-1,"isPrimary":true},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1593077049,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1593077049,"i":1}}}}}

[js_test:replsettest_control_12_nodes] 2020-06-25T09:24:10.111+0000 d20022| {"t":{"$date":"2020-06-25T09:24:10.111+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"ReplCoord-3","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20020","when":{"$date":"2020-06-25T09:24:10.311Z"}}}

Here is the logic for scheduling the next hb. I'm not sure yet why it calculates the interval to be 200ms; I will continue taking a look.

Comment by Xuerui Fa [ 03/Sep/20 ]

It seems like the average amount of time between heartbeats received from the primary is 0.2 seconds, which is significantly less than the standard heartbeat interval, 2s. I did some more spot checks in other places in the log, and on the whole, the interval seems to consistently be slightly more than 0.2 seconds.

siyuan.zhou and I discussed a potential solution last night, but we should first investigate to see why this heartbeat interval is so low.

Comment by Xuerui Fa [ 03/Sep/20 ]

The node cancelled all of its heartbeats at 2020-06-25T09:25:10.248:

 
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.248+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.236+00:00"},"s":"D2", "c":"REPL_HB",  "id":24096,   "ctx":"conn4","msg":"Processing heartbeat request","attr":{"from":"ip-10-122-18-31:20020","cmdObj":{"replSetHeartbeat":"replsettest_control_12_nodes","configVersion":12,"configTerm":1,"hbv":1,"from":"ip-10-122-18-31:20020","fromId":0,"term":1,"primaryId":0,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1593077080,"i":1}},"signature":{"hash":{"$binary":{"base64":"LG7nPk4sox4OJi6DakIbWSqezOw=","subType":"0"}},"keyId":6842213799693385732}},"$db":"admin"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.248+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.237+00:00"},"s":"I",  "c":"REPL",     "id":2903000, "ctx":"conn4","msg":"Restarting heartbeats after learning of a new primary","attr":{"myPrimaryId":"none","senderAndPrimaryId":0,"senderTerm":1}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.248+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.237+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615630, "ctx":"conn4","msg":"Cancelling all heartbeats"}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.240+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20020","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.240+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20021","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.241+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20022","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.241+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20023","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.242+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20024","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.242+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20025","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.242+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20026","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.242+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20027","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.243+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20028","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.243+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20029","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.243+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615618, "ctx":"conn4","msg":"Scheduling heartbeat","attr":{"target":"ip-10-122-18-31:20030","when":{"$date":"2020-06-25T09:25:10.240Z"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.249+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.244+00:00"},"s":"D2", "c":"REPL_HB",  "id":24097,   "ctx":"conn4","msg":"Generated heartbeat response","attr":{"from":"ip-10-122-18-31:20020","response":{"ok":1.0,"state":5,"v":12,"configTerm":1,"set":"replsettest_control_12_nodes","term":1,"durableOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1},"durableWallTime":{"$date":"1970-01-01T00:00:00.000Z"},"opTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1},"wallTime":{"$date":"1970-01-01T00:00:00.000Z"}}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.250+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.244+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-17","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1810,"target":"ip-10-122-18-31:20020"}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.250+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.244+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-4","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1811,"target":"ip-10-122-18-31:20021"}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.251+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.244+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-6","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1812,"target":"ip-10-122-18-31:20022"}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.251+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.245+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-15","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1813,"target":"ip-10-122-18-31:20024"}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.251+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.245+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-22","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1814,"target":"ip-10-122-18-31:20025"}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.251+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.246+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-21","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1815,"target":"ip-10-122-18-31:20026"}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.251+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.246+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-9","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1816,"target":"ip-10-122-18-31:20027"}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.251+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.246+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-10","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1817,"target":"ip-10-122-18-31:20028"}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.256+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.247+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-13","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1818,"target":"ip-10-122-18-31:20029"}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.256+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.247+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-7","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1819,"target":"ip-10-122-18-31:20030"}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.256+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.247+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615619, "ctx":"ReplCoord-12","msg":"Received response to heartbeat, but the heartbeat was cancelled","attr":{"requestId":1820,"target":"ip-10-122-18-31:20023"}}

The next occurrence of cancelling heartbeat requests happens at 2020-06-25T09:25:10.472:

[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.472+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.459+00:00"},"s":"D2", "c":"REPL_HB",  "id":24095,   "ctx":"conn4","msg":"Received heartbeat request","attr":{"from":"ip-10-122-18-31:20020","cmdObj":{"replSetHeartbeat":"replsettest_control_12_nodes","configVersion":12,"configTerm":1,"hbv":1,"from":"ip-10-122-18-31:20020","fromId":0,"term":1,"primaryId":0,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1593077080,"i":1}},"signature":{"hash":{"$binary":{"base64":"LG7nPk4sox4OJi6DakIbWSqezOw=","subType":"0"}},"keyId":6842213799693385732}},"$db":"admin"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.472+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.460+00:00"},"s":"I",  "c":"REPL", "id":2903000, "ctx":"conn4","msg":"Restarting heartbeats after learning of a new primary","attr":{"myPrimaryId":"none","senderAndPrimaryId":0,"senderTerm":1}}

And the next occurrence after that is at 2020-06-25T09:25:10.674:

[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.674+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.670+00:00"},"s":"D2", "c":"REPL_HB",  "id":24095,   "ctx":"conn4","msg":"Received heartbeat request","attr":{"from":"ip-10-122-18-31:20020","cmdObj":{"replSetHeartbeat":"replsettest_control_12_nodes","configVersion":12,"configTerm":1,"hbv":1,"from":"ip-10-122-18-31:20020","fromId":0,"term":1,"primaryId":0,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1593077080,"i":1}},"signature":{"hash":{"$binary":{"base64":"LG7nPk4sox4OJi6DakIbWSqezOw=","subType":"0"}},"keyId":6842213799693385732}},"$db":"admin"}}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.674+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.670+00:00"},"s":"I",  "c":"REPL",     "id":2903000, "ctx":"conn4","msg":"Restarting heartbeats after learning of a new primary","attr":{"myPrimaryId":"none","senderAndPrimaryId":0,"senderTerm":1}}
[js_test:replsettest_control_12_nodes] 2020-06-25T09:25:10.674+0000 d20031| {"t":{"$date":"2020-06-25T09:25:10.671+00:00"},"s":"D2", "c":"REPL_HB",  "id":4615630, "ctx":"conn4","msg":"Cancelling all heartbeats"}
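As a quick check on the roughly 0.2s interval, the gaps between the three "Restarting heartbeats after learning of a new primary" lines above can be computed directly. The helper `gapsMs` is hypothetical; its inputs are the log timestamps 09:25:10.237, 09:25:10.460 and 09:25:10.670, expressed as milliseconds past 09:25:00.

```cpp
#include <cstddef>
#include <vector>

// Differences between consecutive timestamps, in milliseconds.
std::vector<int> gapsMs(const std::vector<int>& timestampsMs) {
    std::vector<int> gaps;
    for (std::size_t i = 1; i < timestampsMs.size(); ++i) {
        gaps.push_back(timestampsMs[i] - timestampsMs[i - 1]);
    }
    return gaps;
}
```

`gapsMs({10237, 10460, 10670})` gives 223ms and 210ms, roughly a tenth of the configured `heartbeatIntervalMillis` of 2000.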

Comment by Xuerui Fa [ 31/Aug/20 ]

siyuan.zhou, what do you think of the possible solutions to this ticket? I think that during triage, Matthew mentioned we probably don't want to go with proposal 3.

Comment by Xuerui Fa [ 14/Aug/20 ]

Some possible approaches:

  1. Update our understanding of the primary after receiving a heartbeat request that announces a new primary
    • SERVER-29030 seems to imply that we want to avoid doing this, since we cancel our current heartbeat requests in order to send out a fresh round of heartbeat requests, whose responses should contain information on the new primary
  2. Avoid cancelling heartbeats after hearing about a new primary
    • It seems like we will lose the optimization done in SERVER-29030 if we do this
  3. Relax the 2N heartbeat constraint if we are in initial sync
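Approach 1 can be sketched as follows, assuming a hypothetical `NodeView` that only tracks the primary this node knows about and a hypothetical `countRestarts` helper. Without updating it, every request that announces the same primary triggers another restart; recording the announced primary after the first restart limits it to one.

```cpp
#include <cstddef>
#include <optional>

using MemberId = int;  // hypothetical stand-in for the member id type

struct NodeView {
    std::optional<MemberId> myPrimaryId;  // primary this node currently knows about
};

// Feeds `requests` heartbeat requests that all announce `announcedPrimary`.
// A request triggers a restart when it names a primary we don't know about.
// With `updatePrimary` (approach 1), the node records that primary after
// restarting. Returns the number of restarts.
std::size_t countRestarts(NodeView& node, MemberId announcedPrimary, std::size_t requests,
                          bool updatePrimary) {
    std::size_t restarts = 0;
    for (std::size_t i = 0; i < requests; ++i) {
        if (node.myPrimaryId != announcedPrimary) {
            ++restarts;
            if (updatePrimary) {
                node.myPrimaryId = announcedPrimary;
            }
        }
    }
    return restarts;
}
```

The catch noted above still applies: updating the primary from the request alone would skip the fresh round of heartbeat responses that SERVER-29030 relies on.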