[SERVER-20671] step down should resend heartbeats if secondaries are not caught up
Created: 28/Sep/15  Updated: 25/Jan/17  Resolved: 30/Sep/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.1.9

Type: Bug
Priority: Major - P3
Reporter: Benety Goh
Assignee: Benety Goh
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-20673 support enqueue-only operation for Gl... Closed
Related
related to SERVER-20832 step down command should restart hear... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: RPL A (10/09/15)
Participants:
Linked BF Score: 0

 Description   

When a primary is asked to step down without {force: true} and the secondaries are not caught up, it has to wait until the previously scheduled heartbeats run to obtain updated liveness information on the secondaries before it can complete the step down. With a long heartbeat interval, this may take a while. Restarting the heartbeats when the primary cannot step down immediately ensures that we get the most up-to-date information on the secondaries in the cluster.
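
The timing argument in the description can be illustrated with a toy model. This is only a sketch and not the server's implementation: the function names and parameters are hypothetical, time is in whole seconds, network round trips are ignored, and the primary is assumed to learn that a secondary has caught up only from a heartbeat sent at or after the moment the secondary actually caught up.

```python
import math


def first_heartbeat_at_or_after(start, interval, earliest):
    """Return the first time in start, start + interval, ... that is >= earliest."""
    if earliest <= start:
        return start
    periods = math.ceil((earliest - start) / interval)
    return start + periods * interval


def step_down_completion_time(request_time, caught_up_time, last_heartbeat,
                              interval, restart_heartbeats):
    """Toy model: time at which the primary learns the secondary is caught up.

    All parameters are hypothetical and exist only for this illustration.
    """
    # Only a heartbeat sent after both the step down request and the
    # secondary's catch-up can let the step down complete.
    needed = max(request_time, caught_up_time)
    if restart_heartbeats:
        # Heartbeats are restarted at the moment of the request.
        schedule_start = request_time
    else:
        # The primary waits for the heartbeat that was already scheduled.
        schedule_start = last_heartbeat + interval
    return first_heartbeat_at_or_after(schedule_start, interval, needed)


# Last heartbeat at t=0, secondary caught up at t=1, step down requested
# at t=2, heartbeat interval of 10 seconds.
print(step_down_completion_time(2, 1, 0, 10, restart_heartbeats=False))  # 10
print(step_down_completion_time(2, 1, 0, 10, restart_heartbeats=True))   # 2
```

In this scenario the secondary is already caught up when the request arrives, but the primary's view is stale; restarting the heartbeats removes the 8 second wait for the next scheduled heartbeat.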



 Comments   
Comment by Githook User [ 01/Oct/15 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-20671 removed StepDownRunner from ReplicationCoordinator test
Branch: master
https://github.com/mongodb/mongo/commit/27693c2c5261fbb7d848d2f1abfb33a390760773

Comment by Githook User [ 30/Sep/15 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-20671 step down restarts heartbeats before waiting for secondaries to catch up

This re-applies commit 3331d34e110f47b5ef27eff74c7c302483fcc8f9 and also fixes a race condition
in the StepDownCatchUp test case by using the non-blocking version of stepDown.
Branch: master
https://github.com/mongodb/mongo/commit/c4e2be33524776da70d77ada71eaf03ecb8e7d44

Comment by Githook User [ 30/Sep/15 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-20671 added non-blocking version of ReplicationCoordinator::stepDown()
Branch: master
https://github.com/mongodb/mongo/commit/e5fbc5fda5a0b65e994b17feed12cb6c00717acf

Comment by J Rassi [ 29/Sep/15 ]

3331d34e introduced a hang in StepDownTest::StepDownCatchUp. On my desktop, I was able to reproduce this hang in 4 of 500 runs of repl_coordinator_impl_test when compiling against this commit, and was unable to reproduce it in 500 runs when compiling against the parent commit. See also the recent hangs of the compile suite on Evergreen (seven linked tasks).

I've reverted this commit above. benety.goh, please investigate when you get a chance.
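
A reproduction rate like the one quoted above can be measured with a small harness that runs a command repeatedly and counts the runs that exceed a timeout. The sketch below is an assumption about how such a measurement might be scripted, not the procedure actually used; `count_hangs`, the timeout value, and the placeholder command are all hypothetical, and the real test binary path would have to be substituted.

```python
import subprocess
import sys


def count_hangs(command, runs, timeout_seconds):
    """Run `command` `runs` times and return how many runs hit the timeout."""
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                command,
                timeout=timeout_seconds,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before raising,
            # so a hung run does not leak into the next iteration.
            hangs += 1
    return hangs


if __name__ == "__main__":
    # Placeholder command; replace with the path to the unit test binary
    # (for example, a locally built repl_coordinator_impl_test).
    command = [sys.executable, "-c", "pass"]
    print(count_hangs(command, runs=5, timeout_seconds=30))
```

The timeout has to be comfortably longer than a normal run of the test, otherwise slow runs are miscounted as hangs.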

Comment by Githook User [ 29/Sep/15 ]

Author: Jason Rassi (jrassi) <rassi@10gen.com>

Message: Revert "SERVER-20671 step down restarts heartbeats before waiting for secondaries to catch up"

This reverts commit 3331d34e110f47b5ef27eff74c7c302483fcc8f9.
Branch: master
https://github.com/mongodb/mongo/commit/de6eab9a60f9643696b86621d008e2a22852a1b9

Comment by Andy Schwerin [ 28/Sep/15 ]

Please explain why in the description.

Comment by Githook User [ 28/Sep/15 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-20671 step down restarts heartbeats before waiting for secondaries to catch up
Branch: master
https://github.com/mongodb/mongo/commit/3331d34e110f47b5ef27eff74c7c302483fcc8f9

Generated at Thu Feb 08 03:54:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.