[SERVER-20832] step down command should restart heartbeats at most once Created: 07/Oct/15  Updated: 25/Jan/17  Resolved: 14/Oct/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Benety Goh
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-20964 convert stepdown_killop.js to use fai... Closed
is related to SERVER-20671 step down should resend heartbeats if... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: RPL A (10/09/15), Repl B (10/30/15)
Participants:

 Description   

While waiting for secondaries to catch up during a step down request, the primary appears to send heartbeats to the secondaries continuously. The step down command should restart the heartbeats at most once and then allow the replication coordinator to reschedule new heartbeats every "heartbeatIntervalMillis" milliseconds. This bug appears to have been introduced by SERVER-20671.
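
For reference, a minimal sketch of the intended behavior is below. The names (HeartbeatScheduler, stepDownWaitForCatchUp, secondariesCaughtUp) are hypothetical and stand in for the real ReplicationCoordinatorImpl machinery; the point is only to illustrate restarting heartbeats once before the catch-up wait rather than on every pass through the wait loop.

// Illustrative sketch only; hypothetical types, not the actual server API.
#include <chrono>
#include <thread>

struct HeartbeatScheduler {
    // In the real server the task executor reschedules a heartbeat to each
    // member every heartbeatIntervalMillis; here we only model the restart.
    void restartHeartbeats() { /* cancel and reschedule outstanding heartbeats */ }
};

bool stepDownWaitForCatchUp(HeartbeatScheduler& scheduler,
                            std::chrono::milliseconds catchUpPeriod,
                            bool (*secondariesCaughtUp)()) {
    // Restart heartbeats exactly once at the start of the wait, so secondaries
    // promptly learn that a step down is in progress.
    scheduler.restartHeartbeats();

    const auto deadline = std::chrono::steady_clock::now() + catchUpPeriod;
    while (std::chrono::steady_clock::now() < deadline) {
        if (secondariesCaughtUp()) {
            return true;  // safe to complete the step down
        }
        // The buggy behavior restarted heartbeats on every iteration of this
        // loop, flooding the secondaries; the fix relies on the regular
        // heartbeatIntervalMillis schedule instead.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return false;  // catch-up period expired without the secondaries catching up
}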

----------

Task Logs

[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.703-0400 d20010| 2015-10-07T16:57:49.703-0400 I COMMAND  [conn8] command admin.$cmd command: replSetStepDown { replSetStepDown: 60.0, secondaryCatchUpPeriodSecs: 60.0 } ntoreturn:1 ntoskip:0 keyUpdates:0 writeConflicts:0 numYields:0 reslen:150 locks:{ Global: { acquireCount: { r: 1, R: 1 } } } protocol:op_command 61264ms
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.704-0400 d20010| 2015-10-07T16:57:49.703-0400 I COMMAND  [conn7] command admin.$cmd command: isMaster { ismaster: 1.0 } ntoreturn:1 ntoskip:0 keyUpdates:0 writeConflicts:0 numYields:0 reslen:488 locks:{} protocol:op_command 1226ms
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.704-0400 d20010| 2015-10-07T16:57:49.703-0400 I REPL     [replExecDBWorker-2] transition to SECONDARY
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.704-0400 d20010| 2015-10-07T16:57:49.704-0400 I NETWORK  [conn7] end connection 127.0.0.1:61013 (6 connections now open)
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.705-0400 d20010| 2015-10-07T16:57:49.704-0400 I NETWORK  [conn8] end connection 127.0.0.1:61032 (6 connections now open)
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.705-0400 d20010| 2015-10-07T16:57:49.704-0400 I NETWORK  [conn11] end connection 208.52.191.216:49333 (6 connections now open)
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.705-0400 d20010| 2015-10-07T16:57:49.704-0400 I NETWORK  [conn14] end connection 208.52.191.216:49367 (5 connections now open)
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.705-0400 sh20277| {
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.705-0400 sh20277|   "ok" : 0,
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.705-0400 sh20277|   "errmsg" : "By the time we were ready to step down, we were already past the time we were supposed to step down until",
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.705-0400 sh20277|   "code" : 50
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:49.705-0400 sh20277| }
...
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:57.685-0400 2015-10-07T16:57:57.682-0400 E QUERY    [thread1] Error: [0] != [0] are equal : expected replSetStepDown to close the shell's connection :
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:57.685-0400 doassert@src/mongo/shell/assert.js:15:14
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:57.685-0400 assert.neq@src/mongo/shell/assert.js:119:5
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:57.685-0400 @jstests/replsets/stepdown_long_wait_time.js:95:5
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:57.685-0400 @jstests/replsets/stepdown_long_wait_time.js:10:2
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:57.685-0400
[js_test:stepdown_long_wait_time] 2015-10-07T16:57:57.685-0400 failed to load: jstests/replsets/stepdown_long_wait_time.js



 Comments   
Comment by Githook User [ 14/Oct/15 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-20832 use fail point instead of bridging to control replication in stepdown_long_wait_time.js
Branch: master
https://github.com/mongodb/mongo/commit/5eaf403f3659f7e4b6372be0f50d82863b704b47

Comment by Githook User [ 14/Oct/15 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-20832 step down does not need to call lockMMAPV1Flush() after acquiring global read lock
Branch: master
https://github.com/mongodb/mongo/commit/7d43b0dba28b4b8e0184579a75c3dddab9d86e1f

Comment by Githook User [ 10/Oct/15 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-20832 restart heartbeats at most once per step down command
Branch: master
https://github.com/mongodb/mongo/commit/69dec2fe8fed6d32ec4998ea7ec7ab063cb5b788

Comment by Benety Goh [ 09/Oct/15 ]

some failures observed in build. re-opening for further investigation

Comment by Githook User [ 09/Oct/15 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: Revert "SERVER-20832 use fail point instead of bridging to control replication in stepdown_long_wait_time.js"

This reverts commit cd4a48a4907a9c99d548dda67d27f6e798163141.
Branch: master
https://github.com/mongodb/mongo/commit/e99719a2352465067a80b721090524859d486ab8

Comment by Githook User [ 09/Oct/15 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-20832 use fail point instead of bridging to control replication in stepdown_long_wait_time.js
Branch: master
https://github.com/mongodb/mongo/commit/cd4a48a4907a9c99d548dda67d27f6e798163141
