[SERVER-36451] ContinuousStepdown with killing nodes can hang due to not being able to start the primary Created: 03/Aug/18  Updated: 29/Oct/23  Resolved: 28/Aug/18

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.6.9, 4.0.4, 4.1.3

Type: Bug Priority: Major - P3
Reporter: David Bradford (Inactive) Assignee: Jonathan Abrahams
Resolution: Fixed Votes: 0
Labels: tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-35383 Increase electionTimeoutMillis for th... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: TIG 2018-09-10
Participants:
Linked BF Score: 19
Story Points: 2

Description

The replica_sets_kill_primary_jscore_passthrough tests occasionally time out while waiting for a primary to be elected.

The tests increase the election timeout to 24 hours in order to control which node is the primary. However, this can lead to a situation where the primary has been killed and neither secondary was able to take over because both had stale oplogs. When the former primary is brought back up and attempts to step up, there is a chance it has not yet received heartbeat responses from the other nodes in the replica set and assumes they are down. The step-up then fails, and because no further election is attempted, the test eventually times out.

A possible solution would be to retry the step-up after some delay whenever it fails. This would give the secondaries more time to respond to the heartbeat requests.
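The retry idea can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the actual fix: the helper name retry_step_up, the StepUpError exception, and the try_step_up callable are all hypothetical, and the 60-second limit and 1-second delay are placeholder values rather than settings taken from resmoke.

```python
import time


class StepUpError(Exception):
    """Raised when the step-up did not succeed within the time limit."""


def retry_step_up(try_step_up, timeout_secs=60.0, delay_secs=1.0):
    """Call try_step_up() until it returns True or timeout_secs elapses.

    try_step_up is a hypothetical callable supplied by the caller. It should
    return True when the node became primary and False when the attempt
    failed, for example because the node had not yet received heartbeat
    responses from the other members.

    Returns the number of attempts that were made.
    """
    deadline = time.monotonic() + timeout_secs
    attempts = 0
    while True:
        attempts += 1
        if try_step_up():
            return attempts
        # Give up if waiting for another delay would go past the deadline.
        if time.monotonic() + delay_secs > deadline:
            raise StepUpError(
                "step-up failed after %d attempt(s) in %.1f seconds"
                % (attempts, timeout_secs))
        # Give the secondaries time to answer the heartbeat requests.
        time.sleep(delay_secs)
```

Bounding the loop by elapsed time rather than by an attempt count keeps the total wait predictable even when individual attempts take varying amounts of time; a count-based limit would be an equally reasonable variant.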



Comments
Comment by Githook User [ 03/Oct/18 ]

Author:

{'name': 'Jonathan Abrahams', 'email': 'jonathan@mongodb.com', 'username': 'hptabster'}

Message: SERVER-36451 ContinuousStepdown with killing nodes can hang due to not being able to start the primary

(cherry picked from commit 1a2599c31d296d79bf78f7b19305c1e983a8f858)
Branch: v4.0
https://github.com/mongodb/mongo/commit/1f5551f610f7f7ab75fcfe8479670bbf9ba04892

Comment by Githook User [ 27/Sep/18 ]

Author:

{'name': 'Jonathan Abrahams', 'email': 'jonathan@mongodb.com', 'username': 'hptabster'}

Message: SERVER-36451 ContinuousStepdown with killing nodes can hang due to not being able to start the primary

(cherry picked from commit 1a2599c31d296d79bf78f7b19305c1e983a8f858)
Branch: v3.6
https://github.com/mongodb/mongo/commit/cdda2ef1e7ae3aa6eecfb0a8f48bf62e2a9ab109

Comment by Githook User [ 28/Aug/18 ]

Author:

{'name': 'Jonathan Abrahams', 'email': 'jonathan@mongodb.com', 'username': 'hptabster'}

Message: SERVER-36451 ContinuousStepdown with killing nodes can hang due to not being able to start the primary
Branch: master
https://github.com/mongodb/mongo/commit/1a2599c31d296d79bf78f7b19305c1e983a8f858

Comment by Max Hirschhorn [ 28/Aug/18 ]

"I think waiting for 60 seconds should be sufficient."

I'm less sure about this because there was a failure in BF-10191 where we would effectively need to retry stepping up the former primary until rollback on at least one of the secondaries finished. I don't have confidence that would finish within 1 minute in all cases. I'd be more comfortable with something on the order of ReplFixture.AWAIT_REPL_TIMEOUT_MINS, which is 5 minutes.

Comment by David Bradford (Inactive) [ 28/Aug/18 ]

I think there should be a limit of some sort. I think either type of limit would be fine.

From what I saw investigating the BF, I would think that 60 seconds would be more than enough. 

Comment by Jonathan Abrahams [ 28/Aug/18 ]

david.bradford max.hirschhorn Is there a time limit or a number of retries we should add here? I think waiting for 60 seconds should be sufficient.
