[SERVER-36129] Concurrency stepdown suites should wait for replication of workload setups before starting stepdown thread
Created: 13/Jul/18  Updated: 29/Oct/23  Resolved: 14/Aug/18

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.0.2, 4.1.2

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Robert Guo (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0
Sprint: TIG 2018-08-27
Participants:
Linked BF Score: 23
Story Points: 3

 Description   

The concurrency stepdown suites wait until setup has been called for each workload before starting the stepdown thread, because the setup methods don't run with overridden majority read/write concern. However, the effects of each setup are not guaranteed to be majority committed at that point, so an immediate stepdown can still roll back part of the setup, such as the creation of the TTL index in indexed_insert_ttl.js.

A fix for this would be waiting for replication on all shards and the config server before starting the stepdown thread.
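
The ordering proposed above could look roughly like the sketch below. This is not the committed change: the `cluster.shards` and `cluster.configServer` fields, and both function names, are hypothetical and used only for illustration. The one thing assumed about each replica set object is that it exposes an `awaitReplication()` method, as `ReplSetTest` does in the mongo shell test library.

```javascript
// Sketch only: `cluster.shards` and `cluster.configServer` are hypothetical
// fields. Each replica set object is assumed to expose awaitReplication(),
// as ReplSetTest does in the mongo shell test library.
function awaitReplicationOnCluster(cluster) {
    const replSets = cluster.shards.concat([cluster.configServer]);
    replSets.forEach((rs) => rs.awaitReplication());
    return replSets.length;
}

// Hypothetical driver showing the intended ordering of the three phases.
function setupThenStartStepdown(cluster, workloads, startStepdownThread) {
    // 1. Run every workload's setup (these do not use majority read/write concern).
    workloads.forEach((workload) => workload.setup());
    // 2. Wait for replication on every shard and on the config server.
    awaitReplicationOnCluster(cluster);
    // 3. Only then start the thread that steps down primaries.
    startStepdownThread();
}
```

The point of the ordering is that no stepdown can begin while a setup write exists only on a primary, so a rollback of something like the TTL index creation should no longer be possible at suite start.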



 Comments   
Comment by Githook User [ 23/Aug/18 ]

Author:

{'name': 'Robert Guo', 'email': 'robert.guo@10gen.com', 'username': 'guoyr'}

Message: SERVER-36129 awaitReplication after setup functions in concurrency
suites

(cherry picked from commit 000436db8a0954ec52ee3f4596a3d61995b1fca8)
Branch: v4.0
https://github.com/mongodb/mongo/commit/f7ae3b339292ab3a8fbd25aaebf073383e299ae6

Comment by Githook User [ 14/Aug/18 ]

Author:

{'name': 'Robert Guo', 'email': 'robert.guo@10gen.com', 'username': 'guoyr'}

Message: SERVER-36129 awaitReplication after setup functions in concurrency
suites
Branch: master
https://github.com/mongodb/mongo/commit/000436db8a0954ec52ee3f4596a3d61995b1fca8

Comment by Robert Guo (Inactive) [ 09/Aug/18 ]

Max pointed out that we turn off catch-up, a change we made in April. The new concurrency stepdown suites were introduced at the end of May. I went through all concurrency_sharded_with_stepdown BFGs between April and May but didn't find a single instance of this failure.

I don't have another hypothesis at the moment so I'm going to punt on this investigation and maybe get back to it later this sprint.

Comment by Robert Guo (Inactive) [ 09/Aug/18 ]

max.hirschhorn Good point. It wouldn't.

New hypothesis: the js version of the stepdown hook doesn't issue the replSetStepUp command, allowing secondaries to catch up for 10 seconds (the default value). Issuing the step up command interrupts the old secondary/new primary's catch up, making it more likely to trigger a rollback.

Comment by Max Hirschhorn [ 09/Aug/18 ]

robert.guo, as far as I understand it, the reconfig() function only waits for all nodes to report that they are in one of the PRIMARY, SECONDARY, or ARBITER states. Why would that ensure the secondaries are caught up to the primary?

Comment by Robert Guo (Inactive) [ 09/Aug/18 ]

In the JS version, the election timeout is different for the server and the stepdown thread, causing a reconfig to be run after setup but before the first FSM workload.

Generated at Thu Feb 08 04:42:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.