[SERVER-36129] Concurrency stepdown suites should wait for replication of workload setups before starting stepdown thread Created: 13/Jul/18 Updated: 29/Oct/23 Resolved: 14/Aug/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.0.2, 4.1.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | Robert Guo (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: | v4.0 | ||||||||
| Sprint: | TIG 2018-08-27 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 23 | ||||||||
| Story Points: | 3 | ||||||||
| Description |
|
The concurrency stepdown suites wait until after setup has been called for each workload before starting the stepdown thread, because the setup methods don't run with overridden majority read/write concern. However, the effects of each setup are not guaranteed to be majority committed at this point, so an immediate stepdown can still roll back some of the setup, such as the creation of the TTL index in indexed_insert_ttl.js. One fix would be to wait for replication on all shards and the config server before starting the stepdown thread. |
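A minimal sketch of the ordering proposed in the description, not the committed fix. It assumes each replica set handle exposes an awaitReplication() method, as ReplSetTest does in the mongo shell test harness. The names awaitReplicationEverywhere, runSetupThenStepdown, cluster.shards, cluster.configServer, and startStepdownThread are hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper: block until every replica set has replicated
// the writes performed so far. Each element of `replSets` is assumed
// to expose awaitReplication(), as ReplSetTest does.
function awaitReplicationEverywhere(replSets) {
    replSets.forEach(function(rst) {
        rst.awaitReplication();
    });
}

// Hypothetical runner outline: run every workload's setup, wait for
// replication on all shards and the config server, and only then
// start the stepdown thread.
function runSetupThenStepdown(cluster, workloads, startStepdownThread) {
    workloads.forEach(function(workload) {
        workload.setup();
    });
    awaitReplicationEverywhere(cluster.shards.concat([cluster.configServer]));
    startStepdownThread();
}
```

With this ordering, setup side effects such as the TTL index creation have reached the secondaries before any stepdown can happen, so a newly elected primary already has them.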
| Comments |
| Comment by Githook User [ 23/Aug/18 ] |
|
Author: {'name': 'Robert Guo', 'email': 'robert.guo@10gen.com', 'username': 'guoyr'} Message: (cherry picked from commit 000436db8a0954ec52ee3f4596a3d61995b1fca8) |
| Comment by Githook User [ 14/Aug/18 ] |
|
Author: {'name': 'Robert Guo', 'email': 'robert.guo@10gen.com', 'username': 'guoyr'} Message: |
| Comment by Robert Guo (Inactive) [ 09/Aug/18 ] |
|
Max pointed out we turn off catch up, which we did in April. The new concurrency stepdown suites were introduced at the end of May. I went through all concurrency_sharded_with_stepdown BFGs between April and May but didn't find a single instance of this failure. I don't have another hypothesis at the moment so I'm going to punt on this investigation and maybe get back to it later this sprint. |
| Comment by Robert Guo (Inactive) [ 09/Aug/18 ] |
|
max.hirschhorn Good point. It wouldn't. New hypothesis: the js version of the stepdown hook doesn't issue the replSetStepUp command, allowing secondaries to catch up for 10 seconds (the default value). Issuing the step up command interrupts the old secondary/new primary's catch up, making it more likely to trigger a rollback. |
| Comment by Max Hirschhorn [ 09/Aug/18 ] |
|
robert.guo, as far as I understand it, the reconfig() function only waits for all nodes to report that they are in one of the PRIMARY, SECONDARY, or ARBITER states. Why would that ensure the secondaries are caught up to the primary? |
| Comment by Robert Guo (Inactive) [ 09/Aug/18 ] |
|
In the JS version, the election timeout is different for the server and the stepdown thread, causing a reconfig to be run after setup but before the first FSM workload. |