[SERVER-47544] Stepdown suites can result in spurious InterruptedDueToReplStateChange errors Created: 14/Apr/20 Updated: 29/Oct/23 Resolved: 29/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.0-rc6, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Maria van Keulen | Assignee: | Pavithra Vetriselvan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | Repl 2020-05-04 |
| Linked BF Score: | 30 |
| Description |
|
Presently, the ValidateCollections test hook can encounter InterruptedDueToReplStateChange when it runs in a test that also runs the ContinuousStepdown hook. Per the comments in this ticket, this may be due to resmoke.py logging delays that delay heartbeats and therefore cause spurious elections (see comment thread in …). Stepdown suites do not presently guard against these spurious election scenarios. |
| Comments |
| Comment by Githook User [ 14/May/20 ] | ||||||
|
Author: {'name': 'Pavi Vetriselvan', 'email': 'pvselvan@umich.edu', 'username': 'pvselvan'}Message:
(cherry picked from commit f59f63db6c37c0d4657b57d559c95d830b0e34c2)
(cherry picked from commit 4d91fac171cbe3f2af53d9258965399e648a1947)
(cherry picked from commit a43cb23defc6182d08a7814e4731ef98f2d30b6a)
(cherry picked from commit 81e0ad27c280c02a49beb65ff4473d5dce62b089)
(cherry picked from commit 2debab7987b24bf902f9a128654ce928441c29a2)
(cherry picked from commit 91672e58f1169c7edd684b911f20f62b8a71f8d1)
(cherry picked from commit 81d53a715f49827a9f2538d4572f9b01f2b12887) | ||||||
| Comment by Githook User [ 29/Apr/20 ] | ||||||
|
Author: {'name': 'Pavi Vetriselvan', 'email': 'pvselvan@umich.edu', 'username': 'pvselvan'}Message: | ||||||
| Comment by Judah Schvimer [ 15/Apr/20 ] | ||||||
|
I think I included the if not self.all_nodes_electable: because I thought the extra control wasn't necessary when all nodes are electable. I don't see any reason not to always increase the election timeout if that's required for robustness as max.hirschhorn explains. | ||||||
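A rough sketch of the change discussed here: drop the `if not self.all_nodes_electable:` guard so the election timeout is always raised. The helper names below are hypothetical and this is not the resmoke.py fixture code; `electionTimeoutMillis` is the replica set configuration setting, and the stall check is a deliberate simplification of how a delayed heartbeat leads to an election.

```python
from typing import Dict

# MongoDB's default value for settings.electionTimeoutMillis.
DEFAULT_ELECTION_TIMEOUT_MILLIS = 10_000
# The 24-hour value mentioned in this thread, in milliseconds.
ONE_DAY_MILLIS = 24 * 60 * 60 * 1000


def build_replset_settings_always_raised() -> Dict[str, int]:
    """Hypothetical helper: raise the election timeout unconditionally."""
    return {"electionTimeoutMillis": ONE_DAY_MILLIS}


def stall_exceeds_election_timeout(stall_millis: int, settings: Dict[str, int]) -> bool:
    """Simplified model: a heartbeat stall longer than the election
    timeout is treated as enough to trigger a spurious election."""
    timeout = settings.get("electionTimeoutMillis", DEFAULT_ELECTION_TIMEOUT_MILLIS)
    return stall_millis > timeout
```

Under this model, a 30-second logging stall exceeds the default timeout but not the raised one, which is the robustness argument made above.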
| Comment by Maria van Keulen [ 15/Apr/20 ] | ||||||
|
Got it, thanks max.hirschhorn. I've updated the ticket title and description. Given that this seems to be a suite configuration issue rather than a problem with the ValidateCollections hook, I'm passing this to Replication. | ||||||
| Comment by Max Hirschhorn [ 15/Apr/20 ] | ||||||
|
I think judah.schvimer would need to answer why we didn't change the election timeout for the stepdown suites given that the ContinuousStepdown thread should be running the replSetStepUp or replSetStepDown commands directly. Going back to running the Evergreen task with --jobs=16 on the -large distros, the other option may be to add an entry for the failing tasks in evergreen_resmoke_job_count.py or set a resmoke_jobs_max value for them in etc/evergreen.yml. | ||||||
| Comment by Maria van Keulen [ 15/Apr/20 ] | ||||||
|
Ah, thanks for clarifying. So to reiterate--the stepdown suites currently lack the protection against slow logging behavior, and this ticket should address that. | ||||||
| Comment by Max Hirschhorn [ 15/Apr/20 ] | ||||||
|
We skip raising the election timeout to 24 hours when self.all_nodes_electable == True.
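The condition described above can be sketched roughly as follows. This is a minimal illustration, not the actual resmoke.py fixture code: `build_replset_settings` and the constant name are hypothetical, and only the `all_nodes_electable` flag and the `electionTimeoutMillis` replica set setting come from this thread and MongoDB.

```python
from typing import Dict

# The 24-hour election timeout mentioned in this thread, in milliseconds.
ONE_DAY_MILLIS = 24 * 60 * 60 * 1000


def build_replset_settings(all_nodes_electable: bool) -> Dict[str, int]:
    """Hypothetical helper mirroring the guard described in this comment."""
    settings: Dict[str, int] = {}
    # The timeout is raised only when some nodes are not electable, so
    # stepdown suites (all nodes electable) keep the server default.
    if not all_nodes_electable:
        settings["electionTimeoutMillis"] = ONE_DAY_MILLIS
    return settings
```

With `all_nodes_electable=True`, as in the stepdown suites, the returned settings leave the election timeout at the server default.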
| ||||||
| Comment by Maria van Keulen [ 15/Apr/20 ] | ||||||
max.hirschhorn Could you please clarify this statement? I interpret the resolution of the discussions in | ||||||
| Comment by Max Hirschhorn [ 15/Apr/20 ] | ||||||
|
maria.vankeulen, the ContinuousStepdown thread should have already been paused by the time the ValidateCollections hook is run. This is what ContinuousStepdown.after_test() is responsible for doing. Based on the linked BF ticket and the TIG-2499 ticket you had filed, I suspect you're actually running into an issue where the Evergreen machine is being overwhelmed from running the replica_sets_multi_stmt_txn_{kill,terminate}_primary_jscore_passthrough tasks with --jobs=16 on the -large distros. If resmoke.py is stalled from reading the output pipe of the mongod processes, then it'll cause nodes not to respond to heartbeat requests due to the ReplicationCoordinatorImpl::_mutex being held, and eventually lead to an unexpected stepdown. We don't raise the election timeout to 24 hours to protect against this slow logging behavior when running in the stepdown suites. See also |