[SERVER-41096] ContinuousStepdown thread and resmoke runner do not synchronize properly on the "stepdown permitted file" and "stepping down file"
Created: 10/May/19  Updated: 29/Oct/23  Resolved: 20/Jun/19

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.0.11, 4.2.0-rc2, 4.3.1

Type: Bug
Priority: Major - P3
Reporter: Esha Maharishi (Inactive)
Assignee: Max Hirschhorn
Resolution: Fixed
Votes: 0
Labels: bkp, tig-concurrency
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-39993 Add kill and terminate versions of co... Closed
Problem/Incident
causes SERVER-42195 Stepdown suites fail with Python exce... Closed
causes SERVER-72957 stepdown suites logs are polluted wit... Closed
Related
is related to SERVER-39770 FSM connection cache setup can fail w... Closed
is related to SERVER-34555 Migrate concurrency_sharded_with_step... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.2, v4.0
Steps To Reproduce:

I am able to show this race on v4.0 by applying the following patch and running yield_group.js in the concurrency_sharded_with_stepdowns suite.

Note that this only reproduces on v4.0, since the replication team completed the "avoid closing connections on stepdown" project in 4.1.x.

diff --git a/buildscripts/resmokelib/testing/hooks/stepdown.py b/buildscripts/resmokelib/testing/hooks/stepdown.py
index 47bd3e5..22769b8 100644
--- a/buildscripts/resmokelib/testing/hooks/stepdown.py
+++ b/buildscripts/resmokelib/testing/hooks/stepdown.py
@@ -247,6 +247,9 @@ class _StepdownThread(threading.Thread):  # pylint: disable=too-many-instance-at
         self._stepdown_starting()
         try:
             if self._is_permitted():
+                self.logger.info("Got permission to run stepdowns, now waiting for 10 seconds")
+                self._wait(10)
+                self.logger.info("Finished waiting")
                 for rs_fixture in self._rs_fixtures:
                     self._step_down(rs_fixture)
         finally:
diff --git a/jstests/concurrency/fsm_workloads/yield_group.js b/jstests/concurrency/fsm_workloads/yield_group.js
index c44866c..807f37d 100644
--- a/jstests/concurrency/fsm_workloads/yield_group.js
+++ b/jstests/concurrency/fsm_workloads/yield_group.js
@@ -72,12 +72,14 @@ var $config = (function() {
      * Reset parameters.
      */
     function teardown(db, collName, cluster) {
+        while(true) {
         cluster.executeOnMongodNodes(function resetYieldParams(db) {
             assertAlways.commandWorked(
                 db.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 128}));
             assertAlways.commandWorked(
                 db.adminCommand({setParameter: 1, internalQueryExecYieldPeriodMS: 10}));
         });
+        }
     }
 
     return {

Sprint: STM 2019-07-01
Participants:
Linked BF Score: 18
Story Points: 3

 Description   

Before running workload teardowns, the fsm runner's main thread

But the continuous stepdown thread does the following:

This allows the following interleaving:

  • continuous stepdown thread checks for "stepdown permitted file" and sees it
  • fsm runner thread removes "stepdown permitted file"
  • fsm runner thread checks for "stepping down file" and doesn't see it
  • fsm runner thread starts executing a workload's teardown
  • continuous stepdown thread starts a stepdown round, which can cause the workload's teardown thread to get a network error
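
The interleaving above is a check-then-act race on two flag files. A minimal sketch that replays the five steps in order is shown here. It is not the resmoke code: the file names, the `touch` helper, and the `replay_interleaving` function are hypothetical, and it assumes for illustration that the "stepping down file" only becomes visible once a round begins.

```python
import os
import tempfile


def touch(path):
    """Create an empty flag file."""
    with open(path, "w"):
        pass


def replay_interleaving():
    """Replay the five steps in order and return what each side ended up doing."""
    workdir = tempfile.mkdtemp()
    # Hypothetical file names; the real resmoke paths are not given in this ticket.
    permitted_file = os.path.join(workdir, "permitted")
    stepping_down_file = os.path.join(workdir, "stepping_down")
    events = []

    touch(permitted_file)  # the test is running, so stepdowns are permitted

    # 1. Stepdown thread checks for the permitted file and sees it.
    stepdown_saw_permission = os.path.exists(permitted_file)
    # 2. Runner removes the permitted file.
    os.remove(permitted_file)
    # 3. Runner checks for the stepping-down file and does not see it.
    runner_saw_stepdown = os.path.exists(stepping_down_file)
    # 4. Runner concludes it is safe and starts the teardown.
    if not runner_saw_stepdown:
        events.append("runner: teardown started")
    # 5. Stepdown thread acts on its now-stale answer from step 1.
    #    (Assumption: the stepping-down flag only appears at this point.)
    if stepdown_saw_permission:
        touch(stepping_down_file)
        events.append("stepdown thread: stepdown round started")
    return events


print(replay_interleaving())
```

Both events are recorded, so the teardown and the stepdown round overlap. Neither check is wrong on its own; the problem is that each side acts on an answer the other side has since invalidated.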


 Comments   
Comment by Githook User [ 21/Jun/19 ]

Author: Max Hirschhorn <max.hirschhorn@mongodb.com> (visemet)

Message: SERVER-41096 Fix file-based protocol for permitting stepdowns.

Changes the file-based protocol for controlling when stepdowns are
permitted to be a one-shot mechanism usable only once during a test.
That is to say, the indication for whether the stepdown thread isn't
currently and will no longer continue to run stepdowns during the test
persists until after the test finishes.

(cherry picked from commit eea65efbdd4f20022973cf38455c22c5b62af9f3)
Branch: v4.0
https://github.com/mongodb/mongo/commit/726cfdeab21ea104666018b1e52643ec2bb5d366
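
One way to picture a one-shot protocol like the one this commit message describes is a request/acknowledge handshake whose files are never removed while the test runs. The sketch below is an illustration under that assumption, not the code from the commit: the class `OneShotStepdownGate`, its method names, and the file names `idle_request` and `idle_ack` are all hypothetical.

```python
import os
import tempfile


class OneShotStepdownGate:
    """Hypothetical file-based gate; the names are illustrative, not resmoke's."""

    def __init__(self, directory):
        self._request = os.path.join(directory, "idle_request")
        self._ack = os.path.join(directory, "idle_ack")

    @staticmethod
    def _touch(path):
        with open(path, "w"):
            pass

    # Runner side -----------------------------------------------------------
    def request_idle(self):
        """Ask the stepdown thread to stop. The request is never withdrawn."""
        self._touch(self._request)

    def is_idle_acknowledged(self):
        """True once the stepdown thread has promised to run no more rounds."""
        return os.path.exists(self._ack)

    # Stepdown thread side --------------------------------------------------
    def try_begin_round(self):
        """Called between rounds. Returns True if another round may run."""
        if os.path.exists(self._request):
            # Acknowledge only while no round is in progress.
            self._touch(self._ack)
            return False
        return True


gate = OneShotStepdownGate(tempfile.mkdtemp())
first_round_allowed = gate.try_begin_round()    # no request yet
gate.request_idle()                             # runner wants to tear down
acked_too_early = gate.is_idle_acknowledged()   # thread has not looked yet
second_round_allowed = gate.try_begin_round()   # thread sees the request
acked_after_check = gate.is_idle_acknowledged()
third_round_allowed = gate.try_begin_round()    # the request persists
```

In this sketch the runner would start a teardown only after `is_idle_acknowledged()` returns True. Because the acknowledgement is written between rounds and the request file is not deleted until the test is over, the stepdown thread cannot act on a stale permission check.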

Comment by Githook User [ 21/Jun/19 ]

Author: Max Hirschhorn <max.hirschhorn@mongodb.com> (visemet)

Message: SERVER-41096 Fix file-based protocol for permitting stepdowns.

Changes the file-based protocol for controlling when stepdowns are
permitted to be a one-shot mechanism usable only once during a test.
That is to say, the indication for whether the stepdown thread isn't
currently and will no longer continue to run stepdowns during the test
persists until after the test finishes.

(cherry picked from commit eea65efbdd4f20022973cf38455c22c5b62af9f3)
Branch: v4.2
https://github.com/mongodb/mongo/commit/f8257565ce577b5a78b26a5d04a0cdc3e3700db6

Comment by Githook User [ 20/Jun/19 ]

Author: Max Hirschhorn <max.hirschhorn@mongodb.com> (visemet)

Message: SERVER-41096 Fix file-based protocol for permitting stepdowns.

Changes the file-based protocol for controlling when stepdowns are
permitted to be a one-shot mechanism usable only once during a test.
That is to say, the indication for whether the stepdown thread isn't
currently and will no longer continue to run stepdowns during the test
persists until after the test finishes.
Branch: master
https://github.com/mongodb/mongo/commit/eea65efbdd4f20022973cf38455c22c5b62af9f3

Comment by Vesselina Ratcheva (Inactive) [ 05/Jun/19 ]

I started seeing this very frequently in SERVER-39993. We don't close connections anymore, but we (will) kill/terminate the nodes, which should have about the same effect with respect to the teardown ops. We're not sure why there is such a noticeable difference in frequency: I can reliably get at least one repro per patch for SERVER-39993, but we only have a handful of BFGs documented for the 4.0 case. This question remains to be answered. At any rate, it would be wise to fix this before those suites go in, so I'm marking this as a dependency.
