Core Server / SERVER-41096

ContinuousStepdown thread and resmoke runner do not synchronize properly on the "stepdown permitted file" and "stepping down file"

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Backport Requested:
      v4.2, v4.0
    • Steps To Reproduce:
      I am able to show this race on v4.0 by applying the following patch and running yield_group.js in the concurrency_sharded_with_stepdowns suite.

      Note that this only reproduces on v4.0, since the replication team completed the "avoid closing connections on stepdown" project in 4.1.x.

      diff --git a/buildscripts/resmokelib/testing/hooks/stepdown.py b/buildscripts/resmokelib/testing/hooks/stepdown.py
      index 47bd3e5..22769b8 100644
      --- a/buildscripts/resmokelib/testing/hooks/stepdown.py
      +++ b/buildscripts/resmokelib/testing/hooks/stepdown.py
      @@ -247,6 +247,9 @@ class _StepdownThread(threading.Thread):  # pylint: disable=too-many-instance-at
               self._stepdown_starting()
               try:
                   if self._is_permitted():
      +                self.logger.info("Got permission to run stepdowns, now waiting for 10 seconds")
      +                self._wait(10)
      +                self.logger.info("Finished waiting")
                       for rs_fixture in self._rs_fixtures:
                           self._step_down(rs_fixture)
               finally:
      diff --git a/jstests/concurrency/fsm_workloads/yield_group.js b/jstests/concurrency/fsm_workloads/yield_group.js
      index c44866c..807f37d 100644
      --- a/jstests/concurrency/fsm_workloads/yield_group.js
      +++ b/jstests/concurrency/fsm_workloads/yield_group.js
      @@ -72,12 +72,14 @@ var $config = (function() {
            * Reset parameters.
            */
           function teardown(db, collName, cluster) {
      +        while(true) {
               cluster.executeOnMongodNodes(function resetYieldParams(db) {
                   assertAlways.commandWorked(
                       db.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 128}));
                   assertAlways.commandWorked(
                       db.adminCommand({setParameter: 1, internalQueryExecYieldPeriodMS: 10}));
               });
      +        }
           }
       
           return {
      

    • Sprint:
      STM 2019-07-01
    • Linked BF Score:
      18
    • Story Points:
      3

      Description

      Before running workload teardowns, the fsm runner's main thread

      But the continuous stepdown thread does the following:

      This allows the following interleaving:

      • continuous stepdown thread checks for "stepdown permitted file" and sees it
      • fsm runner thread removes "stepdown permitted file"
      • fsm runner thread checks for "stepping down file" and doesn't see it
      • fsm runner thread starts executing a workload's teardown
      • continuous stepdown thread starts a stepdown round, which can cause the workload's teardown thread to get a network error
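
The interleaving above is a check-then-act race on two marker files. The following is a minimal, single-threaded sketch that replays those five steps in order against a temporary directory. It is not the resmoke hook code: the file names `permitted` and `stepping_down` and the function `simulate_interleaving` are hypothetical, and the only assumption taken from the list is the order in which each side checks and acts.

```python
import os
import tempfile


def touch(path):
    """Create an empty marker file."""
    with open(path, "w"):
        pass


def simulate_interleaving():
    """Replay the five steps from the list above, in order.

    Returns (teardown_started, stepdown_started). Both being True means
    the teardown and a stepdown round overlap.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-ins for the "stepdown permitted file" and
        # the "stepping down file".
        permitted = os.path.join(tmp, "permitted")
        stepping_down = os.path.join(tmp, "stepping_down")

        # Initial state: stepdowns are currently allowed.
        touch(permitted)

        # 1. Stepdown thread checks for the permitted file and sees it.
        stepdown_saw_permission = os.path.exists(permitted)

        # 2. Runner thread removes the permitted file.
        os.remove(permitted)

        # 3. Runner thread checks for the stepping down file; it is absent.
        runner_saw_stepping_down = os.path.exists(stepping_down)

        # 4. Runner thread concludes it is safe and starts the teardown.
        teardown_started = not runner_saw_stepping_down

        # 5. Stepdown thread acts on its now-stale check from step 1.
        stepdown_started = False
        if stepdown_saw_permission:
            touch(stepping_down)
            stepdown_started = True

        return teardown_started, stepdown_started


if __name__ == "__main__":
    print(simulate_interleaving())
```

In this replay both flags end up set, because neither side's check covers the window between the stepdown thread reading the permitted file and announcing that it is stepping down.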
