Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.0.11, 4.2.0-rc2, 4.3.1
Affects Version/s: None
Component/s: Testing Infrastructure
Labels:
- bkp
- tig-concurrency

Backwards Compatibility:
Fully Compatible
Backport Requested:

v4.2, v4.0
Steps To Reproduce:

I am able to show this race on v4.0 by applying the following patch and running yield_group.js in the concurrency_sharded_with_stepdowns suite.

Note that this only reproduces on v4.0, since the replication team completed the "avoid closing connections on stepdown" project in 4.1.x.

diff --git a/buildscripts/resmokelib/testing/hooks/stepdown.py b/buildscripts/resmokelib/testing/hooks/stepdown.py
index 47bd3e5..22769b8 100644
--- a/buildscripts/resmokelib/testing/hooks/stepdown.py
+++ b/buildscripts/resmokelib/testing/hooks/stepdown.py
@@ -247,6 +247,9 @@ class _StepdownThread(threading.Thread):  # pylint: disable=too-many-instance-at
                 self._stepdown_starting()
                 try:
                     if self._is_permitted():
+                        self.logger.info("Got permission to run stepdowns, now waiting for 10 seconds")
+                        self._wait(10)
+                        self.logger.info("Finished waiting")
                         for rs_fixture in self._rs_fixtures:
                             self._step_down(rs_fixture)
                 finally:
diff --git a/jstests/concurrency/fsm_workloads/yield_group.js b/jstests/concurrency/fsm_workloads/yield_group.js
index c44866c..807f37d 100644
--- a/jstests/concurrency/fsm_workloads/yield_group.js
+++ b/jstests/concurrency/fsm_workloads/yield_group.js
@@ -72,12 +72,14 @@ var $config = (function() {
      * Reset parameters.
      */
     function teardown(db, collName, cluster) {
+        while(true) {
         cluster.executeOnMongodNodes(function resetYieldParams(db) {
             assertAlways.commandWorked(
                 db.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 128}));
             assertAlways.commandWorked(
                 db.adminCommand({setParameter: 1, internalQueryExecYieldPeriodMS: 10}));
         });
+        }
     }
 
     return {
Sprint:
STM 2019-07-01
Linked BF Score:
18
Story Points:
3
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

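Before running workload teardowns, the fsm runner's main thread removes the "stepdown permitted file" and then checks that the "stepping down file" is not present.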

But the continuous stepdown thread does the following:

checks for the "stepdown permitted file"
on starting a stepdown round, writes the "stepping down file"
on completing the stepdown round, removes the "stepping down file."
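
The two sides of this file-based handshake can be sketched roughly as follows. This is a minimal illustration of the steps listed above, not the code in stepdown.py or the fsm runner: the file names, the function names, and the use of a temporary directory are all assumptions made for the example.

```python
import os
import tempfile

# Hypothetical file names; the real ones are chosen by the test harness.
PERMITTED = "permitted"
STEPPING_DOWN = "stepping_down"


def stepdown_round(state_dir, do_stepdown):
    """One round of the stepdown thread. Returns True if a stepdown ran."""
    # Step 1: check for the "stepdown permitted file".
    if not os.path.exists(os.path.join(state_dir, PERMITTED)):
        return False
    # The check above and the write below are separate steps, so another
    # thread can run between them.
    stepping = os.path.join(state_dir, STEPPING_DOWN)
    # Step 2: on starting the round, write the "stepping down file".
    with open(stepping, "w"):
        pass
    try:
        do_stepdown()
    finally:
        # Step 3: on completing the round, remove the "stepping down file".
        os.remove(stepping)
    return True


def runner_may_start_teardown(state_dir):
    """fsm runner side: revoke permission, then look for an in-flight round."""
    permitted = os.path.join(state_dir, PERMITTED)
    if os.path.exists(permitted):
        os.remove(permitted)
    return not os.path.exists(os.path.join(state_dir, STEPPING_DOWN))
```

In this sketch, calling `stepdown_round(tempfile.mkdtemp(), lambda: None)` returns False because no permitted file exists yet. The point of interest is the gap between the existence check and the write of the "stepping down file".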

This allows the following interleaving:

continuous stepdown thread checks for "stepdown permitted file" and sees it
fsm runner thread removes "stepdown permitted file"
fsm runner thread checks for "stepping down file" and doesn't see it
fsm runner thread starts executing a workload's teardown
continuous stepdown thread starts a stepdown round, which can cause the workload's teardown thread to get a network error
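
The interleaving above can be replayed deterministically in a single thread. The sketch below is only an illustration: it uses an in-memory set as a stand-in for the file system, and the names `replay_interleaving`, "permitted", and "stepping_down" are hypothetical rather than taken from the harness.

```python
def replay_interleaving():
    """Replay the five steps in order and report whether teardown and stepdown overlap."""
    files = {"permitted"}  # in-memory stand-in for the marker files
    events = []

    # 1. The stepdown thread checks for the permitted file and sees it.
    stepdown_saw_permission = "permitted" in files

    # 2. The fsm runner thread removes the permitted file.
    files.discard("permitted")

    # 3. The fsm runner thread checks for the stepping down file and does not see it.
    runner_saw_stepping_down = "stepping_down" in files

    # 4. The fsm runner thread starts the workload's teardown.
    teardown_running = not runner_saw_stepping_down
    if teardown_running:
        events.append("teardown started")

    # 5. The stepdown thread acts on its now-stale check and starts a round.
    if stepdown_saw_permission:
        files.add("stepping_down")
        events.append("stepdown started")

    overlap = teardown_running and "stepping_down" in files
    return events, overlap


events, overlap = replay_interleaving()
print(events, overlap)
```

Both checks pass individually, yet the teardown and the stepdown round end up running at the same time, because the stepdown thread's permission check is already stale by the time it writes its marker.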

causes

SERVER-42195 Stepdown suites fail with Python exception when run with --repeat >1

Closed

SERVER-72957 stepdown suites logs are polluted with non relevant error messages

Closed

is depended on by

SERVER-39993 Add kill and terminate versions of concurrency step down suites

Closed

is related to

SERVER-39770 FSM connection cache setup can fail with step down

Closed

SERVER-34555 Migrate concurrency_sharded_with_stepdowns{,_and_balancer}.yml test suites to run directly via resmoke.py

Closed

Assignee: Max Hirschhorn
Reporter: Esha Maharishi (Inactive)
Participants: Esha Maharishi, Githook User, Max Hirschhorn, Vesselina Ratcheva
Votes: 0
Watchers: 5

Created: May 10 2019 07:44:22 PM UTC
Updated: Oct 29 2023 10:21:08 PM UTC
Resolved: Jun 20 2019 09:23:33 PM UTC
Confidence Status Last Update: 19/Jun/19 6:17 AM
