- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- None
- Fully Compatible
- ALL
- v4.2, v4.0, v3.6, v3.4
- 8
In periodic_kill_secondaries.py we call _kill_secondaries() at the start of each test. For each secondary, that function disables the rsSyncApplyStop failpoint, sleeps 100 ms, then kills the secondary.
In sync_tail.cpp, in _oplogApplication, we have:
if (MONGO_FAIL_POINT(rsSyncApplyStop)) {
    while (MONGO_FAIL_POINT(rsSyncApplyStop)) {
        // Tests should not trigger clean shutdown while that failpoint is active. If we
        // think we need this, we need to think hard about what the behavior should be.
        if (inShutdown()) {
            severe() << "Turn off rsSyncApplyStop before attempting clean shutdown";
            fassertFailedNoTrace(40304);
        }
        sleepmillis(10);
    }
}
I think there's a clear race: if the oplog application thread happens to hang for longer than 100 ms between checking MONGO_FAIL_POINT(rsSyncApplyStop) and checking inShutdown(), then periodic_kill_secondaries.py can turn off the failpoint and begin a clean shutdown of mongod during that hang. By the time the thread calls inShutdown(), it returns true and we fassert with code 40304, even though the failpoint has already been disabled.
As the comment says, we need to think hard about what the behavior should be. One idea is for periodic_kill_secondaries.py to wait to shut down mongod until the server has definitely left the while loop; we could add a log message after the while loop that periodic_kill_secondaries.py waits for before shutting the node down.
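A sketch of that alternative; the message text and its placement are assumptions for illustration, not an actual change:

while (MONGO_FAIL_POINT(rsSyncApplyStop)) {
    // Unchanged: still fassert if shutdown starts while the failpoint is on.
    if (inShutdown()) {
        severe() << "Turn off rsSyncApplyStop before attempting clean shutdown";
        fassertFailedNoTrace(40304);
    }
    sleepmillis(10);
}
// Hypothetical marker line: periodic_kill_secondaries.py would wait for this
// message in the log after disabling the failpoint and before shutting the
// node down.
log() << "rsSyncApplyStop disabled, resuming oplog application";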
I prefer a different idea: let's handle the case where rsSyncApplyStop is still enabled when mongod shuts down. As a side effect, this change also fixes the race where rsSyncApplyStop is disabled immediately before mongod shuts down. We can simply exit the while loop when inShutdown() is true, and from there proceed to the normal shutdown path.
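A minimal sketch of that change, assuming the surrounding code in _oplogApplication stays as quoted above (the actual patch may differ):

if (MONGO_FAIL_POINT(rsSyncApplyStop)) {
    while (MONGO_FAIL_POINT(rsSyncApplyStop)) {
        // Shutdown may begin while the failpoint is enabled, or in the window
        // right after a test disables it; instead of fasserting, stop waiting
        // and fall through to the normal shutdown path.
        if (inShutdown()) {
            break;
        }
        sleepmillis(10);
    }
}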