[SERVER-43703] Race when disabling rsSyncApplyStop failpoint and stopping server Created: 28/Sep/19  Updated: 29/Oct/23  Resolved: 05/Oct/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.3.1, 4.2.2, 4.0.14

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2, v4.0, v3.6, v3.4
Participants:
Linked BF Score: 8

 Description   

In periodic_kill_secondaries.py we call _kill_secondaries() at the start of each test. For each secondary, that function disables the rsSyncApplyStop failpoint, sleeps 100 ms, then kills the secondary.

In sync_tail.cpp, in _oplogApplication, we have:

        if (MONGO_FAIL_POINT(rsSyncApplyStop)) {
            while (MONGO_FAIL_POINT(rsSyncApplyStop)) {
                // Tests should not trigger clean shutdown while that failpoint is active. If we
                // think we need this, we need to think hard about what the behavior should be.
                if (inShutdown()) {
                    severe() << "Turn off rsSyncApplyStop before attempting clean shutdown";
                    fassertFailedNoTrace(40304);
                }
                sleepmillis(10);
            }
        }

I think there's a clear race: if the oplog application thread happens to hang for longer than 100 ms between checking MONGO_FAIL_POINT(rsSyncApplyStop) and checking inShutdown(), then periodic_kill_secondaries.py can turn off the failpoint and start shutting down mongod during that hang. By the time the thread calls inShutdown(), it returns true and we fassert.
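The interleaving described above can be replayed deterministically in a small standalone sketch. Everything here is a hypothetical stand-in, not server code: failPointEnabled models MONGO_FAIL_POINT(rsSyncApplyStop), shuttingDown models inShutdown(), and the stallBetweenChecks callback models the thread hanging between the two checks. The function returns true at the point where the quoted loop would fassert.

```cpp
#include <atomic>
#include <functional>

// Hypothetical stand-ins for MONGO_FAIL_POINT(rsSyncApplyStop) and inShutdown().
std::atomic<bool> failPointEnabled{false};
std::atomic<bool> shuttingDown{false};

// Same shape as the quoted loop (the 10 ms sleep is omitted for brevity).
// Returns true where the original code would call fassertFailedNoTrace(40304).
bool originalLoopWouldFassert(const std::function<void()>& stallBetweenChecks) {
    while (failPointEnabled.load()) {
        // Models the oplog application thread hanging after it has observed
        // the failpoint as enabled but before it checks for shutdown.
        stallBetweenChecks();
        if (shuttingDown.load()) {
            return true;  // would fassert, even if the failpoint is now off
        }
    }
    return false;  // left the loop normally
}
```

If the callback disables the failpoint and then begins shutdown, which is what the test does during a long enough stall, the function reports an fassert even though the test turned the failpoint off first.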

As the comment says, we need to think hard about what the behavior should be. One idea is for periodic_kill_secondaries.py to wait to shut down mongod until the mongod code has definitely left the while loop; we could add a log message after the while loop which periodic_kill_secondaries.py could wait for.

I prefer a different idea: Let's handle the case where rsSyncApplyStop is still enabled when mongod shuts down. As a side effect, this change will also fix the race condition if rsSyncApplyStop is disabled immediately before mongod shuts down. We can simply exit the while loop if inShutdown() is true, and from there we proceed to the normal shutdown path.
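A minimal standalone sketch of this proposal, under the assumption that leaving the loop is enough to reach the normal shutdown path. The names failPointEnabled, shuttingDown, WaitResult and waitWhileFailPointEnabled are hypothetical stand-ins for illustration, not the server's API, and this is not necessarily the change that was committed.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-ins for MONGO_FAIL_POINT(rsSyncApplyStop) and inShutdown().
std::atomic<bool> failPointEnabled{false};
std::atomic<bool> shuttingDown{false};

enum class WaitResult { kFailPointDisabled, kShutdown };

// Blocks while the failpoint is enabled, but returns instead of asserting
// once shutdown has begun, so the caller can take its normal shutdown path.
WaitResult waitWhileFailPointEnabled() {
    while (failPointEnabled.load()) {
        if (shuttingDown.load()) {
            return WaitResult::kShutdown;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return WaitResult::kFailPointDisabled;
}
```

With this shape, the order in which the test disables the failpoint and starts shutdown no longer matters: either observation ends the wait without an fassert.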



 Comments   
Comment by Githook User [ 01/Nov/19 ]

Author:

{'username': 'ajdavis', 'email': 'jesse@mongodb.com', 'name': 'A. Jesse Jiryu Davis'}

Message: SERVER-43703 On shutdown check rsSyncApplyStop is disabled

The previous code had a race: if the test disables the rsSyncApplyStop
failpoint and immediately kills mongod, then mongod could fassert. Now
it fasserts only if the failpoint is still enabled on shutdown.
Branch: v4.0
https://github.com/mongodb/mongo/commit/9254992748ebe5dcf441591bf26a2ab5448220ba

Comment by Githook User [ 23/Oct/19 ]

Author:

{'username': 'ajdavis', 'email': 'jesse@mongodb.com', 'name': 'A. Jesse Jiryu Davis'}

Message: SERVER-43703 On shutdown check rsSyncApplyStop is disabled

The previous code had a race: if the test disables the rsSyncApplyStop
failpoint and immediately kills mongod, then mongod could fassert. Now
it fasserts only if the failpoint is still enabled on shutdown.
Branch: v4.2
https://github.com/mongodb/mongo/commit/460f931088439551bc7a3af2366ac5a7397391e6

Comment by Githook User [ 05/Oct/19 ]

Author:

{'username': 'ajdavis', 'email': 'jesse@mongodb.com', 'name': 'A. Jesse Jiryu Davis'}

Message: SERVER-43703 On shutdown check rsSyncApplyStop is disabled

The previous code had a race: if the test disables the rsSyncApplyStop
failpoint and immediately kills mongod, then mongod could fassert. Now
it fasserts only if the failpoint is still enabled on shutdown.
Branch: master
https://github.com/mongodb/mongo/commit/2664c92b226bf94bb9da85c58b8820771c79c434
