[SERVER-47636] Force reconfig running concurrently with step up can cause reconfig in drain mode to fail Created: 17/Apr/20  Updated: 29/Oct/23  Resolved: 22/Apr/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.5.1, 4.4.0-rc1
Fix Version/s: 4.4.0-rc4, 4.7.0

Type: Bug Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: William Schultz (Inactive)
Resolution: Fixed Votes: 0
Labels: safe-reconfig-related
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File force_reconfig_drain_mode_repro.diff    
Issue Links:
Backports
Depends
Related
related to SERVER-47142 Check primary before writing replset ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Steps To Reproduce:

Applying this diff (force_reconfig_drain_mode_repro.diff) on this commit and running the following commands should reproduce the bug:

ninja -j400 build/ninja/mongo/db/repl/db_repl_coordinator_test
build/ninja/mongo/db/repl/db_repl_coordinator_test  --suite ReplCoordTest --filter NodeReturnsNotMasterWhenRunningForceReconfigWhileInDrainMode

Sprint: Repl 2020-05-04
Participants:
Linked BF Score: 42

 Description   

After a node has been elected primary and drained the ops from its buffer, it will check if it needs to run a reconfig to increment its config term. It does this under the replication coordinator mutex, but then releases the lock before running the reconfig. If a force reconfig is running concurrently it may install a new config with term -1 after we do this check and release our lock but before we run the reconfig. If this happens, we will then try to run a reconfig where we set the config version to the version installed by the force reconfig, and the config term to the node's current term. If the force reconfig installed version 'version' and the node's current term is 'term', then we will run a reconfig to (version, term), while our current config is (version, -1). Since we ignore terms for config comparison if either term is -1, this will not pass the validation check that the new config has a newer version and term than the current config. We will return this error and then fassert.

To address this, we may want to consider preventing force reconfigs from running concurrently with a node while in drain mode. For non force reconfigs, we should already prevent this since we check canAcceptNonLocalWrites, but we bypass these checks for force reconfigs, since they can run on a secondary.



 Comments   
Comment by Githook User [ 04/May/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-47636 Handle force reconfig running concurrently with reconfig on step up

(cherry picked from commit 4fe9db02bbd3087c53e481367d2c68e117305557)
Branch: v4.4
https://github.com/mongodb/mongo/commit/2c34abb5bbf0c89cb22d0ce262afc7e70b11787c

Comment by Githook User [ 21/Apr/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-47636 Handle force reconfig running concurrently with reconfig on step up
Branch: master
https://github.com/mongodb/mongo/commit/4fe9db02bbd3087c53e481367d2c68e117305557

Comment by William Schultz (Inactive) [ 18/Apr/20 ]

Sounds good to me. Doing this to allow force reconfigs during a long drain mode makes sense.

Comment by Siyuan Zhou [ 18/Apr/20 ]

I think we can just check whether the config term and version changed since this line. It seems less ideal to ban force reconfig or heartbeat reconfig during primary catchup and drain mode since catchup and drain mode may run for a long time.

Comment by William Schultz (Inactive) [ 17/Apr/20 ]

Yes, that option also seems correct, but it seems essentially to be an alternate implementation of the concurrency control mechanisms we already have in place. Namely, the use of the canAcceptWrites flag to guard step up and drain mode. Checking that the config version and term do not change throughout the duration of step up (which you suggest) is a way to check if any concurrent reconfigs ran during this time. I feel that using the existing _canAcceptNonLocalWrites mechanism is more logical and does not introduce a new means of concurrency control to reason about. Also, I don't see any particular benefit of allowing force reconfigs to run concurrently with step up, so it seems simpler and safer to prevent it entirely, rather than allow them to proceed optimistically and abort the step up reconfig if they conflict with it.

Comment by Siyuan Zhou [ 17/Apr/20 ]

Great catch william.schultz! I agree that's one option. What if we always allow force reconfig and check the config version and term is still the same in getNewConfig. If not, return ConfigurationInProgress which will be ignored in stepup. The reconfig on stepup is expected to fail if there's a concurrent reconfig.

Comment by William Schultz (Inactive) [ 17/Apr/20 ]

While we should fix this, I don't think it's particularly urgent since it only occurs when a force reconfig is running concurrently (on the same node) with an election, so it is probably not hit very often. It causes the current primary to crash which isn't great but there are no other safety/correctness violations. Flagging for scheduling so we can triage it as normal.

Comment by William Schultz (Inactive) [ 17/Apr/20 ]

For non force reconfigs, we do already enforce concurrency control between step up and reconfigs by checking _rsConfigState before running for election, and by checking canAcceptNonLocalWrites before running a reconfig. canAcceptNonLocalWrites is only set to true once a step up is fully complete and we only set our config state back to kConfigSteady once a reconfig is fully complete so this should effectively serialize reconfigs with respect to stepup/elections. Force reconfigs are the exception, though. We cannot make the precondition checks for force reconfig behavior exactly the same as non-force because force reconfigs must be allowed to run on a secondary. I think that we can instead require that force reconfigs can only be run against a fully writable primary or against a node in SECONDARY state. This would serve to serialize force reconfigs with respect to step up on a primary, and still allow force reconfigs on secondary. This would be a relatively simple fix i.e. we could just change this condition to be more strict about force reconfigs. The condition for a force reconfig succeeding would become:

_readWriteAbility->canAcceptNonLocalWrites(lk) || _memberState.secondary();

siyuan.zhou Any thoughts?

Generated at Thu Feb 08 05:14:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.