[SERVER-46517] Stepdown changes the underlying state of canAcceptWrites() out of RSTL X mode Created: 01/Mar/20  Updated: 29/Oct/23  Resolved: 19/Mar/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.2.6, 4.4.0-rc0, 4.7.0

Type: Bug
Priority: Major - P3
Reporter: Siyuan Zhou
Assignee: Lingzhi Deng
Resolution: Fixed
Votes: 0
Labels: safe-reconfig-related
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4, v4.2
Sprint: Repl 2020-03-23
Participants:

 Description   

An unconditional stepdown (triggered by learning of a higher term, or by the primary relinquishing its role after a failed liveness check) changes _leaderMode to kSteppingDown and then unlocks the replCoord mutex before continuing the stepdown. A concurrent reconfig may acquire the mutex in that window and call _updateMemberStateFromTopologyCoordinator, which sets canAcceptNonLocalWrites to the topology coordinator's canAcceptWrites():

bool TopologyCoordinator::canAcceptWrites() const {
    return _leaderMode == LeaderMode::kMaster;
}

Because _leaderMode has already been changed, the reconfig thread picks up the half-finished work of the stepdown and sets canAcceptNonLocalWrites to false without holding the RSTL in X mode.

The contract is that canAcceptNonLocalWrites must only be updated while the RSTL is held in X mode. That contract is violated here, which trips an invariant.
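The interleaving described above can be replayed deterministically in a toy model. This is only a sketch, not server code: ToyReplCoord, rstlHeldInX, contractViolated and replayRace are hypothetical names, and the RSTL is reduced to a boolean flag.

```cpp
#include <cassert>
#include <mutex>

enum class LeaderMode { kNotLeader, kMaster, kSteppingDown };

// Hypothetical stand-in for the replication coordinator; not the real class.
struct ToyReplCoord {
    std::mutex mutex;                     // stands in for the replCoord mutex
    LeaderMode leaderMode = LeaderMode::kMaster;
    bool rstlHeldInX = false;             // assumption: RSTL X modelled as a flag
    bool canAcceptNonLocalWrites = true;
    bool contractViolated = false;        // the real server would hit an invariant

    bool canAcceptWrites() const {
        return leaderMode == LeaderMode::kMaster;
    }

    // Caller must hold `mutex`. Mirrors the pre-fix behaviour described in the
    // ticket: the write-ability flag is refreshed as a side effect.
    void updateMemberStateFromTopologyCoordinator() {
        const bool next = canAcceptWrites();
        if (next != canAcceptNonLocalWrites) {
            if (!rstlHeldInX) {
                contractViolated = true;  // updated outside RSTL X mode
            }
            canAcceptNonLocalWrites = next;
        }
    }
};

// Replays the two steps in the order that triggers the problem.
bool replayRace() {
    ToyReplCoord coord;
    {
        // Stepdown thread: flips the leader mode, then drops the mutex
        // before it has acquired the RSTL in X mode.
        std::lock_guard<std::mutex> lk(coord.mutex);
        coord.leaderMode = LeaderMode::kSteppingDown;
    }
    {
        // Reconfig thread: takes the mutex in the gap and refreshes state.
        std::lock_guard<std::mutex> lk(coord.mutex);
        coord.updateMemberStateFromTopologyCoordinator();
    }
    return coord.contractViolated;
}
```

In this model replayRace() reports a violation because the flag flips while rstlHeldInX is still false.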

SERVER-45081 works around this by only updating canAcceptNonLocalWrites when the RSTL is held in X mode, so the update is left to the stepdown thread.
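A minimal sketch of the workaround's idea, assuming the RSTL state can be passed in as a boolean. ToyState, maybeUpdateWriteAbility and reconfigSkipsThenStepdownApplies are hypothetical names for illustration and do not come from the SERVER-45081 patch.

```cpp
#include <cassert>

enum class LeaderMode { kNotLeader, kMaster, kSteppingDown };

// Hypothetical, simplified state; not the real coordinator.
struct ToyState {
    LeaderMode leaderMode = LeaderMode::kMaster;
    bool canAcceptNonLocalWrites = true;
};

// Returns true if the flag was actually changed.
bool maybeUpdateWriteAbility(ToyState& s, bool rstlHeldInX) {
    const bool next = (s.leaderMode == LeaderMode::kMaster);
    if (next == s.canAcceptNonLocalWrites) {
        return false;  // nothing to do
    }
    if (!rstlHeldInX) {
        return false;  // leave the update to the thread that holds RSTL X
    }
    s.canAcceptNonLocalWrites = next;
    return true;
}

// Reconfig (no RSTL X) skips the update; stepdown (RSTL X) applies it.
bool reconfigSkipsThenStepdownApplies() {
    ToyState s;
    s.leaderMode = LeaderMode::kSteppingDown;

    const bool reconfigUpdated = maybeUpdateWriteAbility(s, /*rstlHeldInX=*/false);
    const bool stillAccepting = s.canAcceptNonLocalWrites;

    const bool stepdownUpdated = maybeUpdateWriteAbility(s, /*rstlHeldInX=*/true);

    return !reconfigUpdated && stillAccepting && stepdownUpdated &&
        !s.canAcceptNonLocalWrites;
}
```

The guard keeps the contract intact, but the update is still reachable from callers that cannot satisfy it, which is why the ticket looks for a more holistic fix.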

There are two ways to fix the issue holistically:

  1. Move the update of readWriteAbility out of _updateMemberStateFromTopologyCoordinator, so that the update runs only when the read/write ability actually changes.
  2. Don't change _leaderMode to kSteppingDown before acquiring the RSTL. This would require rethinking the concurrency of stepdown.
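Option 1 can be pictured as splitting one function into two, so that only transitions holding the RSTL in X mode touch the write-ability flag. The sketch below is an assumption about the shape of such a split, not the committed change; ToyCoordinator, updateMemberState, updateWriteAbility and simulateOption1 are hypothetical names.

```cpp
#include <cassert>

enum class LeaderMode { kNotLeader, kMaster, kSteppingDown };
enum class MemberState { kPrimary, kSecondary };

// Hypothetical, simplified coordinator; locking is omitted for brevity.
struct ToyCoordinator {
    LeaderMode leaderMode = LeaderMode::kMaster;
    MemberState memberState = MemberState::kPrimary;
    bool canAcceptNonLocalWrites = true;
    int updatesWithoutRstlX = 0;  // counts contract violations in this model

    // After the split: refreshing member state no longer touches write ability.
    void updateMemberState(MemberState fromTopology) {
        memberState = fromTopology;
    }

    // Called explicitly by state transitions that hold the RSTL in X mode.
    void updateWriteAbility(bool rstlHeldInX) {
        const bool next = (leaderMode == LeaderMode::kMaster);
        if (next == canAcceptNonLocalWrites) {
            return;
        }
        if (!rstlHeldInX) {
            ++updatesWithoutRstlX;
        }
        canAcceptNonLocalWrites = next;
    }
};

// Same interleaving as in the ticket, with the responsibilities separated.
bool simulateOption1() {
    ToyCoordinator c;

    // Stepdown starts and flips the leader mode before holding RSTL X.
    c.leaderMode = LeaderMode::kSteppingDown;

    // Concurrent reconfig refreshes member state only.
    c.updateMemberState(MemberState::kPrimary);
    const bool untouchedByReconfig = c.canAcceptNonLocalWrites;

    // Stepdown now holds RSTL X and finishes the transition itself.
    c.leaderMode = LeaderMode::kNotLeader;
    c.updateWriteAbility(/*rstlHeldInX=*/true);
    c.updateMemberState(MemberState::kSecondary);

    return untouchedByReconfig && c.updatesWithoutRstlX == 0 &&
        !c.canAcceptNonLocalWrites;
}
```

With the responsibilities separated, the reconfig path has no way to change the flag, so the contract does not depend on a runtime guard.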

The concurrency rule for _updateMemberStateFromTopologyCoordinator is that whenever topology coordinator state that the function depends on changes, the function must be called within the same lock scope. This issue violates that rule.



 Comments   
Comment by Githook User [ 07/Apr/20 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-46517: Move the update of readWriteAbility out of _updateMemberStateFromTopologyCoordinator

(cherry picked from commit 6d0a10abd1e6f222bc16c59afc28dcfb9613b86f)
Branch: v4.2
https://github.com/mongodb/mongo/commit/8e2737e5b88a0f639b77b51f7b57730f337fea55

Comment by Githook User [ 25/Mar/20 ]

Author:

{'email': 'lingzhi.deng@mongodb.com', 'name': 'Lingzhi Deng', 'username': 'ldennis'}

Message: SERVER-46517: Clean up incrementTopologyVersion and setFollowerModeStrict

(cherry picked from commit 994c78a1a36c006ad659983e2f0a3cba7a6dea41)
Branch: v4.4
https://github.com/mongodb/mongo/commit/17ea43bf2e5e61663f115a3148e948da4028275e

Comment by Githook User [ 25/Mar/20 ]

Author:

{'email': 'lingzhi.deng@mongodb.com', 'name': 'Lingzhi Deng', 'username': 'ldennis'}

Message: SERVER-46517: Move the update of readWriteAbility out of _updateMemberStateFromTopologyCoordinator

(cherry picked from commit 6d0a10abd1e6f222bc16c59afc28dcfb9613b86f)
Branch: v4.4
https://github.com/mongodb/mongo/commit/b973f465871f19815f8a5b60b6aeffb10bd1cb78

Comment by Githook User [ 19/Mar/20 ]

Author:

{'email': 'lingzhi.deng@mongodb.com', 'name': 'Lingzhi Deng', 'username': 'ldennis'}

Message: SERVER-46517: Clean up incrementTopologyVersion and setFollowerModeStrict
Branch: master
https://github.com/mongodb/mongo/commit/994c78a1a36c006ad659983e2f0a3cba7a6dea41

Comment by Githook User [ 19/Mar/20 ]

Author:

{'email': 'lingzhi.deng@mongodb.com', 'name': 'Lingzhi Deng', 'username': 'ldennis'}

Message: SERVER-46517: Move the update of readWriteAbility out of _updateMemberStateFromTopologyCoordinator
Branch: master
https://github.com/mongodb/mongo/commit/6d0a10abd1e6f222bc16c59afc28dcfb9613b86f

Comment by A. Jesse Jiryu Davis [ 11/Mar/20 ]

I agree, Option 1 looks like a general improvement.


Comment by A. Jesse Jiryu Davis [ 11/Mar/20 ]

SERVER-46381 includes a test that triggered this invariant. After Siyuan's fix in SERVER-45081 the test passes.
