Core Server / SERVER-46517

Stepdown changes the underlying state of canAcceptWrites() out of RSTL X mode

    • Fully Compatible
    • ALL
    • v4.4, v4.2
    • Repl 2020-03-23

      An unconditional stepdown, triggered either by learning of a higher term or by relinquishing primary due to the liveness check, can change _leaderMode to kSteppingDown and then unlock the replCoord mutex before continuing the stepdown. A concurrent reconfig may then acquire the mutex and call _updateMemberStateFromTopologyCoordinator, which sets canAcceptNonLocalWrites to the topology coordinator's canAcceptWrites():

      bool TopologyCoordinator::canAcceptWrites() const {
          return _leaderMode == LeaderMode::kMaster;
      }
      

      Because _leaderMode has already been changed, the reconfig thread picks up the half-finished work of the stepdown and sets canAcceptNonLocalWrites to false without holding the RSTL in X mode.
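
      The interleaving described above can be replayed deterministically on a single thread. The following is only a minimal sketch under stated assumptions: ToyCoordinator, rstlHeldInX, contractViolations and replayRace are hypothetical names invented for illustration, the LeaderMode enum is trimmed to three values, and a counter stands in for the invariant that the real server would fail.

```cpp
#include <cassert>

// Hypothetical, heavily simplified model; these are not the server's real classes.
enum class LeaderMode { kNotLeader, kMaster, kSteppingDown };

struct ToyCoordinator {
    LeaderMode leaderMode = LeaderMode::kMaster;  // stands in for _leaderMode
    bool canAcceptNonLocalWrites = true;          // cached write ability
    bool rstlHeldInX = false;                     // does the calling thread hold RSTL in X mode?
    int contractViolations = 0;                   // updates performed outside RSTL X

    // Same logic as TopologyCoordinator::canAcceptWrites() quoted in the ticket.
    bool canAcceptWrites() const {
        return leaderMode == LeaderMode::kMaster;
    }

    // Stand-in for _updateMemberStateFromTopologyCoordinator.
    void updateMemberState() {
        const bool newValue = canAcceptWrites();
        if (newValue != canAcceptNonLocalWrites) {
            if (!rstlHeldInX) {
                ++contractViolations;  // the real server fails an invariant at this point
            }
            canAcceptNonLocalWrites = newValue;
        }
    }
};

// Replays the problematic interleaving step by step.
inline ToyCoordinator replayRace() {
    ToyCoordinator c;
    assert(c.canAcceptNonLocalWrites);  // the node starts as a writable primary

    // Stepdown thread: under the mutex, flips the mode, then releases the mutex.
    c.leaderMode = LeaderMode::kSteppingDown;

    // Reconfig thread: takes the mutex next, without holding RSTL in X mode.
    c.rstlHeldInX = false;
    c.updateMemberState();
    return c;
}
```

      In this model, replayRace() ends with canAcceptNonLocalWrites set to false and one recorded contract violation, which corresponds to the reconfig thread finishing the stepdown's work outside RSTL X.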

      The contract is that canAcceptNonLocalWrites must only be updated while the RSTL is held in X mode. That contract is violated here, which fails an invariant.

      SERVER-45081 works around this by updating canAcceptNonLocalWrites only when the RSTL is held in X mode, which leaves the update to the stepdown thread.
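
      The shape of that workaround might look roughly like the sketch below. It is an illustration only, not the actual SERVER-45081 patch: GuardedCoordinator, the callerHoldsRstlX parameter and workaroundDemo are hypothetical, and the real code determines lock ownership differently from a boolean argument.

```cpp
#include <cassert>

// Hypothetical simplified model; names are illustrative, not the server's API.
enum class Mode { kNotLeader, kMaster, kSteppingDown };

struct GuardedCoordinator {
    Mode leaderMode = Mode::kMaster;
    bool canAcceptNonLocalWrites = true;

    bool canAcceptWrites() const {
        return leaderMode == Mode::kMaster;
    }

    // Refreshes the cached flag only if the caller holds RSTL in X mode.
    // Returns true when the flag was refreshed, false when the update was skipped.
    bool updateMemberState(bool callerHoldsRstlX) {
        if (!callerHoldsRstlX) {
            return false;  // leave the update to the thread that holds RSTL X
        }
        canAcceptNonLocalWrites = canAcceptWrites();
        return true;
    }
};

// Usage: the reconfig path skips the update, the stepdown path performs it.
inline void workaroundDemo() {
    GuardedCoordinator c;
    c.leaderMode = Mode::kSteppingDown;  // stepdown has started

    assert(!c.updateMemberState(false));  // reconfig thread: no RSTL X, update skipped
    assert(c.canAcceptNonLocalWrites);    // flag is still the old value

    assert(c.updateMemberState(true));    // stepdown thread: RSTL X held
    assert(!c.canAcceptNonLocalWrites);   // flag now matches the leader mode
}
```

      Note that in this model the cached flag stays true for a while after the mode has become kSteppingDown, which is why the ticket treats this as a workaround rather than a holistic fix.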

      There are several ways to fix the issue holistically:

      1. Move the update of readWriteAbility out of _updateMemberStateFromTopologyCoordinator, so that it is only updated when it actually changes.
      2. Don't change _leaderMode to kSteppingDown before acquiring the RSTL. This would require rethinking the concurrency of stepdown.

      The concurrency rule for _updateMemberStateFromTopologyCoordinator is that whenever any topology coordinator state that the function depends on changes, the function should be called within the same lock scope. This issue violates that rule.
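
      A compliant transition under that rule can be sketched as follows. This is a simplified illustration with assumed names (ScopedCoordinator, TopoMode, _refreshDerivedState, lockScopeDemo are all hypothetical): a single std::mutex stands in for the replCoord mutex, the RSTL is not modeled at all, and passing the guard by reference is only a way of documenting that the lock is held.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical simplified model; not the server's real classes.
enum class TopoMode { kNotLeader, kMaster };

class ScopedCoordinator {
public:
    // Compliant transition: the state change and the dependent refresh
    // happen inside the same lock scope, so no other thread can observe
    // (or act on) the changed mode before the cached value is refreshed.
    void setLeaderMode(TopoMode mode) {
        std::lock_guard<std::mutex> lk(_mutex);
        _leaderMode = mode;
        _refreshDerivedState(lk);
    }

    bool canAcceptNonLocalWrites() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _canAcceptNonLocalWrites;
    }

    // True when the cached flag agrees with the current leader mode.
    bool isConsistent() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _canAcceptNonLocalWrites == (_leaderMode == TopoMode::kMaster);
    }

private:
    // Stand-in for _updateMemberStateFromTopologyCoordinator; the unused
    // guard parameter documents that the caller must hold the mutex.
    void _refreshDerivedState(const std::lock_guard<std::mutex>&) {
        _canAcceptNonLocalWrites = (_leaderMode == TopoMode::kMaster);
    }

    std::mutex _mutex;
    TopoMode _leaderMode = TopoMode::kMaster;
    bool _canAcceptNonLocalWrites = true;
};

// Usage: after any transition, a thread that takes the mutex sees consistent state.
inline void lockScopeDemo() {
    ScopedCoordinator coord;
    coord.setLeaderMode(TopoMode::kNotLeader);
    assert(coord.isConsistent());
    assert(!coord.canAcceptNonLocalWrites());
}
```

      The bug in this ticket corresponds to splitting setLeaderMode in two: the mode is changed, the mutex is released, and the refresh is left for whichever thread acquires the mutex next.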

            Assignee:
            lingzhi.deng@mongodb.com Lingzhi Deng
            Reporter:
            siyuan.zhou@mongodb.com Siyuan Zhou
            Votes:
            0
            Watchers:
            7
