  Core Server / SERVER-91247

Ensure that DDLCoordinator creation does not survive node stepDown-stepUp

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 8.1.0-rc0, 8.0.0-rc8, 7.3.4, 7.0.13
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • Fully Compatible
    • v8.0, v7.3, v7.0, v6.0, v5.0
    • CAR Team 2024-06-10
    • 200

      When we create a DDLCoordinator, a lambda is attached to the getConstructionCompletionFuture here.

      Since there is no way to chain the lambda onto the executor that fulfills the promise of the getConstructionCompletionFuture, the getInstanceCleanupExecutor() is used as the executor instead.

      However, with this the creation of a DDLCoordinator can survive a stepDown-stepUp cycle, since the cleanup executor is never shut down.

      If the lambda runs after _status is set to Recovering in ShardingDDLCoordinatorService::_onServiceInitialization(), but before the coordinators to recover are loaded in the async task created by ShardingDDLCoordinatorService::_rebuildService, then _numCoordinatorsToWait is still 0 (as set in ShardingDDLCoordinatorService::_onServiceTermination()) and the invariant fails.

      The fix idea is to use the same executor that is provided by repl::PrimaryOnlyService, as that executor is interrupted on every onStepDown and joined and recreated on every onStepUp.

      Side note: the same issue occurs with the completion future as well here.
      Besides fixing the executor here, in ShardingDDLCoordinatorService::_onServiceTermination() we also have to clear _numActiveCoordinatorsPerType and call _recoveredOrCoordinatorCompletedCV.notify_all().

            Assignee: Wolfee Farkas (adam.farkas@mongodb.com)
            Reporter: Wolfee Farkas (adam.farkas@mongodb.com)
            Votes: 0
            Watchers: 2
