  1. Core Server
  2. SERVER-90330

Creation of DDL coordinator hangs indefinitely if executed on a secondary node

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.1.0-rc0, 8.0.0-rc7, 7.3.4, 7.0.13
    • Affects Version/s: 5.0.0, 6.0.0, 7.0.0, 8.0.0-rc0, 7.3.0
    • Component/s: None
    • None
    • Catalog and Routing
    • Fully Compatible
    • ALL
    • v8.0, v7.3, v7.0, v6.0, v5.0
    • CAR Team 2024-05-13, CAR Team 2024-05-27
    • 200

      DDL coordinators are created through the ShardingDDLCoordinator::getOrCreate function.

      This function internally calls ShardingDDLCoordinator::waitForRecoveryCompletion to wait for the service to complete recovery and reach a stable state before creating a new coordinator. This prevents a new coordinator from acquiring a DDL lock (each DDL coordinator instance acquires one) before all previously spawned coordinators have been recovered and have acquired their respective DDL locks.

      The waitForRecoveryCompletion function waits until the service reaches _state == kRecovered.
      If this function is called while the node is a secondary, the state will be kPaused and will not become kRecovered until the node is elected primary again, so the caller hangs for as long as the node remains secondary.
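
      The hang can be illustrated with a toy model of the state machine. This is a minimal sketch, not MongoDB code: ToyCoordinatorService, waitForRecovery and waitForRecoveryOrFailIfPaused are hypothetical names, and the timeout exists only so the blocking behaviour can be observed. The fail-fast variant is one possible way to avoid the hang, not necessarily the fix applied for this ticket.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Toy model of the service states named in the ticket. Not MongoDB code.
enum class State { kPaused, kRecovering, kRecovered };

class ToyCoordinatorService {
public:
    // Mirrors the described behaviour: wait until the state is kRecovered.
    // The timeout is an assumption added so the hang is observable;
    // returns false if the state never became kRecovered in time.
    bool waitForRecovery(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        return _cv.wait_for(lk, timeout, [&] { return _state == State::kRecovered; });
    }

    // Hypothetical alternative: only wait while recovery is in progress.
    // If the service is paused (node is secondary), return false at once.
    bool waitForRecoveryOrFailIfPaused(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait_for(lk, timeout, [&] { return _state != State::kRecovering; });
        return _state == State::kRecovered;
    }

    void setState(State s) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _state = s;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    State _state = State::kPaused;  // a secondary node is modelled as paused
};
```

      With the service left in kPaused, waitForRecovery only returns once its timeout expires, while the fail-fast variant returns false immediately and lets the caller report an error.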


      Looking closely at this code, I spotted another issue. Since we do not hold the _state lock, there is no guarantee that the _state of the service stays kRecovered between:

      1. The call to waitForRecoveryCompletion()
      2. And the actual creation of the coordinator

      The _state could change back to kRecovering in between. In fact, after 1. the node could step down (kRecovered -> kPaused) and then step up again (kPaused -> kRecovering) before executing 2.
      This second issue is highly improbable because it requires a full stepdown and stepup cycle to complete within a few milliseconds.
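
      The check-then-act window can be sketched as follows. This is an illustrative model under stated assumptions, not the actual service code: ToyService, Coordinator and createIfRecovered are hypothetical names, and re-checking the state under the same lock that guards creation is just one possible way to close the window.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Toy model of the service states named in the ticket. Not MongoDB code.
enum class SvcState { kPaused, kRecovering, kRecovered };

struct Coordinator {
    std::string name;
};

class ToyService {
public:
    void stepDown() { set(SvcState::kPaused); }            // kRecovered -> kPaused
    void stepUp() { set(SvcState::kRecovering); }          // kPaused -> kRecovering
    void finishRecovery() { set(SvcState::kRecovered); }   // kRecovering -> kRecovered

    // Step 1 in the ticket: observe the state, then release the lock.
    // The answer can be stale as soon as this function returns.
    bool isRecovered() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _state == SvcState::kRecovered;
    }

    // Hypothetical fix: check the state and create the coordinator while
    // holding the same lock, so no transition can slip in between.
    // Returns nullptr if the service is not in kRecovered.
    std::unique_ptr<Coordinator> createIfRecovered(std::string name) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_state != SvcState::kRecovered) {
            return nullptr;
        }
        return std::make_unique<Coordinator>(Coordinator{std::move(name)});
    }

private:
    void set(SvcState s) {
        std::lock_guard<std::mutex> lk(_mutex);
        _state = s;
    }

    std::mutex _mutex;
    SvcState _state = SvcState::kPaused;
};
```

      If a stepDown() and stepUp() pair runs after isRecovered() returned true, a later unchecked creation would proceed in kRecovering, whereas createIfRecovered returns nullptr and the caller can wait again or fail.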

            Assignee:
            tommaso.tocci@mongodb.com Tommaso Tocci
            Reporter:
            tommaso.tocci@mongodb.com Tommaso Tocci