Ungated Resharding Registry's resyncFromDisk() can cause unavailability

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • Fully Compatible
    • ALL
    • ClusterScalability 8Jun-22June, ClusterScalability 22Jun-6Jul

      LocalReshardingOperationsRegistry relies on the onConsistentDataAvailable callback from ReplicaSetAwareService to resync the state of the latest resharding operation(s) from disk after a rollback or initial sync. This call is currently not gated behind featureFlagReshardingRegistry, which can lead to the following sequence of events:

      1. A node on the v9.0 binary (FCV < v9.0) rolls back or restarts while a resharding operation is in progress.
      2. The registry populates the state of the operation from disk.
      3. The resharding operation completes, but the ReshardingOpObserver doesn't clean up the operation from the registry, because that cleanup is gated behind the feature flag, which stays off until setFCV for 9.0 is called.
      4. setFCV upgrades the FCV, and featureFlagReshardingRegistry is enabled.
      5. Every subsequent reshardCollection on the same nss fails with PrimarySteppedDown because of the orphaned entry in the registry, since the current contract allows only one active resharding operation.
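
The sequence above can be sketched with a toy model. This is not MongoDB code: `ToyShard`, its methods, and the namespace `db.coll` are hypothetical names chosen for illustration, and the assumption that the registry is only consulted once the flag is on is inferred from step 5 rather than stated in the ticket.

```python
class ToyShard:
    """Toy model of one node; every name here is hypothetical, not MongoDB code."""

    def __init__(self):
        self.flag_enabled = False   # stands in for featureFlagReshardingRegistry
        self.on_disk_ops = set()    # namespaces with a resharding op persisted on disk
        self.registry = set()       # in-memory registry entries, keyed by namespace

    def resync_from_disk(self):
        # The reported problem: this runs whether or not the flag is enabled.
        self.registry = set(self.on_disk_ops)

    def complete_resharding(self, nss):
        self.on_disk_ops.discard(nss)
        # Cleanup is gated, so it is skipped while the flag is off.
        if self.flag_enabled:
            self.registry.discard(nss)

    def set_fcv_to_9_0(self):
        # Upgrading the FCV turns the feature flag on.
        self.flag_enabled = True

    def reshard_collection(self, nss):
        # Assumption: the registry is only consulted once the flag is on.
        if self.flag_enabled and nss in self.registry:
            return "PrimarySteppedDown"
        self.on_disk_ops.add(nss)
        if self.flag_enabled:
            self.registry.add(nss)
        return "OK"


def reproduce():
    shard = ToyShard()
    shard.on_disk_ops.add("db.coll")        # step 1: op in progress at rollback/restart
    shard.resync_from_disk()                # step 2: ungated resync populates registry
    shard.complete_resharding("db.coll")    # step 3: gated cleanup is skipped
    shard.set_fcv_to_9_0()                  # step 4: flag becomes enabled
    return shard.reshard_collection("db.coll"), sorted(shard.registry)  # step 5
```

In this model `reproduce()` reports the rejection together with the leftover registry entry: the entry was created by an ungated path and can only be removed by a gated one.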

      Additionally, createIndex and dropIndex operations on the same nss are rejected with ReshardCollectionInProgress.

      The workaround is to clear the registry's in-memory state with a stepdown.
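
A minimal sketch of why the stepdown helps, again as a toy model with hypothetical names (`ToyRegistry`, `on_step_down`, `blocks`). The ticket does not describe the actual fix; the gated `resync_from_disk` shown here is only an assumption suggested by the issue title.

```python
class ToyRegistry:
    """Hypothetical in-memory registry keyed by namespace; not MongoDB code."""

    def __init__(self):
        self.entries = set()

    def resync_from_disk(self, on_disk_ops, flag_enabled):
        # Assumed gate: skip the resync while the flag is off, so no entry
        # is created that the gated cleanup would later fail to remove.
        if not flag_enabled:
            return
        self.entries = set(on_disk_ops)

    def on_step_down(self):
        # Workaround from the ticket: a stepdown clears the in-memory state.
        self.entries.clear()

    def blocks(self, nss):
        # True while an entry for this namespace is still registered.
        return nss in self.entries


def demo():
    orphaned = ToyRegistry()
    orphaned.entries.add("db.coll")     # leftover entry from the ungated resync
    before = orphaned.blocks("db.coll")
    orphaned.on_step_down()
    after = orphaned.blocks("db.coll")

    gated = ToyRegistry()
    gated.resync_from_disk({"db.coll"}, flag_enabled=False)
    return before, after, gated.blocks("db.coll")
```

Here `demo()` shows the namespace blocked before the stepdown and unblocked after it, and shows that a gated resync with the flag off never registers the entry in the first place.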

            Assignee:
            Abdul Qadeer
            Reporter:
            Abdul Qadeer
            Votes:
            0
            Watchers:
            4
