  1. Core Server
  2. SERVER-58592

Make ReshardingCoordinatorService more robust when stepdowns happen near the end of a resharding operation.

    • Fully Compatible
    • v5.0
    • Sharding 2021-07-26, Sharding 2021-08-09
    • 120
    • 2

      In the current implementation of the resharding coordinator, when resharding finishes we first remove the on-disk coordinator document and then clean up the in-memory state (i.e., completing or stepping down the metrics). This ordering can cause problems. Consider the case in the BF: a stepdown happens after the coordinator document has been deleted but before the in-memory state has been cleaned up. Because the coordinator document has been deleted, the PrimaryOnlyServiceOpObserver removes this instance from the _activeInstances map in PrimaryOnlyService. After this config server primary (referred to as primary_1 from here on) steps down, a new primary steps up. Since the old document and instance were deleted, the new primary does not resume the same resharding operation and instead waits for the next one. When primary_1 later steps up as primary again, it still holds the uncleaned in-memory state from the original resharding operation, which conflicts with the in-memory state of any new resharding operation.
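      The sequence described above can be illustrated with a small toy model. This is only a sketch in Python, not MongoDB code: ToyConfigNode, StepDown, simulate, and the clean_memory_first flag are hypothetical names invented for illustration, and the "clean first" ordering is shown purely for contrast, not as the fix adopted for this ticket.

```python
class StepDown(Exception):
    """Simulates the node losing primary status in the middle of cleanup."""


class ToyConfigNode:
    """Hypothetical stand-in for one config server node (not MongoDB code)."""

    def __init__(self, name):
        self.name = name
        # In-memory state (e.g. metrics). It survives a stepdown because the
        # process keeps running as a secondary.
        self.in_memory_op = None

    def start_resharding(self, op_id):
        if self.in_memory_op is not None:
            raise RuntimeError(
                f"{self.name}: stale state from {self.in_memory_op} "
                f"conflicts with {op_id}"
            )
        self.in_memory_op = op_id

    def finish_resharding(self, disk, op_id, clean_memory_first):
        if clean_memory_first:
            self.in_memory_op = None  # assumed alternative ordering
        disk.discard(op_id)           # remove the on-disk coordinator document
        raise StepDown()              # stepdown lands right after the delete
        # In the original ordering, the in-memory cleanup would have run here.


def simulate(clean_memory_first):
    disk = set()  # coordinator documents visible to every node
    primary_1 = ToyConfigNode("primary_1")

    disk.add("op_A")
    primary_1.start_resharding("op_A")
    try:
        primary_1.finish_resharding(disk, "op_A", clean_memory_first)
    except StepDown:
        pass

    # A new primary steps up: with no document left, there is nothing to resume.
    resumable = sorted(disk)

    # Later, primary_1 becomes primary again and a new operation starts.
    disk.add("op_B")
    try:
        primary_1.start_resharding("op_B")
        outcome = "ok"
    except RuntimeError:
        outcome = "conflict"
    return resumable, outcome


if __name__ == "__main__":
    print(simulate(clean_memory_first=False))
    print(simulate(clean_memory_first=True))
```

      In this toy model, deleting the document before cleaning the in-memory state leaves primary_1 with leftover state that collides with the next operation, while the reversed order does not. The real coordinator involves considerably more state than a single field, so this only shows why the ordering matters.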

            randolph@mongodb.com Randolph Tan
            kshitij.gupta@mongodb.com Kshitij Gupta