AddShard coordinator may exit without proper cleanup, leaving setFCV and user writes blocked

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • ALL
    • 🟩 Routing and Topology

      Similar to SERVER-117612 (ConvertToCappedCoordinator), SERVER-117613 (CreateCollectionCoordinator), and SERVER-117615 (RefineCollectionShardKeyCoordinator), the AddShardCoordinator has a gap in its mustAlwaysMakeProgress logic that can allow the coordinator to exit without cleaning up.

      The current implementation enables mustAlwaysMakeProgress from the kPrepareNewShard phase onward and relies on triggerCleanup in onError to force the cleanup to be repeated in the kCheckShardPreconditions phase. If an error occurs after setFCV and user writes have been blocked, and the cleanup then fails before the abort reason is persisted, the coordinator gives up without cleaning up.
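
      The gap described above can be illustrated with a toy model. This is only a sketch, not the server code: CoordinatorModel and exitsWithWritesBlocked are hypothetical names. It also makes three assumptions: the coordinator is still in kCheckShardPreconditions when the error occurs, a persisted abort reason is what forces the cleanup to be retried, and a retried cleanup eventually succeeds.

```cpp
#include <cassert>

// Toy model only. Phase names come from the ticket; everything else is assumed.
enum class Phase { kCheckShardPreconditions, kPrepareNewShard };

struct CoordinatorModel {
    Phase phase = Phase::kCheckShardPreconditions;
    bool writesBlocked = false;         // setFCV and user writes are blocked
    bool abortReasonPersisted = false;  // durable marker that cleanup is pending

    // Per the ticket, progress is only guaranteed from kPrepareNewShard onward.
    // Treating a persisted abort reason as also forcing a retry is an assumption.
    bool mustAlwaysMakeProgress() const {
        return phase >= Phase::kPrepareNewShard || abortReasonPersisted;
    }
};

// Returns true if the modeled coordinator exits while writes are still blocked.
bool exitsWithWritesBlocked(bool cleanupFailsBeforePersist) {
    CoordinatorModel c;
    c.writesBlocked = true;  // blocking already happened before the error

    // An error is thrown here; onError triggers the first cleanup attempt.
    if (!cleanupFailsBeforePersist) {
        c.abortReasonPersisted = true;  // the attempt got far enough to persist
    }

    // Cleanup is retried only while the coordinator must make progress.
    while (c.writesBlocked && c.mustAlwaysMakeProgress()) {
        c.writesBlocked = false;  // assumed: a retried cleanup eventually succeeds
    }
    return c.writesBlocked;
}

// Small self-check of the two scenarios.
inline void checkGap() {
    assert(exitsWithWritesBlocked(true));    // gap: nothing forces a retry
    assert(!exitsWithWritesBlocked(false));  // abort reason persisted: cleanup retried
}
```

      In this model the only path that leaves writes blocked is the one where the first cleanup attempt fails before anything durable records that cleanup is pending, which matches the scenario reported here.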

            Assignee:
            Unassigned
            Reporter:
            Allison Easton
            Votes:
            0
            Watchers:
            1