Investigate if RefineCollectionShardKeyCoordinator may exit without proper cleanup leaving migrations frozen

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • CAR Team 2026-02-16
    • 🟥 DDL

      Similar to SERVER-117612 (ConvertToCappedCoordinator) and SERVER-117613 (CreateCollectionCoordinator), RefineCollectionShardKeyCoordinator may have a gap in its _mustAlwaysMakeProgress() logic that could allow the coordinator to exit without resuming migrations.

      Suspected Issue:

      The current implementation uses > instead of >=:

      bool _mustAlwaysMakeProgress() override {
          return _doc.getPhase() > Phase::kRemoteIndexValidation;
      }

      Migrations are stopped at kRemoteIndexValidation. However, because the check uses > (greater than) instead of >= (greater than or equal), _mustAlwaysMakeProgress() returns false during the exact phase where migrations are frozen.
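
The boundary behaviour described above can be illustrated with a minimal, standalone sketch. It is not the coordinator's actual code: the `Phase` enum below is a simplified, hypothetical stand-in (only `kRemoteIndexValidation` comes from this ticket; `kUnset` and `kLaterPhase` are placeholders), and both free functions are illustrative names. It assumes phases are ordered by their underlying integer value.

```cpp
// Hypothetical, simplified model of the phase check; NOT the real MongoDB code.
// Only kRemoteIndexValidation is taken from the ticket; the other enumerators
// are placeholders for "some earlier phase" and "some later phase".
enum class Phase { kUnset = 0, kRemoteIndexValidation = 1, kLaterPhase = 2 };

// Mirrors the comparison quoted in the ticket: strict greater-than.
constexpr bool mustAlwaysMakeProgressStrict(Phase current) {
    return current > Phase::kRemoteIndexValidation;
}

// Candidate alternative under investigation: greater-than-or-equal.
constexpr bool mustAlwaysMakeProgressInclusive(Phase current) {
    return current >= Phase::kRemoteIndexValidation;
}

// With the strict check, the phase in which migrations are stopped is not covered.
static_assert(!mustAlwaysMakeProgressStrict(Phase::kRemoteIndexValidation),
              "strict check excludes kRemoteIndexValidation");

// With the inclusive check, that same phase is covered.
static_assert(mustAlwaysMakeProgressInclusive(Phase::kRemoteIndexValidation),
              "inclusive check includes kRemoteIndexValidation");

// Both variants agree on phases strictly before and strictly after.
static_assert(!mustAlwaysMakeProgressStrict(Phase::kUnset) &&
              !mustAlwaysMakeProgressInclusive(Phase::kUnset),
              "neither check covers earlier phases");
static_assert(mustAlwaysMakeProgressStrict(Phase::kLaterPhase) &&
              mustAlwaysMakeProgressInclusive(Phase::kLaterPhase),
              "both checks cover later phases");
```

The two variants differ only at `kRemoteIndexValidation` itself. Whether switching to `>=` is the right fix still needs investigation, since it would also require progress for the part of that phase that runs before stopMigrations() is called.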

      If a non-retriable error occurs during kRemoteIndexValidation after stopMigrations() has been called, and cleanup fails before resumeMigrations() executes, the coordinator may exit and leave migrations frozen.

            Assignee:
            Marcos José Grillo Ramirez
            Reporter:
            Meryama Nadim