Circular dependency prevents active query checking during transitionToDedicatedConfigServer orphan cleanup


    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing

      When implementing SERVER-103990, we identified a need to check for, and wait on, ongoing queries before completing orphan cleanup for the most recent range deletion task that is neither processing nor pending, so that the behavior matches range deletion for a single task. The ideal approach would be to use CollectionShardingRuntime to verify that no active queries on primary or secondary nodes are still using older chunk metadata before proceeding.
      However, attempting to use CollectionShardingRuntime in topology_change_helpers.cpp or sharding_catalog_manager_shard_operations.cpp introduces a circular dependency that causes symbol_checker failures in EVG.

      CollectionShardingRuntime is compiled as part of the sharding_coord_d library.
      sharding_coord_d already depends on sharding_catalog_manager.
      Adding sharding_coord_d as a dependency of sharding_catalog_manager (to access CollectionShardingRuntime) creates a cycle:

        
      sharding_catalog_manager -> sharding_coord_d -> sharding_catalog_manager
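
      The cycle described above can be reproduced with a small depth-first search over a dependency map. This is only an illustrative sketch: the two-entry graphs below are a simplification of the real build targets (which have many more dependencies), and find_cycle is a hypothetical helper, not part of symbol_checker or the MongoDB build system.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of targets, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Simplified view of the existing edge: sharding_coord_d -> sharding_catalog_manager.
current = {
    "sharding_catalog_manager": [],
    "sharding_coord_d": ["sharding_catalog_manager"],
}

# Same graph after adding sharding_coord_d as a dependency of sharding_catalog_manager.
proposed = {
    "sharding_catalog_manager": ["sharding_coord_d"],
    "sharding_coord_d": ["sharding_catalog_manager"],
}

print(find_cycle(current))                  # None
print(" -> ".join(find_cycle(proposed)))    # sharding_catalog_manager -> sharding_coord_d -> sharding_catalog_manager
```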
      

      The transitionToDedicatedConfigServer logic is currently split between:
      remove_shard_commit_coordinator.cpp
      sharding_catalog_manager_shard_operations.cpp
      This split makes it difficult to refactor the code cleanly to avoid the circular dependency without significant restructuring. Although the old code path will be removed from sharding_catalog_manager_shard_operations.cpp once the feature flag for SPM-4017 is removed, we still have an issue with shardDrainingStatus.

      As a result, instead of actively checking for ongoing queries, the implementation currently relies solely on waiting for orphanCleanupDelaySecs to elapse before allowing orphan cleanup to proceed. This provides a time-based guarantee that queries using older chunk metadata will either have completed or will fail with QueryPlanKilled, in which case the user can retry them.
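
      The time-based check can be sketched as a simple deadline computation. This is a minimal illustration under stated assumptions, not the server's implementation: remaining_cleanup_delay is a hypothetical helper, the delay is assumed to be measured from a per-task timestamp (the ticket does not say which timestamp is used), and 900 is just an example value for orphanCleanupDelaySecs.

```python
from datetime import datetime, timedelta, timezone


def remaining_cleanup_delay(task_timestamp, orphan_cleanup_delay_secs, now):
    """Return how much longer to wait before orphan cleanup may proceed.

    task_timestamp is an assumed reference point for the most recent
    range deletion task; the real reference point may differ.
    """
    ready_at = task_timestamp + timedelta(seconds=orphan_cleanup_delay_secs)
    # Never return a negative wait once the deadline has passed.
    return max(ready_at - now, timedelta(0))


task_timestamp = datetime(2025, 1, 1, tzinfo=timezone.utc)
delay_secs = 900  # example value only

# 600 seconds after the task timestamp, 300 seconds of delay remain.
early = remaining_cleanup_delay(
    task_timestamp, delay_secs, task_timestamp + timedelta(seconds=600)
)

# 1000 seconds after the task timestamp, the delay has elapsed.
late = remaining_cleanup_delay(
    task_timestamp, delay_secs, task_timestamp + timedelta(seconds=1000)
)

print(early.total_seconds(), late.total_seconds())
```

      Note that a check like this only bounds how long an older query may keep running; unlike the CollectionShardingRuntime approach, it never observes whether such queries actually exist.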

            Assignee:
            Unassigned
            Reporter:
            Abdul Qadeer
            Votes:
            0
            Watchers:
            3

              Created:
              Updated: