Sharded timeseries collMod can abort without releasing participant critical sections

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • ALL
    • 🟥 DDL

      A sharded time-series collMod can leave shard participant critical sections held if it hits a non-retriable error after the kBlockShards phase.

      CollModCoordinator enters the critical section in kBlockShards by sending _shardsvrParticipantBlock with ReadsAndWrites to shards owning chunks. It only releases that critical section later as a side effect of _shardsvrCollModParticipant in kUpdateShards, when the participant sees needsUnblock=true and calls ShardingRecoveryService::releaseRecoverableCriticalSection().

      There is no dedicated ReleaseCritSec phase and no collMod-specific _cleanupOnAbort() path. If a non-retriable error occurs after kBlockShards but before all participant _shardsvrCollModParticipant commands successfully release their critical sections, the coordinator can abort. The existing non-retriable error handling resumes migrations, but it does not send an unblock command or otherwise release the recoverable critical sections.

      Result: CRUD operations may remain blocked on the affected shards even though the collMod coordinator has aborted.
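
      The failure mode described above can be illustrated with a small toy model. This is only a sketch and not MongoDB server code: the Shard and Coordinator classes, the blocked() helper, the cleanup_on_abort flag, and the sequential ordering of participant commands are all assumptions made for illustration. The model only mirrors the behavior described in this ticket: blocking happens in kBlockShards, the release is a side effect of the participant command in kUpdateShards, and the abort path resumes migrations without releasing anything.

```python
from dataclasses import dataclass


class NonRetriableError(Exception):
    """Stand-in for an error the coordinator cannot retry (hypothetical)."""


@dataclass
class Shard:
    name: str
    critical_section_held: bool = False

    def participant_block(self) -> None:
        # Models _shardsvrParticipantBlock with ReadsAndWrites.
        self.critical_section_held = True

    def coll_mod_participant(self, needs_unblock: bool, fail: bool = False) -> None:
        # Models _shardsvrCollModParticipant: the release is only a side
        # effect of a successful call with needsUnblock=true.
        if fail:
            raise NonRetriableError(f"collMod failed on {self.name}")
        if needs_unblock:
            self.critical_section_held = False


@dataclass
class Coordinator:
    shards: list
    migrations_stopped: bool = False

    def run(self, failing_shard=None, cleanup_on_abort=False) -> str:
        self.migrations_stopped = True
        try:
            for shard in self.shards:  # kBlockShards
                shard.participant_block()
            for shard in self.shards:  # kUpdateShards (sequential: an assumption)
                shard.coll_mod_participant(
                    needs_unblock=True, fail=(shard.name == failing_shard)
                )
        except NonRetriableError:
            # Abort handling as described in the ticket: migrations resume,
            # but no unblock command is sent.
            self.migrations_stopped = False
            if cleanup_on_abort:
                # Hypothetical cleanup step, not present per this ticket.
                for shard in self.shards:
                    shard.critical_section_held = False
            return "aborted"
        self.migrations_stopped = False
        return "completed"


def blocked(shards) -> list:
    """Names of shards whose critical section is still held."""
    return [s.name for s in shards if s.critical_section_held]
```

      In this model, running Coordinator over three shards with failing_shard="shard1" returns "aborted" with migrations resumed, yet blocked() still reports shard1 and shard2. Passing the hypothetical cleanup_on_abort=True leaves no shard blocked, which is the kind of explicit release the current abort path lacks.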

            Assignee:
            Unassigned
            Reporter:
            Pol Pinol
