- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 7.0.0
- Component/s: None
- None
- Catalog and Routing
- Fully Compatible
- ALL
- CAR Team 2024-04-01, CAR Team 2024-04-15
- 27
SERVER-81353 made some changes to the create coordinator (used for shardCollection) that were released in v7.2.0 and that can potentially result in a namespace becoming unavailable when downgrading from v7.2.x to v7.0.
The bug is very unlikely to be hit, since it requires a very specific interleaving during the downgrade.
Scenario (downgrade is happening, multi-version mix of binaries):
- A shardCollection request is received by a v7.2 shard, which spawns a create coordinator
- An error occurs after acquiring the critical section, which results in a call to triggerCleanup that persists the abort reason on the coordinator document
- The shard primary steps down before the coordinator can execute the _cleanupOnAbort procedure introduced by SERVER-81353
- A new shard primary in v7.0 is elected
- The new shard primary resumes the coordinator and executes the default implementation of _cleanupOnAbort, since the create coordinator does not override that method in v7.0
- The coordinator finishes: DDL locks are released but the recoverable critical section remains indefinitely held
Consequences: CRUD operations, and DDL operations other than shardCollection, cannot run on the namespace.
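The interleaving can be illustrated with a toy model. This is a minimal sketch, not MongoDB code: the classes `Cluster`, `CoordinatorDoc`, `CreateCoordinatorV70` and `CreateCoordinatorV72` and the function `simulate` are hypothetical names, and it assumes that the v7.2 `_cleanupOnAbort` override is what releases the critical section, which is inferred from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Cluster:
    """Toy stand-in for the state held on a namespace (hypothetical)."""
    critical_sections: Set[str] = field(default_factory=set)
    ddl_locks: Set[str] = field(default_factory=set)


@dataclass
class CoordinatorDoc:
    """Toy stand-in for the persisted coordinator document (hypothetical)."""
    nss: str
    abort_reason: Optional[str] = None


class CreateCoordinatorV70:
    """Models v7.0: the create coordinator does not override the cleanup."""

    def __init__(self, cluster: Cluster, doc: CoordinatorDoc) -> None:
        self.cluster = cluster
        self.doc = doc

    def _cleanup_on_abort(self) -> None:
        # Default implementation: nothing is undone.
        pass

    def resume(self) -> None:
        # A persisted abort reason sends the coordinator to the cleanup path.
        if self.doc.abort_reason is not None:
            self._cleanup_on_abort()
        # Finishing the coordinator releases the DDL lock only.
        self.cluster.ddl_locks.discard(self.doc.nss)


class CreateCoordinatorV72(CreateCoordinatorV70):
    """Models v7.2: the cleanup is assumed to release the critical section."""

    def _cleanup_on_abort(self) -> None:
        self.cluster.critical_sections.discard(self.doc.nss)

    def start_and_fail(self, reason: str) -> None:
        self.cluster.ddl_locks.add(self.doc.nss)
        self.cluster.critical_sections.add(self.doc.nss)
        # triggerCleanup persists the abort reason on the document.
        self.doc.abort_reason = reason
        # The primary steps down here, before _cleanup_on_abort runs.


def simulate(resume_version: str) -> Cluster:
    """Start on v7.2, fail, then resume on a primary of the given version."""
    cluster = Cluster()
    doc = CoordinatorDoc(nss="test.coll")
    CreateCoordinatorV72(cluster, doc).start_and_fail("simulated error")
    versions = {"7.0": CreateCoordinatorV70, "7.2": CreateCoordinatorV72}
    versions[resume_version](cluster, doc).resume()
    return cluster
```

In this model, `simulate("7.0")` ends with the DDL lock released and the critical section still held, while `simulate("7.2")` leaves neither behind.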
Solution: run the shardCollection command again with the original options. This spawns a new coordinator that reuses the existing critical section and runs again, eventually clearing the state whether it succeeds or fails.
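As a rough sketch of the workaround, the helper below rebuilds the command document for the retry. The function name `build_shard_collection_command` and the example namespace, key and options are hypothetical; the actual values must match the original shardCollection request.

```python
from typing import Any, Dict, Mapping, Optional


def build_shard_collection_command(
    namespace: str,
    key: Mapping[str, Any],
    options: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a shardCollection command document (hypothetical helper)."""
    if "." not in namespace:
        raise ValueError("namespace must have the form '<db>.<collection>'")
    extra = dict(options or {})
    if "shardCollection" in extra or "key" in extra:
        raise ValueError("options must not redefine 'shardCollection' or 'key'")
    # The command name goes first; Python dicts preserve insertion order.
    command: Dict[str, Any] = {"shardCollection": namespace, "key": dict(key)}
    command.update(extra)
    return command


# Example values only: reuse the options of the original request.
retry_command = build_shard_collection_command(
    "test.coll", {"x": 1}, {"unique": False}
)
```

The resulting document could then be sent to the admin database through a driver, for example with PyMongo's `client.admin.command(retry_command)`.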
- is caused by SERVER-81353 Add a clean up procedure to the create collection coordinator (Closed)