- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Catalog and Routing
- Fully Compatible
- ALL
- CAR Team 2024-04-01, CAR Team 2024-04-15
- 9
A DDL lock should not be acquired when the DDL service state is not kPrimaryAndRecovered.
Small explanation
This is because a DDL operation that is interrupted by a step-down, and therefore has to release its DDL lock, must be the next one to acquire that DDL lock after stepping up. Therefore, no other operation may take the DDL lock right after a DDL operation is interrupted.
The following sequence of events leads to CheckMetadataConsistency acquiring the DDL lock in the middle of a Resharding operation:
- Resharding operation starts and acquires the DDL lock.
- CheckMetadataConsistency starts and gets blocked waiting for the DDL lock.
- Stepdown starts.
- Stepdown thread stops all the PrimaryOnlyServices.
- Resharding operation releases its DDL lock because it's a PrimaryOnlyService.
- CheckMetadataConsistency acquires the DDL lock.
- Stepdown thread kills all the interruptible opCtxs. This kills the CheckMetadataConsistency operation, but possibly too late: the operation may already have finished.
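The sequence above can be replayed in a small single-threaded model. This is only an illustrative sketch, not the server code: `ToyDdlLock`, `request`, `release` and `replayStepDownRace` are hypothetical names, and the model assumes a FIFO wait queue whose hand-off does not look at the service state.

```cpp
#include <cassert>
#include <deque>
#include <optional>
#include <string>

enum class State { kPaused, kPrimaryAndRecovered };

// Hypothetical toy model of a DDL lock with a FIFO wait queue.
struct ToyDdlLock {
    State state = State::kPrimaryAndRecovered;
    std::optional<std::string> holder;
    std::deque<std::string> waiters;

    // The state is checked only at the moment the request is made.
    void request(const std::string& who) {
        if (state != State::kPrimaryAndRecovered) return;  // rejected
        if (!holder) {
            holder = who;
        } else {
            waiters.push_back(who);  // blocked behind the current holder
        }
    }

    // Hands the lock to the next waiter WITHOUT re-checking the state.
    void release() {
        assert(holder.has_value());
        holder.reset();
        if (!waiters.empty()) {
            holder = waiters.front();
            waiters.pop_front();
        }
    }
};

// Replays the event sequence and returns who holds the lock afterwards.
std::string replayStepDownRace() {
    ToyDdlLock lock;
    lock.request("Resharding");                // acquires the DDL lock
    lock.request("CheckMetadataConsistency");  // blocks waiting for it
    lock.state = State::kPaused;               // step-down starts
    lock.release();                            // Resharding is stopped and releases
    return lock.holder.value_or("");
}
```

In this model `replayStepDownRace()` ends with CheckMetadataConsistency holding the lock even though the state is no longer `kPrimaryAndRecovered`, which is the situation described in the ticket.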
Suggested solution
We must verify that _state is still kPrimaryAndRecovered after the lock has been acquired.
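A minimal sketch of that re-check, under stated assumptions: `ToyDdlLockManager`, `lockDdl`, `setState` and `waiterGotLockAfterStepDown` are hypothetical names, the state is modeled as a plain atomic, and the DDL lock as a single `std::mutex`. The actual service is structured differently. The point is only that the state is tested again once the (possibly blocking) acquisition returns.

```cpp
#include <atomic>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>

enum class State { kPaused, kPrimaryAndRecovered };

// Hypothetical stand-in for the DDL lock manager.
class ToyDdlLockManager {
public:
    void setState(State s) { _state.store(s); }

    // Returns the held lock, or nullopt if the service is not
    // kPrimaryAndRecovered before OR after the acquisition.
    std::optional<std::unique_lock<std::mutex>> lockDdl() {
        if (_state.load() != State::kPrimaryAndRecovered) return std::nullopt;

        std::unique_lock<std::mutex> lk(_ddlMutex);  // may block for a long time

        // Re-check: a step-down may have happened while we were waiting.
        if (_state.load() != State::kPrimaryAndRecovered) {
            return std::nullopt;  // lk is released by its destructor
        }
        return std::make_optional(std::move(lk));
    }

private:
    std::atomic<State> _state{State::kPrimaryAndRecovered};
    std::mutex _ddlMutex;
};

// A waiter requests the lock while another operation holds it; the state
// changes before the holder releases. Returns whether the waiter got the lock.
bool waiterGotLockAfterStepDown() {
    ToyDdlLockManager mgr;
    auto holder = mgr.lockDdl();  // plays the role of Resharding
    bool got = true;
    std::thread waiter([&] {
        auto lk = mgr.lockDdl();  // plays the role of CheckMetadataConsistency
        got = lk.has_value();
    });
    mgr.setState(State::kPaused);  // step-down starts
    holder.reset();                // interrupted operation releases the lock
    waiter.join();
    return got;
}
```

Because the state is set to `kPaused` before the holder releases, the waiter is rejected either by the first check or by the re-check, regardless of thread timing.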