SERVER-71823 added consistency checks on sharding metadata for a cluster, database, or collection. If issued cluster-wide, the algorithm executes serially as follows:
1. for each DB:
2. acquire the DB DDL lock
3. execute the check for that DB
A customer who runs checkMetadataConsistency to detect problems in a cluster could, under some circumstances, end up blocking other DDL operations for a long time, which is unacceptable for a live cluster. Specifically, if a long-running DDL operation that takes a DB lock (such as resharding) is in progress when a checkMetadataConsistency request comes in, step 2 enqueues a database lock request that prevents other DDL operations that also take database locks from executing, in particular:
- renameCollection
- shardCollection
- collMod
- dropCollection
- dropDatabase
- dropIndexes
- reshardCollection
- movePrimary
- createCollection (only affected from v8.0; applies to both implicit and explicit collection creation, but only when done outside of a transaction or retryable write)
SERVER-92182 partially handled this in master by adding a backoff mechanism in step 2. However, this addresses neither the full execution of the operation (step 3) nor older versions. It is currently possible to set maxTimeMS on the command, but since that limits the duration of the entire verification, we want a more fine-grained parameter (specifically for steps 2 and 3).
The purpose of this ticket is to add a new parameter to the command that limits the total amount of time other DDL operations might be blocked (that is, the total duration of steps 2 and 3), and fails the command once that time limit is reached.
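A minimal Python sketch of the intended behavior, for discussion only and not the server implementation. It assumes the budget applies per database and covers both the lock wait (step 2) and the check (step 3). The names check_cluster, check_collection, max_ddl_blocking_secs and DdlBlockingBudgetExceeded are hypothetical, and threading.Lock stands in for the DB DDL lock.

```python
import threading
import time


class DdlBlockingBudgetExceeded(Exception):
    """Raised when steps 2 + 3 for one database exceed the budget (hypothetical error)."""


def check_cluster(collections_by_db, db_locks, check_collection, max_ddl_blocking_secs):
    """Serially check every database, bounding the time each DB DDL lock is waited on or held.

    collections_by_db: dict mapping DB name -> list of collection names
    db_locks:          dict mapping DB name -> threading.Lock (stand-in for the DB DDL lock)
    check_collection:  callback (db, coll) -> list of inconsistencies (hypothetical)
    """
    results = {}
    for db, colls in collections_by_db.items():           # step 1: for each DB
        # The budget starts when the lock request is enqueued, because other
        # DDL operations may already be blocked from that moment on.
        deadline = time.monotonic() + max_ddl_blocking_secs
        lock = db_locks[db]

        # Step 2: acquire the DB DDL lock, waiting no longer than the budget.
        if not lock.acquire(timeout=max_ddl_blocking_secs):
            raise DdlBlockingBudgetExceeded(f"timed out waiting for DDL lock on {db}")
        try:
            # Step 3: run the check, re-testing the deadline between collections.
            db_result = {}
            for coll in colls:
                if time.monotonic() > deadline:
                    raise DdlBlockingBudgetExceeded(f"check on {db} exceeded budget")
                db_result[coll] = check_collection(db, coll)
            results[db] = db_result
        finally:
            lock.release()                                 # always unblock other DDL
    return results
```

In this sketch the deadline is only tested between collections, so a single slow collection check can still overrun the budget; a real implementation would need an interruptible check to enforce the limit strictly.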