Upon stepping down, DDL coordinators currently execute the onCompletion logic, which includes releasing the distributed locks.
This is not safe, as the following interleaving may happen:
- Node0 is primary and starts a DDL operation
- Node0 steps down, the DDL coordinator enters the onCompletion
- Node1 steps up, the coordinator is resumed and dist locks are reacquired
- Node0 is still executing the onCompletion and releases the distributed locks
- Node1 is running a DDL operation without holding distributed locks
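
The interleaving above can be reproduced with a toy model. This is only a sketch, not server code: `ToyDistLockManager`, `ToyCoordinator`, and the `"db.coll"` resource name are hypothetical. It assumes locks are keyed by resource name only, so a late release from the old primary removes the lock the new primary relies on.

```python
class ToyDistLockManager:
    """Hypothetical stand-in for a distributed lock service keyed by resource name."""

    def __init__(self):
        self._held = set()

    def acquire(self, resource):
        # Assumption: reacquisition by the resumed operation is idempotent.
        self._held.add(resource)

    def release(self, resource):
        self._held.discard(resource)

    def is_held(self, resource):
        return resource in self._held


class ToyCoordinator:
    """Hypothetical coordinator instance running on one node."""

    def __init__(self, node, locks, resource):
        self.node = node
        self.locks = locks
        self.resource = resource

    def start_or_resume(self):
        self.locks.acquire(self.resource)

    def on_completion(self):
        # Behaviour described in this ticket: always release, even on step-down.
        self.locks.release(self.resource)


def run_interleaving():
    locks = ToyDistLockManager()
    node0 = ToyCoordinator("Node0", locks, "db.coll")
    node1 = ToyCoordinator("Node1", locks, "db.coll")

    node0.start_or_resume()  # Node0 is primary and starts the DDL operation
    # Node0 steps down here; its onCompletion has not finished yet
    node1.start_or_resume()  # Node1 steps up, resumes, and reacquires the lock
    node0.on_completion()    # Node0's late onCompletion releases the lock
    return locks.is_held("db.coll")


print(run_interleaving())
```

In this model `run_interleaving()` reports that the lock is no longer held, even though Node1 is still running the operation.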
Different approaches may be evaluated for solving the problem, for example:
- Custom logic for coordinators so that they do not release distributed locks while the coordinator document still exists (may require modifying the ScopedDistLock destructor)
- Reason in a more general way: should a node that is stepping down ever release distributed locks, given that all operations using a dist lock are going to be resumed by the new primary?
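
Both approaches above amount to guarding the release. A minimal sketch, assuming hypothetical names throughout (`ToyLocks`, `release_on_completion`, an `op_id` key, and a set standing in for the persisted coordinator documents); it does not reflect the actual ScopedDistLock or coordinator interfaces:

```python
class ToyLocks:
    """Hypothetical lock service keyed by resource name."""

    def __init__(self):
        self._held = set()

    def acquire(self, resource):
        self._held.add(resource)

    def release(self, resource):
        self._held.discard(resource)

    def is_held(self, resource):
        return resource in self._held


def release_on_completion(locks, resource, op_id, coordinator_docs, stepping_down):
    """Release the lock only if the operation is really finished.

    coordinator_docs: set of op ids whose coordinator document still exists
                      (assumption: the document is removed only on true completion).
    stepping_down:    True if onCompletion runs because of a step-down.
    Returns True if the lock was released, False if the release was skipped.
    """
    if stepping_down:
        # General rule: the new primary will resume the operation.
        return False
    if op_id in coordinator_docs:
        # Coordinator-specific rule: the document still exists, so the
        # operation is still logically in progress.
        return False
    locks.release(resource)
    return True


locks = ToyLocks()
locks.acquire("db.coll")
docs = {"op-1"}

# Step-down with the document still present: the release is skipped.
print(release_on_completion(locks, "db.coll", "op-1", docs, stepping_down=True))

# Real completion: document removed, node still primary, lock released.
docs.discard("op-1")
print(release_on_completion(locks, "db.coll", "op-1", docs, stepping_down=False))
```

The two checks are independent, so either approach can be evaluated on its own by dropping the other condition.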