Type: Bug
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Catalog and Routing
ALL
CAR Team 2025-07-07
200
3
TBD
A deadlock scenario can occur in the following sequence of events:
- The balancer initiates a MoveChunk operation, and the donor enters the critical section.
- The MoveChunk donor sends "_recvChunkCommit" to the recipient, attempting to engage the recipient's critical section.
- Concurrently, MovePrimary acquires an X lock.
- _recvChunkCommit times out after ~6 minutes while waiting to acquire the recipient's critical section, likely because its lock acquisition is blocked behind that X lock.
- The MoveChunk donor releases the critical section during cleanup, and the balancer round ends.
- MovePrimary proceeds with cloning catalog data and eventually completes, albeit very slowly (~11 minutes).
The prolonged MovePrimary operation blocks other DDL operations, causing lock acquisition failures, as seen in BF-38131 and BF-36950:
Failed to acquire DDL lock for 'test12_fsmdb0' in mode IX after 300001 ms that is currently locked by '[{ operation: { namespace: \"test12_fsmdb0\", operationType: \"movePrimary\" }, lock mode: X }]'
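To illustrate why the IX request in the message above cannot be granted while MovePrimary holds X, here is a minimal sketch of the standard multi-granularity lock compatibility rules. The names `COMPATIBLE` and `can_grant` are hypothetical and are not taken from the server code; the real DDL lock manager may differ in detail.

```python
# Toy model of intent/shared/exclusive lock compatibility.
# COMPATIBLE[held] is the set of modes that may be granted while `held` is held.
# Hypothetical illustration only, not the server's implementation.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),  # an exclusive holder blocks every other mode
}


def can_grant(requested, held_modes):
    """Return True if `requested` is compatible with every mode currently held."""
    return all(requested in COMPATIBLE[held] for held in held_modes)


# movePrimary holds X on the database, so another DDL operation asking for IX must wait.
blocked = not can_grant("IX", ["X"])
# Two DDL operations that both only need IX can proceed together.
concurrent_ok = can_grant("IX", ["IX"])
```

Under these rules, any DDL operation that needs IX on 'test12_fsmdb0' waits for as long as MovePrimary keeps X, which matches the 300001 ms failure reported above.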
Investigate the optimal interaction between MoveChunk and MovePrimary, particularly whether the MovePrimary coordinator should wait for ongoing balancer MoveChunk operations to complete before starting its work.
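One option to evaluate can be sketched as a toy model in which MovePrimary waits for in-flight migrations to drain before taking its X lock. The class `MigrationRegistry` and its methods are hypothetical, for illustration only; nothing here reflects the actual coordinator or balancer code, and whether this is the right interaction is exactly what this ticket asks to investigate.

```python
import threading


class MigrationRegistry:
    """Hypothetical tracker of in-flight MoveChunk operations."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0

    def begin_migration(self):
        # Called when a MoveChunk starts.
        with self._cond:
            self._active += 1

    def end_migration(self):
        # Called when a MoveChunk finishes, whether it succeeded or failed.
        with self._cond:
            self._active -= 1
            if self._active == 0:
                self._cond.notify_all()

    def wait_until_drained(self, timeout):
        # Returns True once no migration is active, or False if the timeout expires first.
        with self._cond:
            return self._cond.wait_for(lambda: self._active == 0, timeout=timeout)


registry = MigrationRegistry()
registry.begin_migration()
# While a migration is active, MovePrimary would not yet take its X lock.
drained_early = registry.wait_until_drained(timeout=0.01)
registry.end_migration()
# After the migration finishes, MovePrimary could proceed.
drained_late = registry.wait_until_drained(timeout=0.01)
```

This sketch only waits for migrations that have already started. It does not stop a new MoveChunk from beginning after the wait returns, so a real design would also need to decide how to hold off new migrations while MovePrimary runs.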