moveChunk and movePrimary lock contention during critical section


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • ALL
    • CAR Team 2025-07-07
    • 200
    • 3
    • TBD

      A deadlock-like lock contention scenario can occur in the following sequence of events:

      1. The balancer initiates a MoveChunk operation, and the donor enters the critical section.
      2. The MoveChunk donor sends "_recvChunkCommit" to the recipient, attempting to engage the recipient's critical section.
      3. Concurrently, MovePrimary acquires an X lock (the error below shows it holding the database DDL lock in mode X).
      4. _recvChunkCommit times out after ~6 minutes while waiting to acquire the recipient's critical section, likely because it is blocked on lock acquisition.
      5. The MoveChunk donor releases the critical section during cleanup, and the balancer round ends.
      6. MovePrimary proceeds with cloning catalog data and eventually completes, albeit very slowly (~11 minutes).
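
      The ordering in steps 3 to 6 can be reproduced with a toy model. The sketch below is not MongoDB code: `shared_lock` is a hypothetical stand-in for whichever resource `_recvChunkCommit` and MovePrimary contend on (the ticket does not identify it), and the timings are scaled down from minutes to fractions of a second.

```python
import threading
import time

# Hypothetical stand-in for the contended resource; the real lock is not
# identified in this ticket.
shared_lock = threading.Lock()
events = []


def move_primary(hold_seconds):
    # Step 3: MovePrimary takes the lock exclusively.
    with shared_lock:
        events.append("movePrimary: lock acquired")
        time.sleep(hold_seconds)  # Step 6: slow catalog cloning
    events.append("movePrimary: done")


def recv_chunk_commit(timeout_seconds):
    # Step 4: the recipient waits for the lock and eventually gives up.
    if shared_lock.acquire(timeout=timeout_seconds):
        shared_lock.release()
        return True
    events.append("_recvChunkCommit: timed out")
    return False


def run_scenario(hold_seconds=0.3, timeout_seconds=0.1):
    events.clear()
    holder = threading.Thread(target=move_primary, args=(hold_seconds,))
    holder.start()
    # Wait until MovePrimary actually holds the lock before committing.
    while "movePrimary: lock acquired" not in events:
        time.sleep(0.001)
    committed = recv_chunk_commit(timeout_seconds)
    if not committed:
        # Step 5: the donor cleans up and leaves its critical section.
        events.append("moveChunk donor: critical section released")
    holder.join()
    return committed, list(events)
```

      Calling `run_scenario()` with the defaults returns `False` for the commit and an event log in which the migration aborts while MovePrimary is still holding the lock.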

      The prolonged MovePrimary operation blocks other DDL operations, causing lock acquisition failures, as seen in BF-38131 and BF-36950:

      Failed to acquire DDL lock for 'test12_fsmdb0' in mode IX after 300001 ms that is currently locked by '[{ operation: { namespace: \"test12_fsmdb0\", operationType: \"movePrimary\" }, lock mode: X }]' 
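
      The failure above is what a conventional intent-lock scheme would produce: an IX request cannot be granted while another holder has X. The sketch below uses the textbook multi-granularity compatibility matrix; it assumes, without confirmation from this ticket, that the DDL lock follows those semantics, and `can_grant` is a hypothetical helper.

```python
# Textbook multi-granularity lock compatibility: each pair is
# (requested mode, held mode). X is compatible with nothing.
COMPATIBLE = {
    ("IS", "IS"), ("IS", "IX"), ("IS", "S"),
    ("IX", "IS"), ("IX", "IX"),
    ("S", "IS"), ("S", "S"),
}


def can_grant(requested, held_modes):
    """Return True if `requested` is compatible with every held mode."""
    return all((requested, held) in COMPATIBLE for held in held_modes)


# A DDL operation asking for IX while movePrimary holds X must wait.
blocked = not can_grant("IX", ["X"])
```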

       

      Investigate the optimal interaction between MoveChunk and MovePrimary, particularly whether the MovePrimary coordinator should wait for ongoing balancer MoveChunk operations to complete before starting its work.
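
      One possible shape for the "wait for ongoing MoveChunk operations" option is a drain gate. The sketch below is only an illustration of that idea under stated assumptions: `MigrationGate` and all of its methods are hypothetical names, not existing MongoDB classes, and the real coordinator would need a distributed, persisted equivalent.

```python
import threading


class MigrationGate:
    """Hypothetical gate: MovePrimary waits for in-flight migrations to drain."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_migrations = 0
        self._move_primary_pending = False

    def begin_migration(self):
        # New migrations are refused once MovePrimary has announced itself.
        with self._cond:
            if self._move_primary_pending:
                return False
            self._active_migrations += 1
            return True

    def end_migration(self):
        with self._cond:
            self._active_migrations -= 1
            self._cond.notify_all()

    def begin_move_primary(self, timeout=None):
        # Block until no migration is active, or give up after `timeout`.
        with self._cond:
            self._move_primary_pending = True
            drained = self._cond.wait_for(
                lambda: self._active_migrations == 0, timeout=timeout
            )
            if not drained:
                self._move_primary_pending = False
            return drained

    def end_move_primary(self):
        with self._cond:
            self._move_primary_pending = False
```

      With this ordering, MovePrimary would take its X lock only after `begin_move_primary` returns `True`, so a migration already inside its critical section is never caught behind it. The trade-off is that MovePrimary can be delayed by a long-running migration, which is why the timeout is exposed.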

        Attachments:
          1. bf_38131_repro.diff (6 kB, Kruti Shah)

            Assignee:
            Marcos José Grillo Ramirez
            Reporter:
            Kruti Shah
            Votes:
            0
            Watchers:
            8
