Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.3.0-rc0
Affects Version/s: 6.0.0, 7.0.0, 8.0.0, 8.1.0
Component/s: Sharding
Labels: None
Assigned Teams: Catalog and Routing
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: CAR Team 2025-07-07, CAR Team 2025-07-21
Linked BF Score: 200
Confidence Status: None
Work Order: 3
Size Category: TBD
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

A deadlock scenario can occur in the following sequence of events:

1. The balancer initiates a MoveChunk operation, and the donor enters the critical section.
2. The MoveChunk donor sends "_recvChunkCommit" to the recipient, attempting to engage the recipient's critical section.
3. Concurrently, MovePrimary acquires an X lock.
4. _recvChunkCommit times out after ~6 minutes while waiting to acquire the recipient's critical section, likely due to lock acquisition.
5. The MoveChunk donor releases the critical section during cleanup, and the balancer round ends.
6. MovePrimary proceeds with cloning catalog data and eventually completes, albeit very slowly (~11 minutes).
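
The lock-wait timeout in the middle of this sequence can be illustrated with a toy model. The sketch below is not server code: a single `threading.Lock` stands in for whatever resource both operations contend on, and `recipient_resource`, `recv_chunk_commit`, and `simulate` are hypothetical names chosen for illustration. It only shows a commit attempt giving up because an exclusive holder does not release in time; it does not model the critical sections or the slow clone.

```python
import threading

# Toy model only: one lock stands in for the contended resource.
# None of these names correspond to real MongoDB server APIs.
recipient_resource = threading.Lock()


def recv_chunk_commit(timeout_s, result):
    """Hypothetical stand-in for _recvChunkCommit: give up after timeout_s."""
    acquired = recipient_resource.acquire(timeout=timeout_s)
    result["committed"] = acquired
    if acquired:
        recipient_resource.release()


def simulate(commit_timeout_s=0.05):
    """Return True if the commit attempt got the resource, False if it timed out."""
    result = {}
    # The movePrimary stand-in takes the exclusive lock first and keeps holding it.
    recipient_resource.acquire()
    try:
        worker = threading.Thread(
            target=recv_chunk_commit, args=(commit_timeout_s, result)
        )
        worker.start()
        # The lock is still held while the worker waits, so its attempt times out.
        worker.join()
    finally:
        recipient_resource.release()
    return result["committed"]


if __name__ == "__main__":
    print("commit succeeded:", simulate())
```

In the ticket the corresponding wait is about 6 minutes rather than 50 milliseconds, which is why the failure surfaces as a stalled balancer round instead of an immediate error.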

The prolonged MovePrimary operation blocks other DDL operations, causing lock acquisition failures, as seen in BF-38131 and BF-36950:

Failed to acquire DDL lock for 'test12_fsmdb0' in mode IX after 300001 ms that is currently locked by '[{ operation: { namespace: \"test12_fsmdb0\", operationType: \"movePrimary\" }, lock mode: X }]'

Investigate the optimal interaction between MoveChunk and MovePrimary, particularly whether the MovePrimary coordinator should wait for ongoing balancer MoveChunk operations to complete before starting its work.
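
One shape the "wait for ongoing MoveChunk operations" idea could take is a gate that tracks in-flight migrations and lets a coordinator block until none remain. The sketch below is purely illustrative and makes no claim about how the fix was implemented: `MigrationGate` and its methods are hypothetical names, and a real coordinator would also have to prevent new migrations from starting once it stops waiting.

```python
import threading


class MigrationGate:
    """Hypothetical gate: counts in-flight migrations and lets a waiter
    block until the count drops to zero."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0

    def migration_started(self):
        with self._cond:
            self._active += 1

    def migration_finished(self):
        with self._cond:
            self._active -= 1
            if self._active == 0:
                # Wake any coordinator waiting for the system to go idle.
                self._cond.notify_all()

    def wait_until_idle(self, timeout_s):
        """Return True if no migration is active within timeout_s, else False."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._active == 0, timeout=timeout_s
            )


if __name__ == "__main__":
    gate = MigrationGate()
    gate.migration_started()
    # Simulate the migration finishing shortly after the coordinator starts waiting.
    threading.Timer(0.05, gate.migration_finished).start()
    print("safe to take X lock:", gate.wait_until_idle(timeout_s=5.0))
```

Bounding the wait with a timeout keeps the coordinator from trading one indefinite stall for another; what it should do when the wait expires is part of the open question in this ticket.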

bf_38131_repro.diff
6 kB
Jun 23 2025 06:54:39 PM UTC

causes

SERVER-107734 BSONObjBuilder could be accessed after being destroyed by obj() call on movePrimary

Closed

is caused by

SERVER-44028 Remove global X lock for Cloner

Closed

is duplicated by

SERVER-107476 Prevent movePrimary from bubbling up WriteConflict exceptions

Closed

SERVER-107734 BSONObjBuilder could be accessed after being destroyed by obj() call on movePrimary

Closed

related to

SERVER-107476 Prevent movePrimary from bubbling up WriteConflict exceptions

Closed

Assignee: Marcos José Grillo Ramirez
Reporter: Kruti Shah
Participants: Githook User, Kruti Shah, Marcos José Grillo Ramirez, Mothra Jira Bot
Votes: 0
Watchers: 10

Created: Jun 23 2025 06:50:32 PM UTC
Updated: Aug 07 2025 10:09:22 AM UTC
Resolved: Aug 07 2025 10:09:22 AM UTC
Confidence Status Last Update: 25/Jun/25 10:43 AM
