Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.2.0, 5.0.4, 5.1.0-rc3
Affects Version/s: None
Component/s: Sharding
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v5.1, v5.0
Steps To Reproduce:

0001-SERVER-59965-repro.patch
Sprint:
Sharding EMEA 2021-09-20, Sharding EMEA 2021-10-04, Sharding EMEA 2021-10-18, Sharding EMEA 2021-11-01
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

As part of a sharded renameCollection, the DDLCoordinator instructs all participant shards to enter their critical sections. Once all shards have entered, the coordinator does some work on the configsvr and finally tells the shards to leave their critical sections.
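
The three phases described above can be sketched as a toy model. This is only an illustration of the ordering, not the server's implementation: `Shard`, `run_rename` and `commit_on_config_server` are hypothetical names, and the model ignores networking, locking and failures.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    """Hypothetical stand-in for a participant shard."""
    name: str
    in_critical_section: bool = False

    def enter_critical_section(self) -> None:
        self.in_critical_section = True

    def exit_critical_section(self) -> None:
        self.in_critical_section = False


def run_rename(shards, commit_on_config_server) -> None:
    """Toy coordinator: enter everywhere, commit, then release everywhere."""
    # Phase 1: every participant enters its critical section.
    for shard in shards:
        shard.enter_critical_section()

    # Phase 2: the configsvr work runs only while all shards hold it.
    assert all(s.in_critical_section for s in shards)
    commit_on_config_server()

    # Phase 3: participants are told to leave their critical sections.
    for shard in shards:
        shard.exit_critical_section()
```

The point of the sketch is that phase 3 cannot start until phase 1 has completed on every shard, so a single shard that cannot enter its critical section keeps all the others inside theirs.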

When running renameCollection concurrently with multi-shard transactions that affect that same collection, there exists a particular interleaving that can lead to a distributed deadlock:
1. shard0 receives the RenameCollectionParticipant command and enters its critical section
2. shard0 attempts to run a statement of the multi-shard txn. Since the critical section is taken, it will throw StaleConfig. This error will be caught on the way out of the command and it will attempt to refresh the shardVersion. However, since the critical section is taken, the refresh will block until the critical section is released.
3. shard1 runs its part of that multi-shard transaction, which will acquire the collection lock in MODE_IX, and then stash the locks.
4. shard1 receives the RenameCollectionParticipant command and attempts to enter the critical section. However, since the transaction at point 3 has stashed the collection lock, shard1 is not able to acquire the collection lock in MODE_S needed to enter the critical section.

At this point we are deadlocked:

shard0 is holding the critical section and won't release it until shard1 enters its own.
shard1 is holding the collection lock in MODE_IX until the txn gets committed, which won't happen because the txn (or perhaps, rather the refresh) is not making progress on shard0 due to the critical section.
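
One way to see the cycle is to write the dependencies above as a wait-for graph and search it for a loop. The sketch below is only an illustration: the node labels are informal paraphrases of the steps in this ticket, and `find_cycle` and `WAIT_FOR` are hypothetical names, not server code.

```python
def find_cycle(wait_for):
    """Return the nodes of one cycle in a wait-for graph, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Reached a node already on the current path: that is a cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in wait_for.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in wait_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Each key cannot happen until everything in its list has happened.
WAIT_FOR = {
    "rename: shard0 releases critical section":
        ["rename: shard1 enters critical section"],
    "rename: shard1 enters critical section":
        ["txn: shard1 releases MODE_IX collection lock"],
    "txn: shard1 releases MODE_IX collection lock":
        ["txn: commit"],
    "txn: commit":
        ["txn: shard0 statement completes (shardVersion refresh)"],
    "txn: shard0 statement completes (shardVersion refresh)":
        ["rename: shard0 releases critical section"],
}

if __name__ == "__main__":
    for step in find_cycle(WAIT_FOR):
        print(step)
```

Every edge in the loop crosses between the rename and the transaction or between the two shards, which is why neither shard can detect the deadlock by looking only at its local lock state.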

More generally, I believe this situation can occur in any DDL operation that needs to hold the critical section on several nodes at the same time. I believe that resharding may also be affected by this.


0001-SERVER-59965-repro.patch
Sep 15 2021 03:13:04 PM UTC
8 kB
Jordi Serra Torrens

is depended on by

SERVER-58991 Acquire the critical section on the recipient shard of a moveChunk operation

Closed

Assignee: Jordi Serra Torrens
Reporter: Jordi Serra Torrens
Participants: Githook User, Jordi Serra Torrens
Votes: 0
Watchers: 6

Created: Sep 15 2021 03:11:10 PM UTC
Updated: Oct 29 2023 09:48:36 PM UTC
Resolved: Oct 25 2021 07:59:24 AM UTC
Confidence Status Last Update: 22/Oct/21 8:43 AM
