[DOCS-14892] Investigate changes in SERVER-59965: Distributed deadlock between renameCollection and multi-shard transaction Created: 25/Oct/21  Updated: 13/Nov/23  Resolved: 23/Feb/22

Status: Closed
Project: Documentation
Component/s: manual, Server
Affects Version/s: None
Fix Version/s: 5.0.4, 5.2.0, 5.1.0-rc3, Server_Docs_20231030, Server_Docs_20231106, Server_Docs_20231105, Server_Docs_20231113

Type: Task Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team Assignee: Dave Cuthbert (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
documents SERVER-59965 Distributed deadlock between renameCo... Closed
Related
is related to DOCS-14907 [BACKPORT] [v5.0] Distributed deadloc... Closed
Participants:
Days since reply: 1 year, 50 weeks ago
Epic Link: DOCSP-19447

 Description   
Downstream Change Summary

Added new 'metadataRefreshInTransactionMaxWaitBehindCritSecMS' server parameter.
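
The parameter name suggests an upper bound, in milliseconds, on how long a metadata refresh that runs inside a transaction waits behind a critical section. The following is a minimal Python sketch of that bounded-wait idea only. It is not the server's implementation: the CriticalSection class, the refresh_in_transaction function, and the 50 ms value are all hypothetical and chosen for illustration.

```python
import threading

# Hypothetical stand-in for the server parameter value, in milliseconds.
MAX_WAIT_BEHIND_CRIT_SEC_MS = 50


class CriticalSection:
    """Toy critical section: 'released' is set once the DDL operation leaves it."""

    def __init__(self) -> None:
        self.released = threading.Event()


def refresh_in_transaction(crit_sec: CriticalSection, max_wait_ms: int) -> str:
    """Wait for the critical section for at most max_wait_ms, then give up."""
    if crit_sec.released.wait(timeout=max_wait_ms / 1000):
        return "refreshed"
    # Giving up lets the transaction fail instead of blocking indefinitely.
    return "gave up"


held = CriticalSection()  # never released: models a long-held critical section
print(refresh_in_transaction(held, MAX_WAIT_BEHIND_CRIT_SEC_MS))  # gave up

free = CriticalSection()
free.released.set()  # already released: the refresh proceeds immediately
print(refresh_in_transaction(free, MAX_WAIT_BEHIND_CRIT_SEC_MS))  # refreshed
```

With an unbounded wait, the first call would never return while the critical section stays held; the timeout turns that into a failure the caller can handle.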

Description of Linked Ticket

As part of a sharded renameCollection, the DDLCoordinator instructs all participant shards to enter their critical sections. Once all shards have entered, the coordinator does some work on the configsvr and finally tells the shards to leave their critical sections.

When running renameCollection concurrently with multi-shard transactions that affect that same collection, there exists a particular interleaving that can lead to a distributed deadlock:
1. shard0 receives the RenameCollectionParticipant command and enters its critical section
2. shard0 attempts to run a statement of the multi-shard txn. Since the critical section is taken, the statement throws StaleConfig. The error is caught on the way out of the command, which then attempts to refresh the shardVersion. However, since the critical section is taken, the refresh blocks until the critical section is released.
3. shard1 runs its part of that multi-shard transaction, which acquires the collection lock in MODE_IX and then stashes the locks.
4. shard1 receives the RenameCollectionParticipant command and attempts to enter its critical section. However, since the transaction in step 3 stashed the collection lock in MODE_IX, shard1 cannot acquire the collection lock in MODE_S needed to enter the critical section.

At this point we are deadlocked:

  • shard0 is holding its critical section and won't release it until shard1 enters its own.
  • shard1 is holding the collection lock in MODE_IX until the txn gets committed, which won't happen because the txn (or perhaps, rather, the refresh) is not making progress on shard0 due to the critical section.
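
The circular wait described above can be sketched as a wait-for graph, where an edge from A to B means "A cannot proceed until B does". This is a minimal Python illustration: the node labels and the find_cycle helper are made up for this sketch and do not correspond to server code or identifiers.

```python
from typing import Dict, List, Optional

# Wait-for edges taken from the interleaving described in this ticket.
# Node names are illustrative labels only.
WAIT_FOR: Dict[str, List[str]] = {
    "shard0 refresh (txn statement)": ["shard0 critical section released"],
    "shard0 critical section released": ["shard1 enters critical section"],
    "shard1 enters critical section": ["shard1 MODE_IX lock released"],
    "shard1 MODE_IX lock released": ["txn commit"],
    "txn commit": ["shard0 refresh (txn statement)"],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Depth-first search; returns one cycle as a list of nodes, or None."""
    done = set()  # nodes fully explored without finding a cycle

    def visit(node: str, path: List[str]) -> Optional[List[str]]:
        if node in path:
            # Reached a node already on the current path: that is a cycle.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        for nxt in graph.get(node, []):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


cycle = find_cycle(WAIT_FOR)
if cycle:
    print(" -> ".join(cycle))
```

The search returns a path that starts and ends at the same node, which is the deadlock: every participant is waiting on something that in turn waits on it.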

More generally, I believe this situation can occur in any DDL operation that needs to acquire the critical section on several nodes at the same time, so resharding may also be affected.



 Comments   
Comment by Githook User [ 23/Feb/22 ]

Author: Dave (davemungo) <69165704+davemungo@users.noreply.github.com>

Message: DOCS-14892 BACKPORT (#691)

  • Remove 5.2 release notes
Comment by Githook User [ 23/Feb/22 ]

Author: Dave (davemungo) <69165704+davemungo@users.noreply.github.com>

Message: DOCS-14892 BACKPORT (#690)
Branch: v5.2
https://github.com/10gen/docs-mongodb-internal/commit/a2b88baa7c4dd6facf545813756b66a34d8762a9

Comment by Githook User [ 23/Feb/22 ]

Author: Dave (davemungo) <69165704+davemungo@users.noreply.github.com>

Message: DOCS-14892 distributed deadlock in transactions v5.3 (#678)

  • Staging fixes
  • Staging fixes
  • Review feedback
  • Staging fixes
  • Review feedback
Comment by PM Bot [ 25/Oct/21 ]

Downstream changes updated for upstream SERVER-59965:
Added new 'metadataRefreshInTransactionMaxWaitBehindCritSecMS' server parameter.

Generated at Thu Feb 08 08:11:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.