[SERVER-45845] TransactionCoordinator stepUp can deadlock Created: 29/Jan/20  Updated: 06/Dec/22  Resolved: 19/Feb/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-45953 Exempt oplog readers from acquiring r... Closed
Duplicate
duplicates SERVER-45953 Exempt oplog readers from acquiring r... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:
Case:

 Description   

Scenario:

  • a transaction on namespace foo.bar is in the prepared state
  • a new primary has just stepped up on this shard

Sequence of events leading to the deadlock:
1. The new primary's TransactionParticipants reacquire the locks needed for the prepared transaction.
2. An operation performs a write, generating a new oplog entry and advancing the last op timestamp.
3. An operation requiring an exclusive lock that conflicts with the prepared transaction arrives on the new primary.
4. Multiple operations that conflict with the exclusive lock also arrive and block behind the lock request from step 3. Enough of them arrive to exhaust the read tickets.
5. The TransactionCoordinatorService stepUp code kicks in and waits for the last op to become majority committed.
6. Secondaries try to fetch the oplog from the new primary, but their queries cannot run because the read tickets are exhausted, so the majority commit point cannot advance.
7. The retried CoordinatorCommit command for the prepared transaction arrives and waits for the TransactionCoordinatorService to fully step up before proceeding. Deadlock occurs (see the sketch after this list). Note that the TransactionCoordinatorService will also try to resume coordinating in-progress coordinators after waiting for the majority commit.
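
Below is a minimal, self-contained C++ sketch of the wait cycle. All names (TicketPool, preparedTxn, coordinatorStepUp, etc.) are hypothetical, and the lock queue and the retried CoordinatorCommit are collapsed into direct waits; this illustrates the dependency cycle only and is not actual server code.

{code:cpp}
// deadlock_sketch.cpp: toy model of the cycle above; all names are hypothetical.
// Build: g++ -std=c++17 -pthread deadlock_sketch.cpp
#include <chrono>
#include <condition_variable>
#include <cstdlib>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for the storage engine's read ticket pool (exhausted in step 4).
struct TicketPool {
    explicit TicketPool(int n) : available(n) {}
    void acquire() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return available > 0; });
        --available;
    }
    std::mutex m;
    std::condition_variable cv;
    int available;
};

int main() {
    TicketPool readTickets(2);      // deliberately tiny pool
    std::mutex collLock;            // stands in for the locks held by the prepared txn
    std::mutex m;
    std::condition_variable cv;
    bool majorityAdvanced = false;  // set once an oplog reader can run
    bool stepUpDone = false;        // set once coordinator stepUp finishes

    // Step 1: the prepared transaction keeps its locks until it is told to commit,
    // which (via the retried CoordinatorCommit in step 7) requires stepUp to finish.
    std::thread preparedTxn([&] {
        std::unique_lock<std::mutex> coll(collLock);
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return stepUpDone; });     // never satisfied
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(100));  // keep the demo deterministic

    // Steps 3-4: conflicting operations queue up behind the prepared transaction's
    // locks while each holds a read ticket, draining the pool.
    std::vector<std::thread> blockedReaders;
    for (int i = 0; i < 2; ++i) {
        blockedReaders.emplace_back([&] {
            readTickets.acquire();                       // ticket held while blocked
            std::lock_guard<std::mutex> coll(collLock);  // blocks behind preparedTxn
        });
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));  // let the pool drain first

    // Step 6: a secondary's oplog query needs a read ticket before it can run,
    // so the majority commit point cannot advance.
    std::thread oplogReader([&] {
        readTickets.acquire();                           // blocks: pool is empty
        std::lock_guard<std::mutex> lk(m);
        majorityAdvanced = true;
        cv.notify_all();
    });

    // Step 5: TransactionCoordinatorService stepUp waits for the last op to become
    // majority committed before CoordinatorCommit requests may proceed (step 7).
    std::thread coordinatorStepUp([&] {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return majorityAdvanced; });
        stepUpDone = true;
        cv.notify_all();
    });

    // Every thread above is now blocked on another one in a cycle; report and bail out.
    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::cout << "deadlock: prepared txn -> coordinator stepUp -> majority commit -> "
                 "oplog read ticket -> blocked readers -> prepared txn\n";
    std::_Exit(0);  // abandon the deadlocked threads; this is a demo, not a fix
}
{code}

Each thread models one step of the sequence; after two seconds main reports the cycle because nothing can make progress: the prepared transaction waits for stepUp, stepUp waits for the majority commit point, the majority commit point waits for an oplog read ticket, and all tickets are held by readers blocked behind the prepared transaction's locks.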



 Comments   
Comment by Esha Maharishi (Inactive) [ 19/Feb/20 ]

renctan ok, closing as dup of SERVER-45953. If you think we shouldn't do that for any reason, please do reopen this ticket.

Comment by Randolph Tan [ 19/Feb/20 ]

esha.maharishi I believe fixing SERVER-45953 will break the cyclic dependency in the deadlock chain.
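
To illustrate that against the toy sketch in the description (hypothetical code, not the actual SERVER-45953 patch): exempting the oplog reader from the ticket pool removes the edge that closes the cycle.

{code:cpp}
// With oplog readers exempt from read tickets (the SERVER-45953 behaviour), the
// secondary's oplog query no longer waits on the exhausted pool, so the majority
// commit point advances, coordinator stepUp completes, and the retried
// CoordinatorCommit (and therefore the prepared transaction) can make progress.
std::thread oplogReader([&] {
    // readTickets.acquire();   // no longer required on the oplog-read path
    std::lock_guard<std::mutex> lk(m);
    majorityAdvanced = true;
    cv.notify_all();
});
{code}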

Comment by Esha Maharishi (Inactive) [ 14/Feb/20 ]

renctan, is there work to do after SERVER-45953?
