[SERVER-65821] Deadlock during setFCV when there are prepared transactions that have not persisted commit/abort decision Created: 20/Apr/22  Updated: 29/Oct/23  Resolved: 09/May/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 6.0.0-rc1, 5.3.0, 4.4.0, 5.0.0
Fix Version/s: 5.3.2, 6.0.0-rc5, 4.4.15, 5.0.10, 6.1.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Cheahuychou Mao Assignee: Gregory Noma
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Problem/Incident
causes SERVER-75205 Deadlock between stepdown and restori... Closed
Related
related to SERVER-66719 dbCheck FCV lock upgrade causes deadl... Closed
related to SERVER-66213 setFCV may need to wait for transacti... Open
is related to SERVER-60682 TransactionCoordinator may block acqu... Closed
is related to SERVER-57476 Operation may block on prepare confli... Closed
is related to SERVER-66340 Improve distributed transaction commi... Closed
is related to SERVER-66341 Improve journal flusher locking behavior Closed
is related to SERVER-66342 Remove resourceIdFeatureCompatibility... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0, v5.3, v5.0, v4.4
Sprint: Execution Team 2022-05-02, Execution Team 2022-05-16
Participants:
Case:
Linked BF Score: 169

 Description   

Here are the steps to reproduce the deadlock:

  • Run a cross-shard transaction with two participant shards, shard0 and shard1, where shard0 is the coordinator shard. Pause the TransactionCoordinator thread right before the commit decision is written (i.e. after the transaction has entered the "prepared" state).
  • Run a setFCV command against shard0. Wait until the setFCV thread is blocked waiting to acquire the global S lock (i.e. waiting for prepared transactions that existed before the FCV change to commit or abort).
  • Unpause the TransactionCoordinator thread. The transaction cannot commit because the TransactionCoordinator is now blocked waiting to acquire the IX lock on the config.transaction_coordinators collection, which it needs in order to write the commit decision (its request queues behind the setFCV thread's pending global S lock request).
  • Both the setFCV thread and TransactionCoordinator thread now hang.
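
The wait cycle in the steps above can be illustrated with a small model. This is only a sketch, not server code: the Resource class, the owner names, and the rule that a new request queues behind any conflicting pending request are simplifying assumptions made for illustration.

```python
# Toy model of the wait cycle described in the reproduction steps.
# Everything here (class, owner names, queueing rule) is hypothetical.

# Compatibility of the two lock modes mentioned in this ticket.
COMPATIBLE = {
    ("IX", "IX"): True,
    ("IX", "S"): False,
    ("S", "IX"): False,
    ("S", "S"): True,
}


class Resource:
    def __init__(self, name):
        self.name = name
        self.granted = []  # (owner, mode) pairs currently holding the lock
        self.pending = []  # (owner, mode) pairs waiting, in arrival order

    def request(self, owner, mode):
        """Grant only if compatible with every holder and every waiter."""
        others = self.granted + self.pending
        if any(not COMPATIBLE[(mode, m)] for _, m in others):
            self.pending.append((owner, mode))
            return False
        self.granted.append((owner, mode))
        return True

    def blockers(self, owner):
        """Owners that the pending request of `owner` is waiting on."""
        idx = next(i for i, (o, _) in enumerate(self.pending) if o == owner)
        mode = self.pending[idx][1]
        ahead = self.granted + self.pending[:idx]
        return [o for o, m in ahead if not COMPATIBLE[(mode, m)]]


def has_cycle(graph):
    """Depth-first search for a cycle in a wait-for graph."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, []))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in graph)


def build_wait_graph():
    global_lock = Resource("Global")
    global_lock.request("prepared_txn", "IX")  # step 1: transaction is prepared
    global_lock.request("setFCV", "S")         # step 2: setFCV waits for the S lock
    global_lock.request("coordinator", "IX")   # step 3: coordinator wants to write
    return {
        "setFCV": global_lock.blockers("setFCV"),
        "coordinator": global_lock.blockers("coordinator"),
        # Logical dependency: the prepared transaction cannot finish
        # until the coordinator has written the commit decision.
        "prepared_txn": ["coordinator"],
    }


waits_for = build_wait_graph()
print(waits_for)
print("deadlock:", has_cycle(waits_for))
```

In this model the setFCV request waits on the prepared transaction, the coordinator waits on setFCV, and the prepared transaction waits on the coordinator, which closes the cycle.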


 Comments   
Comment by Githook User [ 01/Jun/22 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-65821 Allow prepared transactions to write commit decision across FCV barrier

(cherry picked from commit 5f15e515c617fca69a4a6dc4be741c19e2d07aa8)
(cherry picked from commit 1c3268ae7fd8ffd678c20d5f2ac977be2a2c982f)
Branch: v4.4
https://github.com/mongodb/mongo/commit/f8b07d496fe3f8559c2ce505c2bd2787b58342ab

Comment by Githook User [ 26/May/22 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-65821 Allow prepared transactions to write commit decision across FCV barrier

(cherry picked from commit 5f15e515c617fca69a4a6dc4be741c19e2d07aa8)
(cherry picked from commit 1c3268ae7fd8ffd678c20d5f2ac977be2a2c982f)
Branch: v5.0
https://github.com/mongodb/mongo/commit/1afb8d3b1efd8b02f3d09af822f9a54a8c9f6f3d

Comment by Githook User [ 10/May/22 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-65821 Allow prepared transactions to write commit decision across FCV barrier

(cherry picked from commit 5f15e515c617fca69a4a6dc4be741c19e2d07aa8)
Branch: v5.3
https://github.com/mongodb/mongo/commit/007846ad9c22495af276034c6d855572947e2742

Comment by Githook User [ 10/May/22 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-65821 Allow prepared transactions to write commit decision across FCV barrier

(cherry picked from commit 5f15e515c617fca69a4a6dc4be741c19e2d07aa8)
Branch: v6.0
https://github.com/mongodb/mongo/commit/d9acfef2b9bd2f3715906596bbf566e6023d8a64

Comment by Gregory Noma [ 09/May/22 ]

The solution we've implemented for now is to have a new global lock resource, resourceIdFeatureCompatibilityVersion, which gets implicitly acquired in MODE_IX when acquiring a global lock in MODE_IX or MODE_X. The setFCV command acquires this new resource in MODE_S as a barrier, rather than using the global lock for this purpose. Then the journal flusher opts out of conflicting with setFCV, as does the transaction coordinator when writing the commit decision. This allows the transaction to commit and subsequently allows the setFCV to complete.
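
A rough model of why the separate barrier resource breaks the cycle is sketched below. It is not the server's lock manager: the Resource class, the acquire_global_ix helper, and the conflicts_with_set_fcv flag are hypothetical names invented for this illustration, and the queueing rule is a simplification.

```python
# Toy model of the fix: setFCV uses a dedicated FCV resource as its barrier,
# and selected callers opt out of taking that resource. All names hypothetical.

COMPATIBLE = {
    ("IX", "IX"): True,
    ("IX", "S"): False,
    ("S", "IX"): False,
    ("S", "S"): True,
}


class Resource:
    def __init__(self, name):
        self.name = name
        self.granted = []  # (owner, mode) pairs currently holding the lock
        self.pending = []  # (owner, mode) pairs waiting, in arrival order

    def _blocked(self, mode, queue_ahead):
        others = self.granted + queue_ahead
        return any(not COMPATIBLE[(mode, m)] for _, m in others)

    def request(self, owner, mode):
        if self._blocked(mode, self.pending):
            self.pending.append((owner, mode))
            return False
        self.granted.append((owner, mode))
        return True

    def release(self, owner):
        self.granted = [(o, m) for o, m in self.granted if o != owner]
        # Re-examine waiters in arrival order and grant those now compatible.
        still_pending = []
        for o, m in self.pending:
            if self._blocked(m, still_pending):
                still_pending.append((o, m))
            else:
                self.granted.append((o, m))
        self.pending = still_pending


def acquire_global_ix(global_res, fcv_res, owner, conflicts_with_set_fcv=True):
    """Take Global IX; also take the FCV resource in IX unless opted out."""
    if conflicts_with_set_fcv and not fcv_res.request(owner, "IX"):
        return False
    return global_res.request(owner, "IX")


def run_fixed_scenario():
    global_res, fcv_res = Resource("Global"), Resource("FCV")
    result = {}
    # The prepared transaction holds Global IX and, implicitly, FCV IX.
    result["txn_prepared"] = acquire_global_ix(global_res, fcv_res, "prepared_txn")
    # setFCV takes the FCV resource in S as its barrier and must wait.
    result["set_fcv_immediate"] = fcv_res.request("setFCV", "S")
    # The coordinator opts out, so it never queues behind the barrier.
    result["coordinator_ix"] = acquire_global_ix(
        global_res, fcv_res, "coordinator", conflicts_with_set_fcv=False
    )
    # With the decision written, the transaction commits and releases its locks.
    global_res.release("prepared_txn")
    fcv_res.release("prepared_txn")
    result["set_fcv_after_commit"] = ("setFCV", "S") in fcv_res.granted
    return result


outcome = run_fixed_scenario()
print(outcome)
```

In this model the coordinator's request is granted while setFCV is still waiting, and the barrier is granted once the prepared transaction releases its locks.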

However, going forward we would like to come up with a more permanent solution to this issue. One potential idea is to combine the transaction participant and transaction coordinator on the coordinating shard. A change would also need to be made to the journal flusher. Potentially, instead of the journal flusher necessarily being its own thread, any operation that currently needs to wait for it to run could pick up the work using its own resources.

Comment by Githook User [ 09/May/22 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-65821 Allow prepared transactions to write commit decision across FCV barrier
Branch: master
https://github.com/mongodb/mongo/commit/5f15e515c617fca69a4a6dc4be741c19e2d07aa8

Generated at Thu Feb 08 06:03:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.