[SERVER-65821] Deadlock during setFCV when there are prepared transactions that have not persisted commit/abort decision
| Created: | 20/Apr/22 | Updated: | 29/Oct/23 | Resolved: | 09/May/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 6.0.0-rc1, 5.3.0, 4.4.0, 5.0.0 |
| Fix Version/s: | 5.3.2, 6.0.0-rc5, 4.4.15, 5.0.10, 6.1.0-rc0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Cheahuychou Mao | Assignee: | Gregory Noma |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v6.0, v5.3, v5.0, v4.4 |
| Sprint: | Execution Team 2022-05-02, Execution Team 2022-05-16 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 169 |
| Description |
|
Here are the steps to reproduce the deadlock:
|
| Comments |
| Comment by Githook User [ 01/Jun/22 ] |
|
Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)
Message: (cherry picked from commit 5f15e515c617fca69a4a6dc4be741c19e2d07aa8) |
| Comment by Githook User [ 26/May/22 ] |
|
Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)
Message: (cherry picked from commit 5f15e515c617fca69a4a6dc4be741c19e2d07aa8) |
| Comment by Githook User [ 10/May/22 ] |
|
Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)
Message: (cherry picked from commit 5f15e515c617fca69a4a6dc4be741c19e2d07aa8) |
| Comment by Githook User [ 10/May/22 ] |
|
Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)
Message: (cherry picked from commit 5f15e515c617fca69a4a6dc4be741c19e2d07aa8) |
| Comment by Gregory Noma [ 09/May/22 ] |
|
The solution we've implemented for now adds a new global lock resource, resourceIdFeatureCompatibilityVersion, which is implicitly acquired in MODE_IX whenever a global lock is acquired in MODE_IX or MODE_X. The setFCV command acquires this new resource in MODE_S as a barrier, rather than using the global lock for this purpose. The journal flusher opts out of conflicting with setFCV, as does the transaction coordinator when writing the commit decision. This allows the transaction to commit, which in turn allows the setFCV to complete.

Going forward, we would like to come up with a more permanent solution to this issue. One potential idea is to combine the transaction participant and transaction coordinator on the coordinating shard. A change would also need to be made to the journal flusher: rather than always running as its own thread, any operation that currently needs to wait for the journal flusher to run could potentially pick up the work using its own resources. |
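The locking scheme described in the comment above can be illustrated with a toy model. This is only a sketch in Python, not the server's C++ lock manager: `ToyResource`, `acquire_global`, and the `skip_fcv` flag are hypothetical names invented for illustration, and lock queueing and waiting are not modeled (an acquisition either succeeds immediately or fails). It uses the standard multi-granularity compatibility matrix to show why a MODE_S holder on the FCV resource blocks ordinary MODE_IX acquirers, while an operation that opts out still gets through.

```python
from enum import Enum


class Mode(Enum):
    IS = "MODE_IS"
    IX = "MODE_IX"
    S = "MODE_S"
    X = "MODE_X"


# Standard multi-granularity lock compatibility: COMPATIBLE[held] is the
# set of modes that can be granted while `held` is already granted.
COMPATIBLE = {
    Mode.IS: {Mode.IS, Mode.IX, Mode.S},
    Mode.IX: {Mode.IS, Mode.IX},
    Mode.S: {Mode.IS, Mode.S},
    Mode.X: set(),
}


class ToyResource:
    """Hypothetical stand-in for one lock resource (no wait queue)."""

    def __init__(self, name):
        self.name = name
        self.granted = []  # list of (holder, mode) pairs

    def try_acquire(self, holder, mode):
        # Grant only if the request is compatible with every held mode.
        if all(mode in COMPATIBLE[held] for _, held in self.granted):
            self.granted.append((holder, mode))
            return True
        return False

    def release(self, holder):
        self.granted = [(h, m) for h, m in self.granted if h != holder]


def acquire_global(holder, mode, global_lock, fcv_lock, skip_fcv=False):
    """Acquire the global lock; for MODE_IX or MODE_X also take the FCV
    resource in MODE_IX, unless the caller opts out (assumed flag)."""
    took_fcv = False
    if mode in (Mode.IX, Mode.X) and not skip_fcv:
        if not fcv_lock.try_acquire(holder, Mode.IX):
            return False  # blocked behind the setFCV barrier
        took_fcv = True
    if not global_lock.try_acquire(holder, mode):
        if took_fcv:
            fcv_lock.release(holder)
        return False
    return True


global_lock = ToyResource("Global")
fcv_lock = ToyResource("FeatureCompatibilityVersion")

# setFCV takes the FCV resource in MODE_S as its barrier.
fcv_lock.try_acquire("setFCV", Mode.S)

# An ordinary writer is held back by the barrier...
writer_ok = acquire_global("writer", Mode.IX, global_lock, fcv_lock)
# ...but an operation that opts out (as the journal flusher and the
# coordinator's commit-decision write do in the comment) is not.
flusher_ok = acquire_global(
    "journalFlusher", Mode.IX, global_lock, fcv_lock, skip_fcv=True
)
print(writer_ok, flusher_ok)
```

In this model the barrier no longer sits on the global lock itself, so the opted-out operations can still take the global lock in MODE_IX and make progress while setFCV holds its barrier.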
| Comment by Githook User [ 09/May/22 ] |
|
Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)
Message: |