[SERVER-76834] Circular wait dependency between PBWM and ticket acquisition Created: 04/May/23 Updated: 01/Dec/23 Resolved: 04/Aug/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 7.0.0-rc0, 6.0.5, 5.0.17, 4.4.21, 6.3.1 |
| Fix Version/s: | 7.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Gregory Noma |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Storage Execution |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v7.0 |
| Sprint: | Execution NAMR Team 2023-08-21 |
| Participants: | |
| Linked BF Score: | 107 |
| Description |
|
Any thread which takes the PBWM and a ticket has the potential to introduce a three-way deadlock in the presence of prepared transactions. The deadlock can be summarised like this:
We now have a wait cycle of (T3) -> (T2) -> (T1) -> (T3). This demonstrates that it is not safe to block on a ticket acquisition while holding the PBWM, and there are at least two places where this can happen:
This (more general) ticket is for the StorEx team to tighten up the locking rules in order to prevent deadlocks like these from being introduced. One possibility is to add an invariant that we can't take both PBWM and a ticket - a thread must choose one. |
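The proposed invariant could look roughly like the sketch below. It is illustrative only and is not the server's code: `ResourceTracker`, `LockingRuleViolation`, and the resource names `"PBWM"` and `"ticket"` are hypothetical names introduced here. The assumption is that each thread records what it currently holds, and that a request for the second resource fails immediately rather than blocking.

```python
import threading


class LockingRuleViolation(RuntimeError):
    """Raised when one thread would hold both the PBWM and a ticket."""


class ResourceTracker:
    """Hypothetical per-thread bookkeeping for the two resources."""

    RESOURCES = ("PBWM", "ticket")

    def __init__(self):
        # threading.local gives every thread its own independent `held` set.
        self._local = threading.local()

    def _held(self):
        if not hasattr(self._local, "held"):
            self._local.held = set()
        return self._local.held

    def acquire(self, resource):
        if resource not in self.RESOURCES:
            raise ValueError(f"unknown resource: {resource!r}")
        held = self._held()
        conflicting = [r for r in self.RESOURCES if r != resource and r in held]
        if conflicting:
            # Fail fast instead of blocking and risking a wait cycle.
            raise LockingRuleViolation(
                f"cannot acquire {resource} while holding {conflicting[0]}"
            )
        held.add(resource)

    def release(self, resource):
        # Releasing a resource that is not held is a no-op in this sketch.
        self._held().discard(resource)
```

A thread that holds the PBWM and then asks for a ticket gets an error at the call site, so the wait cycle described above cannot form through that thread. Releasing the PBWM first makes the ticket acquisition legal again.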
| Comments |
| Comment by Gregory Noma [ 04/Aug/23 ] |
|
Now that we no longer take the PBWM as of |
| Comment by Connie Chen [ 08/May/23 ] |
|
We will plan to backport this to 7.0, but we do not think this is a 7.0.0 blocker for code freeze since Kal is working on the known instance in |
| Comment by Kaloian Manassiev [ 05/May/23 ] |
|
This problem exists in all versions where the PBWM exists, but 7.0 would be the most impacted since it significantly lowered the number of available tickets. We will not backport the removal of the PBWM there, so that cannot be the solution. At the very least we should have a uassert somewhere if that happens so the server doesn't stall. I am removing the Epic and making it "Related To" to ensure this ticket shows up on StorEx's triage. |
| Comment by Fausto Leyva (Inactive) [ 04/May/23 ] |
|
We would rather remove the PBWM lock. |