[SERVER-73218] Deadlock among RecoverRefreshThread, index build, step down, and prepared transaction Created: 23/Jan/23 Updated: 29/Oct/23 Resolved: 27/Jan/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 6.3.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Gregory Noma | Assignee: | Jordi Serra Torrens |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Assigned Teams: | Sharding EMEA |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Sharding EMEA 2023-02-06 |
| Participants: | |
| Linked BF Score: | 124 |
| Description |
|
A prepared transaction holds an IX lock on its collection. An index build is about to run its second drain phase and, after explicitly unlocking the RSTL, enqueues a collection S lock behind the prepared transaction. The RecoverRefreshThread then runs, taking the RSTL in IX mode and (uninterruptibly) enqueuing a collection IX lock. (Because the index build's previously enqueued collection lock is mode S, this IX lock queues behind it, even though the lock granted to the prepared transaction is also IX.) Then a step down occurs, enqueuing the RSTL in X mode behind the RecoverRefreshThread. The RecoverRefreshThread is explicitly marked as killable by step down, but its lock acquisition is uninterruptible, so the step down cannot interrupt it. Thus a four-way deadlock has been reached. |
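The queueing described above can be replayed with a toy lock manager. This is a minimal Python sketch, not the server's lock manager: the `Resource` class, the `build_scenario` function, and the rule "a request waits if it conflicts with any granted or earlier queued request" are illustrative assumptions chosen to match the behaviour the description states. The description does not spell out what the prepared transaction itself is waiting on, so that edge is left out of the sketch.

```python
# Standard multi-granularity lock compatibility (IS, IX, S, X).
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}


class Resource:
    """Hypothetical lockable resource with a FIFO wait queue."""

    def __init__(self, name):
        self.name = name
        self.granted = []  # (owner, mode) pairs currently holding the lock
        self.waiting = []  # (owner, mode) pairs queued in arrival order

    def request(self, owner, mode):
        """Grant or enqueue the request; return the owners it waits for."""
        # Assumption: a new request conflicts with granted AND queued modes,
        # which is why an IX request queues behind a pending S request.
        blockers = [
            o for o, m in self.granted + self.waiting
            if m not in COMPATIBLE[mode]
        ]
        (self.waiting if blockers else self.granted).append((owner, mode))
        return blockers


def build_scenario():
    collection = Resource("collection")
    rstl = Resource("RSTL")
    wait_for = {}

    # 1. The prepared transaction holds the collection in IX.
    collection.request("prepared transaction", "IX")
    # 2. The index build (RSTL already unlocked) enqueues collection S.
    wait_for["index build"] = collection.request("index build", "S")
    # 3. RecoverRefreshThread takes RSTL IX, then enqueues collection IX.
    rstl.request("RecoverRefreshThread", "IX")
    wait_for["RecoverRefreshThread"] = collection.request(
        "RecoverRefreshThread", "IX"
    )
    # 4. Step down enqueues RSTL X behind RecoverRefreshThread.
    wait_for["step down"] = rstl.request("step down", "X")
    return wait_for


wait_for = build_scenario()
for waiter, blockers in wait_for.items():
    print(f"{waiter} waits for {', '.join(blockers)}")
```

In this model the RecoverRefreshThread is blocked only by the queued S request, not by the granted IX lock, which is the detail that links the index build into the chain.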
| Comments |
| Comment by Githook User [ 27/Jan/23 ] |
|
Author: {'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'} Message: |
| Comment by Jordi Serra Torrens [ 25/Jan/23 ] |
I'd say the only reason was to ensure this was always executed, to signal that this refresh has finished and new ones can start. In the PR above I propose an alternative approach that does not require uninterruptibility. Edit: We opted for a simpler approach where we keep the uninterruptible guard but remove the need to acquire the DB/Collection lock under it (it was unnecessary to begin with). Now all we take under the uninterruptible lock guard is a RESOURCE_MUTEX. |
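The shape of that approach can be illustrated with a small Python analogy. It is only a sketch and not the server code: `RefreshState`, `recover_refresh`, and `Interrupted` are hypothetical names, and a plain `threading.Lock` stands in for the RESOURCE_MUTEX. The point it shows is that the "refresh finished" signal always runs, yet the only thing acquired on that must-run path is a short-lived mutex rather than a DB/Collection lock that could queue behind other operations.

```python
import threading


class Interrupted(Exception):
    """Hypothetical stand-in for an operation being killed (e.g. by step down)."""


class RefreshState:
    """Hypothetical bookkeeping for whether a refresh is in progress."""

    def __init__(self):
        self._mutex = threading.Lock()  # plays the role of the RESOURCE_MUTEX
        self.in_progress = False

    def begin(self):
        """Return True if this caller started a refresh, False if one is running."""
        with self._mutex:
            if self.in_progress:
                return False
            self.in_progress = True
            return True

    def finish(self):
        """Signal completion; takes only the mutex, never a collection lock."""
        with self._mutex:
            self.in_progress = False


def recover_refresh(state, do_refresh):
    if not state.begin():
        return "already running"
    try:
        do_refresh()  # interruptible work
        return "done"
    except Interrupted:
        return "interrupted"
    finally:
        # Must always run so that new refreshes can start.
        state.finish()
```

A caller whose refresh is interrupted still leaves the state clean: after `recover_refresh` returns `"interrupted"`, `state.in_progress` is `False` and a later refresh can begin.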
| Comment by Fausto Leyva (Inactive) [ 24/Jan/23 ] |
|
Hey, we want to understand why the RecoverRefreshThread is not interruptible. |