[SERVER-73218] Deadlock among RecoverRefreshThread, index build, step down, and prepared transaction Created: 23/Jan/23  Updated: 29/Oct/23  Resolved: 27/Jan/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.3.0-rc0

Type: Bug Priority: Major - P3
Reporter: Gregory Noma Assignee: Jordi Serra Torrens
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-71444 [Sharding] Remove or document instanc... Open
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding EMEA 2023-02-06
Participants:
Linked BF Score: 124

 Description   

A prepared transaction holds an IX lock on its collection. An index build is about to run its second drain phase and, after explicitly unlocking the RSTL, enqueues a collection S lock behind the prepared transaction. The RecoverRefreshThread runs, taking the RSTL in IX mode and (uninterruptibly) enqueuing a collection IX lock. (Since the previously enqueued index build collection lock is mode S, this IX lock queues behind it even though the lock granted to the prepared transaction is also IX.) Then a step down occurs, enqueuing the RSTL in X mode behind the RecoverRefreshThread. The RecoverRefreshThread is explicitly marked as killable by step down, but its lock acquisition is uninterruptible, so the kill cannot take effect. Thus a four-way deadlock has been reached.
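
The queueing order described above can be replayed in a small model. This is only a sketch under stated assumptions, not the server's lock manager: the `Resource` class, the owner names, and the strict FIFO rule (a request waits if it conflicts with anything granted or queued ahead of it) are all hypothetical simplifications. It derives only the three wait edges the description spells out; what the prepared transaction itself is blocked on is not stated there, so it is left out.

```python
# Simplified, hypothetical model of multi-granularity locking with FIFO queues.
# For each lock mode, the set of modes it can coexist with.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}


class Resource:
    """One lockable resource with a granted list and a FIFO wait queue."""

    def __init__(self, name):
        self.name = name
        self.granted = []  # (owner, mode) pairs currently holding the lock
        self.queue = []    # (owner, mode) pairs waiting, in arrival order

    def request(self, owner, mode):
        """Grant if compatible with everything granted or queued ahead."""
        ahead = self.granted + self.queue
        if all(held in COMPATIBLE[mode] for _, held in ahead):
            self.granted.append((owner, mode))
            return True
        self.queue.append((owner, mode))
        return False

    def blockers(self, owner):
        """Owners ahead of a queued request whose mode conflicts with it."""
        for i, (queued_owner, mode) in enumerate(self.queue):
            if queued_owner == owner:
                ahead = self.granted + self.queue[:i]
                return [o for o, held in ahead if held not in COMPATIBLE[mode]]
        return []


def build_scenario():
    """Replay the lock requests in the order given in the description."""
    rstl = Resource("RSTL")
    coll = Resource("collection")
    coll.request("prepared_txn", "IX")     # granted
    coll.request("index_build", "S")       # waits behind the granted IX
    rstl.request("recover_refresh", "IX")  # granted (index build released RSTL)
    coll.request("recover_refresh", "IX")  # waits behind the queued S
    rstl.request("step_down", "X")         # waits behind the granted RSTL IX
    return rstl, coll


def wait_edges(resources):
    """Map each waiting owner to the owners it is directly blocked by."""
    edges = {}
    for resource in resources:
        for owner, _ in resource.queue:
            edges[owner] = resource.blockers(owner)
    return edges


if __name__ == "__main__":
    for waiter, blockers in wait_edges(build_scenario()).items():
        print(f"{waiter} waits on {', '.join(blockers)}")
```

Under these assumptions the chain comes out as step_down waiting on recover_refresh, recover_refresh waiting on index_build, and index_build waiting on prepared_txn. The recover_refresh request conflicts only with the queued S, not with the granted IX, which matches the parenthetical remark in the description.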



 Comments   
Comment by Githook User [ 27/Jan/23 ]

Author: Jordi Serra Torrens <jordi.serra-torrens@mongodb.com> (jordist)

Message: SERVER-73218 Make shard/db metadata refresh interruptible
Branch: master
https://github.com/mongodb/mongo/commit/1c38cfa4a24055d6804a5fac121cddd29b4e8c3a

Comment by Jordi Serra Torrens [ 25/Jan/23 ]

Hey, we want to understand why the RecoverRefreshThread is not interruptible.

I'd say the only reason was to ensure this was always executed to signal that this refresh has finished and new ones can start. In the PR above I propose an alternative approach that does not require uninterruptibility.

Edit: We opted for a simpler approach where we keep the uninterruptible guard but remove the need to acquire the DB/Collection lock under it (it was unnecessary to begin with). Now under the uninterruptible lock guard all we take is a RESOURCE_MUTEX.
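
To illustrate why dropping the DB/Collection lock from under the guard helps, here is a rough sketch that replays the lock requests before and after the change. Everything in it is an assumption made for illustration: the strict FIFO conflict rule, the owner names, the `refresh_mutex` resource name, and modelling the RESOURCE_MUTEX as an X request on its own resource. It is not the code from the commit.

```python
# Hypothetical model: for each lock mode, the modes it can coexist with.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}


def blocked_owners(requests):
    """Replay (owner, resource, mode) requests in order under strict FIFO.

    A request blocks if it conflicts with any earlier request on the same
    resource, whether that earlier request was granted or is still queued.
    Returns the owners left waiting, in arrival order.
    """
    seen = {}  # resource -> modes requested so far
    blocked = []
    for owner, resource, mode in requests:
        ahead = seen.setdefault(resource, [])
        if any(earlier not in COMPATIBLE[mode] for earlier in ahead):
            blocked.append(owner)
        ahead.append(mode)
    return blocked


# Requests shared by both variants, in the order given in the description.
COMMON = [
    ("prepared_txn", "collection", "IX"),
    ("index_build", "collection", "S"),
    ("recover_refresh", "RSTL", "IX"),
]

# Before: the refresh thread takes the collection lock under the guard.
BEFORE = COMMON + [
    ("recover_refresh", "collection", "IX"),
    ("step_down", "RSTL", "X"),
]

# After: under the guard it only takes a mutex nobody else contends for.
AFTER = COMMON + [
    ("recover_refresh", "refresh_mutex", "X"),
    ("step_down", "RSTL", "X"),
]

if __name__ == "__main__":
    print("before:", blocked_owners(BEFORE))
    print("after: ", blocked_owners(AFTER))
```

In this model the refresh thread is among the blocked owners before the change and is no longer blocked after it, so it can run to completion and release the RSTL that the step down is waiting for.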

Comment by Fausto Leyva (Inactive) [ 24/Jan/23 ]

Hey, we want to understand why the RecoverRefreshThread is not interruptible.
