[SERVER-67698] Investigate potential deadlocks in LockState
Created: 30/Jun/22  Updated: 01/Aug/22  Resolved: 29/Jul/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Jordi Olivares Provencio
Assignee: Geert Bosch
Resolution: Won't Do
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

It seems that Resource Mutexes are never released during a yield. This can lead to deadlocks, for example in the following case:

Thread A takes Lock A -> Lock B -> Resource Mutex A

Thread A yields, releasing Lock A and Lock B but keeping Resource Mutex A

Thread B takes Lock B -> waits until Resource Mutex A is available

Thread A wakes -> retakes Lock A -> waits until Lock B is available
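
The sequence above can be checked with a small wait-for graph. This is only a toy model to illustrate the cycle, not server code: `find_deadlock`, `holders` and `waiting` are hypothetical names, and each thread is assumed to wait on at most one resource at a time.

```python
def find_deadlock(holders, waiting):
    """Return the threads forming a wait-for cycle, or None if there is none.

    holders: resource name -> thread currently holding it
    waiting: thread -> the single resource it is blocked on
    """
    # Edge T -> U means thread T waits for a resource held by thread U.
    waits_for = {t: holders[r] for t, r in waiting.items() if r in holders}
    for start in waits_for:
        path = [start]
        seen = {start}
        current = start
        while current in waits_for:
            current = waits_for[current]
            if current == start:
                return path  # walked back to the start: deadlock
            if current in seen:
                break  # a cycle that does not include `start`
            seen.add(current)
            path.append(current)
    return None


# State after Thread A wakes and retakes Lock A (mutex kept across the yield).
holders = {
    "Resource Mutex A": "Thread A",
    "Lock A": "Thread A",
    "Lock B": "Thread B",
}
waiting = {"Thread A": "Lock B", "Thread B": "Resource Mutex A"}
cycle = find_deadlock(holders, waiting)

# Same point in time if the mutex had been released during the yield:
# Thread B would have acquired it instead of blocking on it.
holders_released = {
    "Lock A": "Thread A",
    "Lock B": "Thread B",
    "Resource Mutex A": "Thread B",
}
no_cycle = find_deadlock(holders_released, {"Thread A": "Lock B"})
```

In the first state each thread waits on something the other holds, so a cycle is reported. In the second state Thread A only waits for Thread B, which is not blocked, so no cycle is found.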

 

Additionally, it seems that the order in which the locks were taken is not preserved when they are stored in the LockSnapshot here. This is also unsafe: after a yield the locks may be reacquired in a different order than they were originally taken, which can cause deadlocks too.
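
As a rough illustration of the ordering concern, here is a toy locker whose snapshot keeps acquisition order, so that restoring after a yield retakes locks in the order they were first taken. `ToyLocker` and its methods are hypothetical and do not mirror the real LockState or LockSnapshot code. The resource names and modes are placeholders.

```python
class ToyLocker:
    """Toy model only; not the server's locker implementation."""

    def __init__(self):
        # (resource, mode) pairs in the order they were acquired.
        self._held = []

    def lock(self, resource, mode):
        self._held.append((resource, mode))

    def held(self):
        return list(self._held)

    def save_and_unlock(self):
        # Copying the list keeps acquisition order in the snapshot.
        snapshot = list(self._held)
        self._held.clear()
        return snapshot

    def restore(self, snapshot):
        # Reacquire in exactly the order recorded in the snapshot.
        for resource, mode in snapshot:
            self.lock(resource, mode)


locker = ToyLocker()
for resource in ["Global", "Database", "Collection"]:
    locker.lock(resource, "IX")

before = locker.held()
snapshot = locker.save_and_unlock()
held_during_yield = locker.held()
locker.restore(snapshot)
after = locker.held()

# A snapshot that reorders entries (for example by sorting them) would
# retake the locks in a different order than they were first taken.
reordered = sorted(snapshot)
```

If two threads restore from snapshots that were reordered differently, they can end up taking the same pair of locks in opposite orders, which is the classic lock-order inversion.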



 Comments   
Comment by Allison Easton [ 01/Aug/22 ]

Sorry about that. No, it was for this ticket. During the FCV upgrade to 6.0, we set the orphan counts on the range deletion documents. To do this, we use an index scan with yield policy YIELD_AUTO. I left the comment here because this ticket came out of an investigation into whether we could be yielding during that scan, since a yield could result in incorrect orphan counts.

It turned out not to be a problem, but only because the lock used in that code is an X resource lock, which isn't yielded because resource mutexes aren't currently yielded.

I just wanted to make sure that if we start yielding resource mutexes, this code doesn't start yielding and cause inaccurate orphan counts.

Comment by Connie Chen [ 29/Jul/22 ]

allison.easton@mongodb.com - let us know if you meant for the above comment to be on another ticket?

 

Comment by Allison Easton [ 30/Jun/22 ]

Just adding a comment here to note that the upgrade to 6.0 relies on resource mutexes not being released during a yield for orphan counts to be correct, so if this gets changed on 6.0 we should make sure the upgrade still works correctly.

Generated at Thu Feb 08 06:08:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.