[SERVER-79761] GlobalLock can segfault due to not actually holding lock Created: 04/Aug/23  Updated: 22/Sep/23  Resolved: 22/Sep/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matt Boros Assignee: Josef Ahmad
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-70338 Query yield accesses the storage engi... Closed
Related
related to SERVER-81333 Make the GlobalLock and DBLock RAII t... Open
Assigned Teams:
Storage Execution EMEA
Operating System: ALL
Sprint: Execution EMEA Team 2023-10-02
Participants:
Linked BF Score: 10

 Description   

GlobalLock caches the result of its lock acquisition. Methods like isLocked reference this cached result. In the GlobalLock destructor we call isLocked to determine whether we need to abandon the current snapshot.

When we yield, we release locks using the lock manager but do not inform the global locks higher up in the stack of our release. During a yield the GlobalLock cached _result will falsely indicate a locked state. When a GlobalLock is destroyed during a yield, it can unsafely access the storage engine without a lock, causing a segfault.

Here is a patch demonstrating this issue. You can apply this diff to master and see that the lock is not actually held in some cases:

+++ b/src/mongo/db/concurrency/d_concurrency.cpp
@@ -175,6 +175,7 @@ Lock::GlobalLock::~GlobalLock() {
     auto* locker = _opCtx->lockState();
 
     if (isLocked()) {
+        invariant(_opCtx->lockState()->getLockMode(resourceIdGlobal) != LockMode::MODE_NONE);
         // Abandon our snapshot if destruction of the GlobalLock object results in actually
         // unlocking the global lock. Recursive locking and the two-phase locking protocol may
         // prevent lock release.

This was discovered in a 4.4 crash, BF-28945. We recovered a core dump where one thread is shutting down holding a global lock and destroying WT. The other thread is a GetMore command, running its GlobalLock destructor. The GetMore command segfaults while calling abandonSnapshot accessing the storage engine. It's very likely this failure is a result from this bug.

I am unsure of the severity of this bug because due to unfamiliarity with the code.



 Comments   
Comment by Josef Ahmad [ 22/Sep/23 ]

I'm closing this ticket as a duplicate of SERVER-70338. We decided not to backport SERVER-70338 to 4.4 because the backport is complex and risky. This issue has existed since at least 3.6, has only been observed in our internal continuous integration with no customer reports, and the only manifestation is a crash during server shutdown.

As a general improvement, I've also opened SERVER-81333 to make GlobalLock::isLocked() aware of query yields.

Comment by Josef Ahmad [ 21/Sep/23 ]

I'm currently trying to have GlobalLock::isLocked() return the right answer when the lock is yielded. That'll likely involve going through the lock manager.

Comment by Josef Ahmad [ 20/Sep/23 ]

I confirm the root cause to be SERVER-70338; the failure has been observed on v4.4 where the fix hasn't been backported due to complexity/risk. cc: louis.williams@mongodb.com
 
Before SERVER-70338, query yielding incorrectly rescinds the locks and then abandons the snapshot; the correct order is the opposite. Now, if a query in yielding state gets interrupted because of shutdown, the GlobalLock object - which is unaware of the lock being rescinded - attempts to abandon the snapshot and unlock. Unlocking an already unlocked resource is safely treated as a no-op; the problem is on abandonSnapshot: due to the wrong query yield ordering before SERVER-70338, abandoning the snapshot might still see an active recovery unit, and we might attempt to call into WiredTiger when it's shutting down.
 
One complicating factor in this investigation is that GlobalLock::isLocked has non-intuitive semantics. That function should ideally be named something like GlobalLock::isValidUnlessYielded. The definition of "valid" would be: the lock has been acquired once, and the object hasn't been moved from. However, if the lock has been yielded (which the object has no visibility of), that function still returns true, and the users of that method should be made aware of this potential pitfall.

Comment by Matt Boros [ 08/Aug/23 ]

The BF that lead to this discovery was from 4.4, and the patch where I show the invariant in the description failing is on master.

Comment by Gregory Noma [ 08/Aug/23 ]

We should figure out which versions this effects

Generated at Thu Feb 08 06:41:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.