[SERVER-48734] Fix db_repl_test in TSAN Created: 11/Jun/20  Updated: 29/Oct/23  Resolved: 07/Jul/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Bug Priority: Major - P3
Reporter: Lingzhi Deng Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: thread-sanitizer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-48687 ReplicationCoordinatorMock is not thr... Closed
Related
related to SERVER-48964 Potential data race in LockRequest.st... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2020-06-29, Repl 2020-07-13
Participants:
Linked BF Score: 0

 Description   

db_repl_test failed under TSAN. One cause was concurrent writes to _pings in ElapsedTracker. Since _pings is an int32_t, the race should be harmless apart from imprecise counts, but we should still make ElapsedTracker thread-safe, e.g. with AtomicWord or a mutex.

ElapsedTracker is not owned by repl. More data races may also surface in the test once ElapsedTracker is fixed, so it is hard to estimate how long it will take to make the test pass.
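
For illustration, a minimal standalone sketch of the suggested fix, using std::atomic<int32_t> as a stand-in for mongo's AtomicWord (the member names and threshold logic here are assumptions for the sketch, not the real ElapsedTracker):

#include <atomic>
#include <cstdint>

class ElapsedTrackerSketch {
public:
    explicit ElapsedTrackerSketch(int32_t hitsBetweenMarks)
        : _hitsBetweenMarks(hitsBetweenMarks) {}

    // Safe to call from multiple threads: the increment is a single atomic RMW,
    // so there is no longer a data race on the counter. A reset lost between
    // racing callers only skews the count, which matches the "imprecise counts
    // are harmless" observation above.
    bool intervalHasElapsed() {
        if (_pings.fetch_add(1, std::memory_order_relaxed) + 1 >= _hitsBetweenMarks) {
            _pings.store(0, std::memory_order_relaxed);
            return true;
        }
        return false;
    }

private:
    const int32_t _hitsBetweenMarks;
    std::atomic<int32_t> _pings{0};  // was a plain int32_t written from multiple threads
};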



 Comments   
Comment by Githook User [ 07/Jul/20 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-48734: Fix db_repl_test in TSAN
Branch: master
https://github.com/mongodb/mongo/commit/95ee10210b3b92b166410e70b33a5fce39ec0958

Comment by Lingzhi Deng [ 18/Jun/20 ]

Thanks for the investigation; that makes sense. Since there could be more issues along the way to making the whole unittest pass (and many of them may be common issues shared with other failures in the TSAN build variant), I think this ticket might eventually get too big. I agree we should have a separate ticket for this, at least to double-check that this is safe, so I filed SERVER-48964.
I will try your patch and continue investigating other issues we may still have in the repl unittests.

Comment by Louis Williams [ 18/Jun/20 ]

For example, this patch may fix the problem:

diff --git a/src/mongo/db/concurrency/lock_manager.cpp b/src/mongo/db/concurrency/lock_manager.cpp
index a9c7983ab9..1c8025d760 100644
--- a/src/mongo/db/concurrency/lock_manager.cpp
+++ b/src/mongo/db/concurrency/lock_manager.cpp
@@ -221,7 +221,9 @@ struct LockHead {
         }
 
         // No conflict, new request
-        request->status = LockRequest::STATUS_GRANTED;
+        if (request->status != LockRequest::STATUS_GRANTED) {
+            request->status = LockRequest::STATUS_GRANTED;
+        }
 
         grantedList.push_back(request);
         incGrantedModeCount(request->mode);

 

Comment by Louis Williams [ 18/Jun/20 ]

I updated my comment to provide a more thorough explanation of the failure. We should open a separate ticket to evaluate these issues. I'm not an expert on the LockManager, but I think there is more investigation to be done here.

In this instance, I think a fix would be to have the LockManager only write the state of an existing LockRequest if it isn't already STATUS_GRANTED. 

Comment by Lingzhi Deng [ 18/Jun/20 ]

Yes, but what the stack trace showed was that two threads called _lockBegin at the same time (which I believe violates how we should use the Locker): one thread inserted a request into the _requests map and then modified the newly inserted request object in LockHead::newRequest, while the other thread was in a dassert looping through the _requests map and accessing each request object in it.

If locking from multiple threads is not a proper way to use the Locker and you think this won't happen in production code, then maybe we should rewrite the test. The test currently uses the same opCtx for the rollback (and index builds) and for the test thread; maybe we could use a different opCtx for the test thread?

Comment by Andrew Morrow (Inactive) [ 18/Jun/20 ]

louis.williams - "failing TSAN" isn't really an option though, as we intend to run the entire server test suite under TSAN. Accessing the same resource under different mutexes isn't generally guaranteed to have the intended synchronization effects. Could you please elaborate on why it is safe to do so for the lock manager? We will also need some way forward: either revising the lock manager in a way that makes clear to TSAN the synchronization edges that render the behavior safe, or blacklisting the lock manager from TSAN, or suppressing the error at runtime. The latter two are less preferred, because they leave the lock manager with reduced or no thread sanitizer coverage, despite it sounding like exactly the kind of code for which TSAN coverage would be ideal.

Comment by Louis Williams [ 18/Jun/20 ]

Under the same Locker, no. Internally in the lock manager, multiple threads may access resources under different mutexes (as seen), but that is intended. Our lock manager will probably fail TSAN because of this, but I think we should take a look just to confirm.

It seems like this debug assertion is causing problems because it's reading its already-granted LockRequests.

The migratePartitionedLockHeads code is called when an operation tries to acquire a non-partitioned lock mode, S or X, which happens here in the test.

This is what is happening: an index build is holding an IX Database lock and attempting to take an IX Collection lock. This is concurrent with an operation taking an X Database lock. The database X lock acquisition does a migration of all other Database IX locks and ends up writing over the index build thread's existing IX database LockRequest with STATUS_GRANTED, even though the existing request would have already been in the same state. The index build thread only observed this because of the debug assertion on getWaitingStatus(). The net effect is that the thread sanitizer fails.
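
For illustration, a minimal standalone sketch (not lock manager code; the struct and field names are placeholders) of the pattern being reported: one thread re-writes a field to the value it already holds while another thread reads it with no common synchronization. Even though the value never changes, the unsynchronized write/read pair is still a data race under the C++ memory model, which is exactly what TSAN is designed to report and why guarding the redundant write (as in the patch above) makes the report go away.

#include <thread>

struct Request {
    int status = 1;  // stands in for LockRequest::STATUS_GRANTED
};

int main() {
    Request req;
    std::thread granter([&req] {
        req.status = 1;  // re-grant: writes the same value the request already has
    });
    std::thread reader([&req] {
        (void)(req.status == 1);  // dassert-style read of the same field, no shared lock
    });
    granter.join();
    reader.join();
    return 0;
}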

I'm not sure what the procedure is for TSAN failures since this is new, but we can make this specific failure go away by changing the test to take Collection X locks instead of Database X locks (they shouldn't be there, anyway). There might still be TSAN failures in this unit test and elsewhere, however, and we would need to investigate those, too.

Comment by Lingzhi Deng [ 12/Jun/20 ]

The third issue is the ReplicationCoordinatorMock, SERVER-48687

Comment by Lingzhi Deng [ 12/Jun/20 ]

Another issue is that sharedCallbackStateDestroyed in oplog_fetcher_test.cpp should be an AtomicWord<bool>, because it is set in the oplog fetcher callback but read in the test thread.
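
For illustration, a minimal standalone sketch of that change, using std::atomic<bool> as a stand-in for mongo's AtomicWord<bool> (the surrounding structure is an assumption for the sketch, not the actual oplog_fetcher_test.cpp code):

#include <atomic>
#include <thread>

std::atomic<bool> sharedCallbackStateDestroyed{false};

int main() {
    // Stand-in for the oplog fetcher callback: signal that the shared
    // callback state was destroyed.
    std::thread callbackThread([] { sharedCallbackStateDestroyed.store(true); });

    // Stand-in for the test thread: poll the flag. With a plain bool this read
    // would race with the write above; with an atomic it does not.
    while (!sharedCallbackStateDestroyed.load()) {
    }

    callbackThread.join();
    return 0;
}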
