[SERVER-48734] Fix db_repl_test in TSAN
Created: 11/Jun/20 | Updated: 29/Oct/23 | Resolved: 07/Jul/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Lingzhi Deng | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | thread-sanitizer |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Repl 2020-06-29, Repl 2020-07-13 |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
db_repl_test failed under TSAN. One reason was concurrent writes to _pings in ElapsedTracker. Since _pings is an int32_t, the race should be harmless apart from imprecise counts, but we should probably fix ElapsedTracker to make it thread-safe, using AtomicWord or a mutex. ElapsedTracker is not owned by repl. However, more data races could surface in the test once ElapsedTracker is fixed, so it is hard to estimate how long it will take to make the test pass.
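A minimal sketch of the kind of change being proposed, assuming a simplified ElapsedTracker-like class (the real class lives in the server's util library and its interface differs); the point is only that the ping counter becomes atomic so concurrent callers no longer race on a plain int32_t:

```cpp
// Hypothetical, simplified sketch; not the real ElapsedTracker.
#include <atomic>
#include <cstdint>

class SimpleElapsedTracker {
public:
    explicit SimpleElapsedTracker(int32_t hitsBetweenMarks)
        : _hitsBetweenMarks(hitsBetweenMarks) {}

    // Safe to call from multiple threads: the increment is atomic, so the
    // count stays precise and TSAN no longer reports a data race on _pings.
    bool intervalHasElapsed() {
        return _pings.fetch_add(1, std::memory_order_relaxed) + 1 >= _hitsBetweenMarks;
    }

private:
    const int32_t _hitsBetweenMarks;
    std::atomic<int32_t> _pings{0};  // was a plain int32_t written concurrently
};
```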
| Comments |
| Comment by Githook User [ 07/Jul/20 ] |

Author: Lingzhi Deng <lingzhi.deng@mongodb.com> (ldennis)
Message:
| Comment by Lingzhi Deng [ 18/Jun/20 ] |

Thanks for the investigation; that makes sense. Since there could be more issues along the way to making the whole unit test pass (and many of them may be common issues shared with other failures in the TSAN build variant), I think this ticket might eventually get too big. I agree that we should have a separate ticket for this, at least to double-check that it is safe. So I filed
| Comment by Louis Williams [ 18/Jun/20 ] |

For example, this patch may fix the problem:
| Comment by Louis Williams [ 18/Jun/20 ] |

I updated my comment to provide a more thorough explanation of the failure. We should open a separate ticket to evaluate these issues. I'm not an expert on the LockManager, but I think there is more investigation to be done here. In this instance, I think a fix would be to have the LockManager only write the state of an existing LockRequest if it isn't already STATUS_GRANTED.
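A self-contained sketch of the guard described above, using hypothetical names (the real LockManager and LockRequest types are much richer than this); the idea is simply to skip the redundant store when the request is already granted, so the migration path does not write over a field another thread may be reading:

```cpp
// Hypothetical sketch only; not the actual LockManager code.
struct LockRequest {
    enum Status { STATUS_NEW, STATUS_GRANTED, STATUS_WAITING };
    Status status = STATUS_NEW;
};

// Only write the status if it is not already granted. The redundant store on
// an already-granted request is what the sanitizer reported as racing with the
// reader iterating the Locker's _requests map.
inline void grantDuringMigration(LockRequest* request) {
    if (request->status != LockRequest::STATUS_GRANTED) {
        request->status = LockRequest::STATUS_GRANTED;
    }
}
```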
| Comment by Lingzhi Deng [ 18/Jun/20 ] |

Yes, but what the stack trace showed was two threads calling _lockBegin at the same time (which I believe violates how we should use the Locker): one thread inserted a request into the _requests map and subsequently modified the newly inserted request object in LockHead::newRequest, while the other thread was in a dassert looping through the _requests map, accessing each request object in it. If locking from multiple threads is not a proper way to use the Locker and you think this won't happen in production code, then maybe we should rewrite the test? The test currently uses the same opCtx for the rollback (and index builds) and the test thread. Maybe we can use a different opCtx for the test?
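If rewriting the test is the direction taken, a hedged fragment of what that might look like; the surrounding fixture code and the client name here are assumptions, not the actual test:

```cpp
// Hypothetical fragment (not a complete test): give the test thread its own
// Client and OperationContext instead of reusing the rollback thread's opCtx,
// so the two threads no longer share a Locker.
auto testClient = getServiceContext()->makeClient("db_repl_test-main");
auto testOpCtx = testClient->makeOperationContext();

// Lock acquisitions made through testOpCtx now go through a separate Locker,
// so _lockBegin is no longer entered concurrently on the same Locker object.
```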
| Comment by Andrew Morrow (Inactive) [ 18/Jun/20 ] |

louis.williams - "failing TSAN" isn't really an option though, as we intend to run the entire server test suite under TSAN. Accessing the same resource under different mutexes isn't generally guaranteed to have the intended synchronization effects. Could you please elaborate on why it is safe to do so in the lock manager? We will also need some way forward: either revising the lock manager in a way that makes the synchronization edges that render the behavior safe visible to TSAN, or blacklisting the lock manager from TSAN, or suppressing the error at runtime. The latter two are less preferable, because they leave the lock manager with reduced or no thread sanitizer coverage, even though it sounds like exactly the kind of code for which TSAN coverage would be ideal.
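For reference, "suppressing the error at runtime" usually means a ThreadSanitizer suppressions file passed via TSAN_OPTIONS. The entries below are hypothetical patterns, not the actual frames from this report:

```
# Hypothetical tsan.suppressions entries; real entries should match the
# exact frames reported by ThreadSanitizer.
race:mongo::LockManager

# Enabled at runtime, e.g.:
#   TSAN_OPTIONS="suppressions=path/to/tsan.suppressions" ./db_repl_test
```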
| Comment by Louis Williams [ 18/Jun/20 ] |

Under the same Locker, no. Internally in the lock manager, multiple threads may access resources under different mutexes (as seen), but that is intended. Our lock manager will probably fail TSAN because of this, but I think we should take a look just to confirm.

It seems like this debug assertion is causing problems because it's reading its already-granted LockRequests. The migratePartitionedLockHeads code is called when an operation tries to acquire a non-partitioned lock mode, S or X, which happens here in the test. This is what is happening: an index build is holding an IX database lock and attempting to take an IX collection lock. This is concurrent with an operation taking an X database lock. The database X lock acquisition migrates all other database IX locks and ends up writing over the index build thread's existing IX database LockRequest with STATUS_GRANTED, even though the existing request was already in that state. The index build thread only observed this because of the debug assertion on getWaitingStatus(). The net effect is that the thread sanitizer fails.

I'm not sure what the procedure is for TSAN failures since this is new, but we can make this specific failure go away by changing the test to take Collection X locks instead of Database X locks (they shouldn't be there anyway); see the sketch below. There might still be TSAN failures in this unit test and elsewhere, however, and we would need to investigate those, too.
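A rough illustration of that suggested test change; the exact Lock::DBLock / Lock::CollectionLock constructor signatures vary across server versions, and "test.coll" is a placeholder namespace, so treat this as a sketch rather than the real patch:

```cpp
// Illustrative sketch only: swap the test's database-level X lock for a
// collection-level X lock so the acquisition no longer triggers
// migratePartitionedLockHeads against the index builder's IX database lock.
#include "mongo/db/concurrency/d_concurrency.h"
#include "mongo/db/concurrency/lock_manager_defs.h"
#include "mongo/db/namespace_string.h"

namespace mongo {
void takeTestLock(OperationContext* opCtx) {
    // Before (roughly): a database-level exclusive lock, which forces a
    // migration of the index builder's partitioned IX database LockRequest.
    //   Lock::DBLock dbXLock(opCtx, "test", MODE_X);

    // After (proposed): a collection-level exclusive lock, which avoids
    // rewriting the other thread's already-granted database IX request.
    // (In real code the database would already be locked in at least IX.)
    Lock::CollectionLock collXLock(opCtx, NamespaceString("test.coll"), MODE_X);
}
}  // namespace mongo
```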
| Comment by Lingzhi Deng [ 12/Jun/20 ] |

The third issue is the ReplicationCoordinatorMock,
| Comment by Lingzhi Deng [ 12/Jun/20 ] |

Another issue is that sharedCallbackStateDestroyed in oplog_fetcher_test.cpp should be an AtomicWord<bool>, because it is set in the oplog fetcher callback but read in the test thread.
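A minimal sketch of that change, assuming the flag is a file-scope variable in the test (only the flag's name comes from the comment above; the surrounding usage is illustrative):

```cpp
// Illustrative sketch: make the flag an AtomicWord<bool> so the write in the
// oplog fetcher callback and the read in the test thread are synchronized.
#include "mongo/platform/atomic_word.h"

namespace {
// Before: bool sharedCallbackStateDestroyed = false;
mongo::AtomicWord<bool> sharedCallbackStateDestroyed{false};
}  // namespace

// In the fetcher callback / shared-state destructor:
//     sharedCallbackStateDestroyed.store(true);
// In the test thread:
//     ASSERT_TRUE(sharedCallbackStateDestroyed.load());
```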