[SERVER-47566] CollectionShardingRuntime::getCriticalSectionSignal accesses _critSec without locks Created: 15/Apr/20 Updated: 29/Oct/23 Resolved: 23/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Randolph Tan | Assignee: | Randolph Tan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | PM-1645-Milestone-1 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Sharding 2020-05-04 | ||||||||
| Participants: | |||||||||
| Description |
|
Currently, _critSec by itself is not thread safe. All other methods using _critSec takes the CSRLock except for this method. ShardingMigrationCriticalSection::exitCriticalSection() must lock such that ShardingMigrationCriticalSection::getSignal cannot read the _critSecSignal value while _critSecSignal is being written simultaneously – C++ memory bug. |
| Comments |
| Comment by Githook User [ 23/Apr/20 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}Message: |
| Comment by Randolph Tan [ 16/Apr/20 ] |
|
Yes. ExitCrit was taking X lock in 4.0 and downgraded to IX in 4.2. |
| Comment by Kaloian Manassiev [ 16/Apr/20 ] |
|
Yeah, Dianna is right that it is still a bug for ShardingMigrationCriticalSection::getSignal, because even though it is called under a Collection Intent-lock everywhere, we seem to reset it under Collection IX + CSR X lock, which is not sufficient. We should be reading it with CSR S-lock. I think this might be since 4.2 if I am not mistaken, when we relaxed some of the collection X-lock requirements. renctan, I am going to assign it to you for the next iteration, can you please confirm if it is in 4.2/4.4? |
| Comment by Dianna Hohensee (Inactive) [ 15/Apr/20 ] |
|
I took a spin through the DatabaseShardingState critical section flag concurrency. It takes the DSSLock for all critical section things, so it looks good. I got to wondering – noting it here in case someone else wonders. |
| Comment by Dianna Hohensee (Inactive) [ 15/Apr/20 ] |
|
I think this is still a bug, even with collection locks taken today. MODE_IS and MODE_IX locks are not exclusive of one another. So ShardingMigrationCriticalSection::exitCriticalSection() resets the _criticalSection value under a MODE_IX collection lock, and the ShardingMigrationCriticalSection::getSignal reads the _criticalSection value under a MODE_IS collection lock. A read and write of the same memory at the same time. |
| Comment by Randolph Tan [ 15/Apr/20 ] |
|
Note: This is currently ok since all callers take collection IS, but this should be tighten up to allow lock free reads. |