[SERVER-36439] Fix the concurrency control of the LogicalTimeValidator Created: 03/Aug/18 Updated: 29/Oct/23 Resolved: 19/Nov/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Janna Golden | Assignee: | Anton Oyung |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Backport Requested: |
v4.0, v3.6
|
||||||||||||
| Sprint: | Sharding 2019-01-28, Sharding 2019-02-11, Sharding 2019-11-04, Sharding 2019-11-18, Sharding 2019-12-02 | ||||||||||||
| Participants: | |||||||||||||
| Linked BF Score: | 58 | ||||||||||||
| Description |
|
The concurrency control of the LogicalTimeValidator with respect to the KeyManager is not fully correct. Specifically, the _mutexKeyManager seems to serve more the purpose of a barrier than a mutex. Most likely, because the KeyManager is self-synchronizing we don't need the _mutexKeyManager. As implemented currently, a deadlock can occur between LogicalTimeValidator::shutDown() and LogicalTimeValidator::_getKeyManagerCopy() if a periodic refresh occurs during shutdown. |
| Comments |
| Comment by Misha Tyulenev [ 06/Dec/19 ] |
|
The concurrency issue just showed up in BF-15613. There the deadlock occurred when there were a message that mongos wanted to send while it was in shutdown. |
| Comment by Githook User [ 19/Nov/19 ] |
|
Author: {'username': 'AntonOyung', 'email': 'anton.oyung@mongodb.com', 'name': 'Anton Oyung'}Message: |
| Comment by Kaloian Manassiev [ 14/Nov/19 ] |
|
In that case yes, I think it is safe to throw out this mutex. Thanks, Anton. |
| Comment by Anton Oyung [ 13/Nov/19 ] |
|
kaloian.manassiev yes that's correct. The idea of adding the extra mutex on the KeyCollectionManager is in case future functionality is added to it that could possibly cause concurrency issues in the future. But maybe it's not necessary to implement that now. |
| Comment by Kaloian Manassiev [ 13/Nov/19 ] |
|
anton.oyung, what you are saying is that _mutexKeyManager protects the setting of the _keyManager field, but because it is only set in the constructor once, we can get rid of the mutex altogether. Do I understand it correctly? Because I see also that the stopMonitoring, clearCache and resetCache methods are being called under that mutex as well. However, given that the KeyManager is supposed to be self-synchronising, I think we don't need a separate outside mutex to guard it as well and because of this we should be able to throw out _mutexKeyManager indeed. misha.tyulenev, does this ^ sound reasonable? |
| Comment by Anton Oyung [ 12/Nov/19 ] |
|
It looks like at one point we used to call `_keyManager.reset()` here https://github.com/mongodb/mongo/commit/eeee1e2b64f70e8487f017ba579f3ca861c81e4f#diff-78b4e45965b90ad3d214c8cc3159d89aR213 which required us to have `_mutexKeyManager`. Since then it’s been deleted so it seems like this getter with the mutex isn’t necessary and we can instead call _keyManager directly in all of the places the getter was used. Additionally from my investigation the `KeyCollectionManager` seems thread safe since both the refresher and cache are thread safe themselves. But it might be a good idea to add a mutex and acquire it whenever the `KeyCollectionManager` does something. Does this approach sound alright or is there something else I should consider? |
| Comment by Kaloian Manassiev [ 09/May/19 ] |
|
The concurrency control of the LogicalTimeValidator with respect to the KeyManager looks pretty janky and is not fully correct. Specifically, the _mutexKeyManager seems to serve more the purpose of a barrier than a mutex. Because of this, I think the deadlock itself is the lesser problem, because it can only happen on shutdown. I am re-purposing this ticket to mean "Fix the concurrency control of LogicalTimeValidator". |
| Comment by Misha Tyulenev [ 06/May/19 ] |
|
kaloian.manassiev the blocking calls are https://github.com/mongodb/mongo/blob/master/src/mongo/db/logical_time_validator.cpp#L201 and https://github.com/mongodb/mongo/blob/master/src/mongo/db/logical_time_validator.cpp#L211 . _getKeyManagerCopy will serialize on the _mutexKeyManager |
| Comment by Kaloian Manassiev [ 06/May/19 ] |
|
The description on the ticket doesn't look right. The LogicalTimeValidator::_getKeyManagerCopy() call only takes a single mutex and doesn't do any blocking operation, so it cannot possibly be part of a deadlock cycle. misha.tyulenev, janna.golden: I think we should just throw out the _mutexKeyManager since it doesn't do anything useful. In either case though, this is not an RC blocker. |
| Comment by Misha Tyulenev [ 19/Nov/18 ] |
|
greg.mckeon I believe this will happen in all versions starting 3.6 |
| Comment by Kaloian Manassiev [ 19/Nov/18 ] |
|
This is probably more of a question for jack.mulrow or misha.tyulenev. I have not looked at any of the call stacks and have no idea if any of them changed after 3.6. |
| Comment by Gregory McKeon (Inactive) [ 19/Nov/18 ] |
|
janna.golden and kaloian.manassiev can this only happen in 3.6? Should it be 3.6 required? |