[SERVER-36439] Fix the concurrency control of the LogicalTimeValidator Created: 03/Aug/18  Updated: 29/Oct/23  Resolved: 19/Nov/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.3.2

Type: Bug Priority: Major - P3
Reporter: Janna Golden Assignee: Anton Oyung
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: Sharding 2019-01-28, Sharding 2019-02-11, Sharding 2019-11-04, Sharding 2019-11-18, Sharding 2019-12-02
Participants:
Linked BF Score: 58

 Description   

The concurrency control of the LogicalTimeValidator with respect to the KeyManager is not fully correct. Specifically, the _mutexKeyManager seems to serve more the purpose of a barrier than a mutex. Most likely, because the KeyManager is self-synchronizing we don't need the _mutexKeyManager.

As implemented currently, a deadlock can occur between LogicalTimeValidator::shutDown() and LogicalTimeValidator::_getKeyManagerCopy() if a periodic refresh occurs during shutdown.



 Comments   
Comment by Misha Tyulenev [ 06/Dec/19 ]

The concurrency issue just showed up in BF-15613. There the deadlock occurred when there were a message that mongos wanted to send while it was in shutdown.
The trySignLogicalTime needs _getKeyManagerCopy that takes a lock
But the lock is already taken as its in shutdown.

Comment by Githook User [ 19/Nov/19 ]

Author:

{'username': 'AntonOyung', 'email': 'anton.oyung@mongodb.com', 'name': 'Anton Oyung'}

Message: SERVER-36439 Fix the concurrency control of the LogicalTimeValidator
Branch: master
https://github.com/mongodb/mongo/commit/5b73c31741584597bae0455a50364513c05dc1d1

Comment by Kaloian Manassiev [ 14/Nov/19 ]

In that case yes, I think it is safe to throw out this mutex. Thanks, Anton.

Comment by Anton Oyung [ 13/Nov/19 ]

kaloian.manassiev yes that's correct. The idea of adding the extra mutex on the KeyCollectionManager is in case future functionality is added to it that could possibly cause concurrency issues in the future. But maybe it's not necessary to implement that now. 

Comment by Kaloian Manassiev [ 13/Nov/19 ]

anton.oyung, what you are saying is that _mutexKeyManager protects the setting of the _keyManager field, but because it is only set in the constructor once, we can get rid of the mutex altogether. Do I understand it correctly?

Because I see also that the stopMonitoring, clearCache and resetCache methods are being called under that mutex as well. However, given that the KeyManager is supposed to be self-synchronising, I think we don't need a separate outside mutex to guard it as well and because of this we should be able to throw out _mutexKeyManager indeed.

misha.tyulenev, does this ^ sound reasonable?

Comment by Anton Oyung [ 12/Nov/19 ]

It looks like at one point we used to call `_keyManager.reset()` here https://github.com/mongodb/mongo/commit/eeee1e2b64f70e8487f017ba579f3ca861c81e4f#diff-78b4e45965b90ad3d214c8cc3159d89aR213 which required us to have `_mutexKeyManager`. Since then it’s been deleted so it seems like this getter with the mutex isn’t necessary and we can instead call _keyManager directly in all of the places the getter was used.

Additionally from my investigation the `KeyCollectionManager` seems thread safe since both the refresher and cache are thread safe themselves. But it might be a good idea to add a mutex and acquire it whenever the `KeyCollectionManager` does something. 

Does this approach sound alright or is there something else I should consider?

Comment by Kaloian Manassiev [ 09/May/19 ]

The concurrency control of the LogicalTimeValidator with respect to the KeyManager looks pretty janky and is not fully correct. Specifically, the _mutexKeyManager seems to serve more the purpose of a barrier than a mutex. Because of this, I think the deadlock itself is the lesser problem, because it can only happen on shutdown.

I am re-purposing this ticket to mean "Fix the concurrency control of LogicalTimeValidator".

Comment by Misha Tyulenev [ 06/May/19 ]

kaloian.manassiev the blocking calls are https://github.com/mongodb/mongo/blob/master/src/mongo/db/logical_time_validator.cpp#L201 and https://github.com/mongodb/mongo/blob/master/src/mongo/db/logical_time_validator.cpp#L211 . _getKeyManagerCopy will serialize on the _mutexKeyManager

Comment by Kaloian Manassiev [ 06/May/19 ]

The description on the ticket doesn't look right. The LogicalTimeValidator::_getKeyManagerCopy() call only takes a single mutex and doesn't do any blocking operation, so it cannot possibly be part of a deadlock cycle.

misha.tyulenev, janna.golden: I think we should just throw out the _mutexKeyManager since it doesn't do anything useful.

In either case though, this is not an RC blocker.

Comment by Misha Tyulenev [ 19/Nov/18 ]

greg.mckeon I believe this will happen in all versions starting 3.6

Comment by Kaloian Manassiev [ 19/Nov/18 ]

This is probably more of a question for jack.mulrow or misha.tyulenev. I have not looked at any of the call stacks and have no idea if any of them changed after 3.6.

Comment by Gregory McKeon (Inactive) [ 19/Nov/18 ]

janna.golden and kaloian.manassiev can this only happen in 3.6? Should it be 3.6 required?

Generated at Thu Feb 08 04:43:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.