|
suganthi.mani@mongodb.com I think that's fair and matt.broadstone@mongodb.com 's idea (making it a static method and passing the state document) is worth exploring. I think it can get slightly complex because the synchronization of the state doc is left to each POS' design. Due to this complexity I think this small scale fix is appropriate for the hot BF, and we should raise a separate ticket for the larger scale refactor. Let's discuss it tomorrow!
|
|
Just caught my attention, I don't really agree with the fix. It's a common pattern to create an opCtx with the instance mutex lock held. So, in future, it's easy to introduce the deadlock mentioned above in the shard split code. It also came up in other POS service, similar kind of deadlocks due to lock order violation between primaryOnlyService Mutex and InstanceMutex. I think we should have some kind of cleaner fix in the POS code. CC matt.broadstone@mongodb.com
|
|
Author:
{'name': 'Didier Nadeau', 'email': 'didier.nadeau@mongodb.com', 'username': 'nadeaudi'}
Message: SERVER-68194 Acquire lock after operation context creation to avoid deadlock
Branch: master
https://github.com/mongodb/mongo/commit/fddc60968b4e9b533e5709de1bb151c0353c0301
|
|
billy.donahue@mongodb.com alex.li@mongodb.com Thank you for the investigation! Feel free to re-assign to us. We can make the fix and also port some of the changes already done. I do see in the ticket that indeed we are holding `_mutex` while calling `makeOperationContext` which in turn also calls `PrimaryOnlyService::registerOpCtx` which also waits on the POS mutex lock.
We can see that in the backtrace of the BF
|
|
https://github.com/10gen/mongo/blob/9743d69c7dbc92fd5abcf9c799daff89e30ca233/src/mongo/db/serverless/shard_split_donor_service.cpp#L335),
const bool shouldRemoveStateDocumentOnRecipient = [&]() {
|
stdx::lock_guard<Latch> lg(_mutex);
|
auto opCtx = _cancelableOpCtxFactory->makeOperationContext(&cc());
|
return serverless::shouldRemoveStateDocumentOnRecipient(opCtx.get(), _stateDoc);
|
}();
|
If we moved the definition of lg down by one line would the deadlock be fixed?
|
|
The deadlock, now that we've identified it, is really serverless team's bug. Should we kick it back to them?
|
Generated at Thu Feb 08 06:10:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.