[SERVER-68194] Remove lock from ShardSplitDonorService::checkIfOptionsConflict Created: 20/Jul/22  Updated: 29/Oct/23  Resolved: 27/Jul/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 6.1.0-rc0
Fix Version/s: 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Alex Li Assignee: Didier Nadeau
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Server Serverless 2022-08-08
Participants:
Linked BF Score: 145

 Description   

Remove the lock from ShardSplitDonorService::DonorStateMachine::checkIfOptionsConflict. Create const member fields like tenant_migration_donor_service (https://github.com/10gen/mongo/blob/103cf67c1488288beb7d8bd68f3550a612d6d054/src/mongo/db/repl/tenant_migration_donor_service.h#L288-L296) so lock is not needed to check options.

Change is aimed to avoid deadlock when:

(1) Instance lock is held in ShardSplitDonorService::DonorStateMachine::run (https://github.com/10gen/mongo/blob/9743d69c7dbc92fd5abcf9c799daff89e30ca233/src/mongo/db/serverless/shard_split_donor_service.cpp#L335)

(2) POS lock is held ShardSplitDonorService::getOrCreateInstance (https://github.com/10gen/mongo/blob/9743d69c7dbc92fd5abcf9c799daff89e30ca233/src/mongo/db/repl/primary_only_service.cpp#L543)

(3) getOrCreateInstance -> ShardSplitDonorService::DonorStateMachine::checkIfOptionsConflict waits on instance lock (https://github.com/10gen/mongo/blob/9743d69c7dbc92fd5abcf9c799daff89e30ca233/src/mongo/db/serverless/shard_split_donor_service.cpp#L295) 

(2) onCreateOperationContext -> ShardSplitDonorService::registerOpCtx waits on POS lock (https://github.com/10gen/mongo/blob/9743d69c7dbc92fd5abcf9c799daff89e30ca233/src/mongo/db/repl/primary_only_service.cpp#L261). 



 Comments   
Comment by Didier Nadeau [ 28/Jul/22 ]

suganthi.mani@mongodb.com I think that's fair and matt.broadstone@mongodb.com 's idea (making it a static method and passing the state document) is worth exploring. I think it can get slightly complex because the synchronization of the state doc is left to each POS' design. Due to this complexity I think this small scale fix is appropriate for the hot BF, and we should raise a separate ticket for the larger scale refactor. Let's discuss it tomorrow!

Comment by Suganthi Mani [ 27/Jul/22 ]

Just caught my attention, I don't really agree with the fix. It's a common pattern to create an opCtx with the instance mutex lock held. So, in future, it's easy to introduce the deadlock mentioned above in the shard split code. It also came up in other POS service, similar kind of deadlocks due to lock order violation between primaryOnlyService Mutex and InstanceMutex. I think we should have some kind of cleaner fix in the POS code. CC matt.broadstone@mongodb.com

Comment by Githook User [ 27/Jul/22 ]

Author:

{'name': 'Didier Nadeau', 'email': 'didier.nadeau@mongodb.com', 'username': 'nadeaudi'}

Message: SERVER-68194 Acquire lock after operation context creation to avoid deadlock
Branch: master
https://github.com/mongodb/mongo/commit/fddc60968b4e9b533e5709de1bb151c0353c0301

Comment by Mathis Bessa [ 22/Jul/22 ]

billy.donahue@mongodb.com alex.li@mongodb.com Thank you for the investigation! Feel free to re-assign to us. We can make the fix and also port some of the changes already done. I do see in the ticket that indeed we are holding `_mutex` while calling `makeOperationContext` which in turn also calls `PrimaryOnlyService::registerOpCtx` which also waits on the POS mutex lock.
We can see that in the backtrace of the BF

Comment by Billy Donahue [ 21/Jul/22 ]

https://github.com/10gen/mongo/blob/9743d69c7dbc92fd5abcf9c799daff89e30ca233/src/mongo/db/serverless/shard_split_donor_service.cpp#L335),

    const bool shouldRemoveStateDocumentOnRecipient = [&]() {
        stdx::lock_guard<Latch> lg(_mutex);
        auto opCtx = _cancelableOpCtxFactory->makeOperationContext(&cc());
        return serverless::shouldRemoveStateDocumentOnRecipient(opCtx.get(), _stateDoc);
    }();

If we moved the definition of lg down by one line would the deadlock be fixed?

Comment by Billy Donahue [ 21/Jul/22 ]

The deadlock, now that we've identified it, is really serverless team's bug. Should we kick it back to them?

Generated at Thu Feb 08 06:10:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.