[SERVER-53084] ServerDiscoveryMonitor::shutdown can lead to deadlock Created: 25/Nov/20  Updated: 29/Oct/23  Resolved: 22/Dec/20

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Matthew Saltz (Inactive) Assignee: Matthew Saltz (Inactive)
Resolution: Fixed Votes: 0
Labels: servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-53684 Make the callback for the hello comma... Closed
related to SERVER-77707 Always invoke onReply callback out-of... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Service arch 2020-12-28
Participants:
Linked BF Score: 38

 Description   

In a callback for TaskExecutor::scheduleRemoteCommand here, we try to acquire a mutex.
This is a problem because ServerDiscoveryMonitor::shutdown acquires the same mutex before canceling the remaining tasks in the executor and shutting down the executor, and either cancellation or shutdown+join can cause those callbacks to run while the mutex is held, leading to a deadlock.

We should either reset the executor pointer and cancel the handles outside of the lock, or check the status passed into the callback for scheduleRemoteCommand and only acquire the mutex if it's an OK status.
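
A minimal, self-contained sketch of the failure mode and of the second option (names are hypothetical, not the actual ServerDiscoveryMonitor/TaskExecutor code): the fake executor runs its pending callback inline when canceled, mirroring how cancellation or shutdown+join can run callbacks on the calling thread, and the callback bails out on cancellation before touching the mutex.

{code:cpp}
// Self-contained sketch of the deadlock, not the real server code.
#include <functional>
#include <mutex>
#include <utility>

struct FakeExecutor {
    std::function<void(bool /*canceled*/)> pending;
    void cancel() {
        if (auto cb = std::exchange(pending, nullptr)) {
            cb(/*canceled=*/true);  // runs inline on the canceling thread
        }
    }
};

struct MonitorSketch {
    std::mutex _mutex;
    FakeExecutor _executor;

    void scheduleRequest() {
        _executor.pending = [this](bool canceled) {
            // Second option from above: return on a non-OK/canceled status
            // before acquiring the mutex, so a cancel from shutdown() is safe.
            if (canceled) {
                return;
            }
            std::lock_guard<std::mutex> lk(_mutex);
            // ... process the successful response ...
        };
    }

    void shutdown() {
        std::lock_guard<std::mutex> lk(_mutex);
        _executor.cancel();  // without the early return, this re-locks _mutex -> deadlock
    }
};
{code}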



 Comments   
Comment by Ian Whalen (Inactive) [ 04/Jan/21 ]

Author: Matthew Saltz <matthew.saltz@mongodb.com> (via evrg-bot-webhook)
Message: SERVER-53084 Run onFinish callback out of line in the NetworkInterfaceTL
Branch: master
https://github.com/mongodb/mongo/commit/e237f80590fe661eed0e3c9ea7988b8306dc3604

Comment by Matthew Saltz (Inactive) [ 22/Dec/20 ]

Gotcha, I thought that might have been what happened. I was just clarifying because we went with a solution in the NetworkInterfaceTL and not the ServerDiscoveryMonitor.

Comment by Daniel Gottlieb (Inactive) [ 22/Dec/20 ]

Ah, that makes sense then! The lack of a github push notification led me to believe that the CR was closed without pushing because a different problem was discovered.

Comment by Matthew Saltz (Inactive) [ 22/Dec/20 ]

Not that I'm aware of... Why do you ask?

Also just to clarify, the linked CR did get pushed. For whatever reason the github link didn't get added to the ticket automatically.

Comment by Daniel Gottlieb (Inactive) [ 22/Dec/20 ]

Apologies for commenting from the sideline: were the linked testing failures fixed (perhaps accidentally) as part of a different patch?

Comment by Matthew Saltz (Inactive) [ 22/Dec/20 ]

Closing this ticket since ServerDiscoveryMonitor was actually correct given the TaskExecutor::cancel contract. We can open a new ticket if we want to rework the locking as a cleanup.

Comment by Andrew Shuvalov (Inactive) [ 16/Dec/20 ]

ThreadPoolTaskExecutor::cancel takes its own lock, so calling it from SingleServerDiscoveryMonitor::shutdown() and then from SingleServerDiscoveryMonitor::_cancelOutstandingRequest nests locks. The mutex there protects fields like `_remoteCommandHandle` and `_executor`, but it is not necessary to hold it while invoking `_executor->cancel()`. I think this should be refactored. Will that be enough to fix the deadlock?

Saving the _executor in a local variable and invoking cancel() outside of the lock, and generally removing nested locks unless absolutely necessary, should always be a nice cleanup.
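
A rough sketch of that refactoring, reusing the fake-executor idea from the sketch in the description (member names are illustrative and do not match SingleServerDiscoveryMonitor exactly):

{code:cpp}
// Illustrative only: cancel outside the lock.
#include <functional>
#include <memory>
#include <mutex>
#include <utility>

struct FakeExecutor {
    std::function<void()> pending;
    void cancel() {
        if (auto cb = std::exchange(pending, nullptr)) {
            cb();  // may run the callback inline on this thread
        }
    }
};

struct MonitorSketch {
    std::mutex _mutex;
    std::shared_ptr<FakeExecutor> _executor;

    void shutdown() {
        std::shared_ptr<FakeExecutor> executor;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            executor = std::exchange(_executor, nullptr);  // fields still reset under the lock
        }
        // _mutex is released here, so a callback run inline by cancel() can
        // acquire it without deadlocking against this thread.
        if (executor) {
            executor->cancel();
        }
    }
};
{code}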

Comment by Matthew Saltz (Inactive) [ 16/Dec/20 ]

I now think that this actually started happening after (and due to) this commit that made the RSM call drop() in its destructor. FYI andrew.shuvalov.

Comment by Matthew Saltz (Inactive) [ 16/Dec/20 ]

The problem is that the RSM obtained here can be the last shared_ptr to that RSM, so it gets destroyed at the end of this scope. That is a problem because validateHosts happens on a networking thread, which makes the cancellation logic run inline and leads to the double mutex acquisition.
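
A small illustration of that chain, with hypothetical stand-ins for the RSM and the monitor types:

{code:cpp}
// Purely illustrative; hypothetical stand-ins for the RSM and the monitor.
#include <memory>

struct RsmSketch {
    void validateHosts() { /* hello response validation work */ }
    void drop() { /* ends up in ServerDiscoveryMonitor::shutdown(), which locks and cancels */ }
    ~RsmSketch() { drop(); }  // per the commit that made the RSM call drop() in its destructor
};

void networkThreadCallback(std::weak_ptr<RsmSketch> weakRsm) {
    if (auto rsm = weakRsm.lock()) {  // can be the last shared_ptr to the RSM
        rsm->validateHosts();
    }
    // If that was the last reference, ~RsmSketch runs right here on the
    // networking thread: drop() -> shutdown() takes the monitor mutex and
    // cancels the outstanding request, the cancellation callback runs inline
    // on this same thread, and it then tries to take the mutex again.
}
{code}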
