[SERVER-53084] ServerDiscoveryMonitor::shutdown can lead to deadlock Created: 25/Nov/20 Updated: 29/Oct/23 Resolved: 22/Dec/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Saltz (Inactive) | Assignee: | Matthew Saltz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | servicearch-wfbf-day |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Service arch 2020-12-28 |
| Participants: | |
| Linked BF Score: | 38 |
| Description |
|
In a callback for TaskExecutor::scheduleRemoteCommand here, we try to acquire a mutex. If shutdown() holds that mutex while it cancels the outstanding request, and the cancellation runs the callback inline, the callback blocks on the same mutex and we deadlock. We should either reset the executor pointer and cancel the handles outside of the lock, or check the status passed into the scheduleRemoteCommand callback and only acquire the mutex if it's an OK status. |
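The second proposed fix can be sketched as follows. This is a minimal, hypothetical model (the `Status` enum and `Monitor` type are stand-ins, not the real mongo types): the callback inspects the status before taking the mutex, so a cancellation delivered while shutdown() holds the lock returns early instead of self-deadlocking.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins for the real mongo::Status and monitor types.
enum class Status { OK, CallbackCanceled };

struct Monitor {
    std::mutex mutex;
    bool handledResponse = false;

    // Safer callback shape: check the status *before* acquiring the mutex,
    // so a canceled callback never contends with shutdown() for the lock.
    void onRemoteCommandResponse(Status status) {
        if (status != Status::OK) {
            return;  // canceled during shutdown; do not touch the mutex
        }
        std::lock_guard<std::mutex> lk(mutex);
        handledResponse = true;
    }
};
```

Under this shape, a callback invoked inline with `CallbackCanceled` is a no-op even if shutdown() currently holds `mutex`.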
| Comments |
| Comment by Ian Whalen (Inactive) [ 04/Jan/21 ] |
|
Author: Matthew Saltz (matthew.saltz@mongodb.com), via evrg-bot-webhook. Message: |
| Comment by Matthew Saltz (Inactive) [ 22/Dec/20 ] |
|
Gotcha, thought that might have been what happened |
| Comment by Daniel Gottlieb (Inactive) [ 22/Dec/20 ] |
|
Ah, that makes sense then! The lack of a github push notification led me to believe that the CR was closed without pushing because a different problem was discovered. |
| Comment by Matthew Saltz (Inactive) [ 22/Dec/20 ] |
|
Not that I'm aware of... Why do you ask? Also just to clarify, the linked CR did get pushed. For whatever reason the github link didn't get added to the ticket automatically. |
| Comment by Daniel Gottlieb (Inactive) [ 22/Dec/20 ] |
|
Apologies for commenting from the sideline: were the linked testing failures fixed (perhaps accidentally) as part of a different patch? |
| Comment by Matthew Saltz (Inactive) [ 22/Dec/20 ] |
|
Closing this ticket since ServerDiscoveryMonitor was actually correct given the TaskExecutor::cancel contract. We can make a new ticket if we want to change the locking around for cleanup purposes. |
| Comment by Andrew Shuvalov (Inactive) [ 16/Dec/20 ] |
|
ThreadPoolTaskExecutor::cancel takes its own lock, so calling it from SingleServerDiscoveryMonitor::shutdown() and then from SingleServerDiscoveryMonitor::_cancelOutstandingRequest nests locks. The monitor's mutex protects fields like `_remoteCommandHandle` and `_executor`, but it is not necessary to hold it while invoking `_executor->cancel()`. I think this should be refactored: save `_executor` in a local variable and invoke cancel() outside of the lock. Will that be enough to fix the deadlock? Removing a nested lock unless it is absolutely necessary is always a nice cleanup. |
| Comment by Matthew Saltz (Inactive) [ 16/Dec/20 ] |
|
I now think that this actually started happening after/due to this commit to make the RSM call drop() in its destructor. FYI andrew.shuvalov. |
| Comment by Matthew Saltz (Inactive) [ 16/Dec/20 ] |
|
The problem is that the RSM obtained here can be the last shared_ptr to that RSM, so the RSM is destroyed at the end of this scope. That is a problem because validateHosts runs on a networking thread, which causes the cancellation logic to run inline and leads to the double mutex acquisition. |
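The last-reference hazard described above can be demonstrated in isolation. This is a minimal sketch with a hypothetical `ReplicaSetMonitor` stand-in: when the copy held by the networking-thread callback turns out to be the only remaining shared_ptr, the destructor (and whatever cleanup it triggers, such as drop()) runs inline in that thread.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in; the real RSM runs drop()/cancellation in its
// destructor, which is the logic that ends up executing inline.
struct ReplicaSetMonitor {
    bool* destroyedFlag;
    explicit ReplicaSetMonitor(bool* f) : destroyedFlag(f) {}
    ~ReplicaSetMonitor() { *destroyedFlag = true; }
};

// Simulates the networking-thread callback: it takes ownership of what may
// be the *last* reference, so destruction happens at the end of this scope,
// on this thread. Returns true when this copy is the sole owner.
bool runCallbackWithLastRef(std::shared_ptr<ReplicaSetMonitor>&& rsm) {
    std::shared_ptr<ReplicaSetMonitor> local = std::move(rsm);
    return local.use_count() == 1;
}
```

If any other owner still held a reference, `use_count()` would exceed 1 and the destructor would run later, on whichever thread drops the final reference.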