[SERVER-32448] ShardRegistry::reload() does blocking work on NetworkInterface thread Created: 21/Dec/17 Updated: 27/Oct/23 Resolved: 04/May/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking |
| Affects Version/s: | 3.4.14, 3.6.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Samantha Ritter (Inactive) | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | ASE, ASIO, Sharding |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Sharding |
| Operating System: | ALL |
| Sprint: | Platforms 2018-01-01, Platforms 2018-01-15 |
| Participants: | |
| Description |
The ShardRegistry schedules itself to perform _internalReload()s periodically, using the TaskExecutor. These jobs are dispatched to NetworkInterfaceASIO and eventually run on its thread. From _internalReload() we call reload(), which tries to getAllShards() from the ShardingCatalogClientImpl. getAllShards() creates a Fetcher instance, which launches networking work and then waits for it to join(). However, at that point we are already on NetworkInterfaceASIO's thread. This violates the contract that callbacks run by NetworkInterfaceASIO must not perform blocking work. Worse, when these calls are issued through the same TaskExecutor, the thread deadlocks.
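To make the failure mode concrete, here is a minimal, self-contained sketch of the deadlock pattern. It does not use the real TaskExecutor/NetworkInterfaceASIO classes; SingleThreadExecutor and the scheduled lambdas are purely illustrative stand-ins for a callback that blocks on further work submitted to the same single networking thread (a timed wait stands in for the unbounded join() so the demo terminates):

{code:cpp}
// Minimal sketch, not the real TaskExecutor/NetworkInterfaceASIO API: a single
// worker thread runs all scheduled callbacks, the way NetworkInterfaceASIO's
// thread runs completion handlers.
#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class SingleThreadExecutor {
public:
    SingleThreadExecutor() : _worker([this] { _run(); }) {}

    ~SingleThreadExecutor() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _shutdown = true;
        }
        _cv.notify_one();
        _worker.join();
    }

    void schedule(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _queue.push(std::move(task));
        }
        _cv.notify_one();
    }

private:
    void _run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(_mutex);
                _cv.wait(lk, [this] { return _shutdown || !_queue.empty(); });
                if (_queue.empty())
                    return;  // shutting down and nothing left to run
                task = std::move(_queue.front());
                _queue.pop();
            }
            task();  // every callback runs on this one thread
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<std::function<void()>> _queue;
    bool _shutdown = false;
    std::thread _worker;
};

int main() {
    SingleThreadExecutor executor;

    // Analogue of _internalReload() running as a callback on the network thread.
    executor.schedule([&executor] {
        // Analogue of getAllShards(): launch more work through the same
        // executor and then block waiting for it, like Fetcher::join().
        auto fetched = std::make_shared<std::promise<void>>();
        executor.schedule([fetched] { fetched->set_value(); });

        // The timeout stands in for join(); in the real bug this wait is unbounded.
        auto status = fetched->get_future().wait_for(std::chrono::seconds(2));
        std::cout << (status == std::future_status::ready
                          ? "fetch completed\n"
                          : "fetch never ran (deadlock)\n");
    });

    std::this_thread::sleep_for(std::chrono::seconds(3));
}
{code}

With a dedicated thread per NetworkInterfaceASIO the nested work would complete on another thread; with one shared thread, the wait can never be satisfied while the outer callback is still running.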
| Comments |
| Comment by Kaloian Manassiev [ 04/May/18 ] |
The blocking work is done on a completely isolated TaskExecutor, which should not interfere with any of the other executors used for real networking work. Given the changes that the Platforms team is making to ASIO, I am opting to close this ticket as WAD.
| Comment by Andy Schwerin [ 24/Dec/17 ] |
Take a look at the AsyncResultsMerger. It solves a similar problem and has a more modern API. It may not be directly useful here, but it is worth understanding.
| Comment by Samantha Ritter (Inactive) [ 21/Dec/17 ] |
After talking to misha.tyulenev about this a little: some callers of ShardRegistry::reload() want to use it synchronously to run manual refreshes. The best way to approach this seems to be to offer both a synchronous reload() (the existing one) for those manual refreshes and an asynchronous one for the internal periodic refreshes. Writing a fully async reload() method requires an async _exhaustiveFindOnConfig, which in turn requires modifying the Fetcher so that it can be callback-driven instead of notifying waiters when it is finished. This shouldn't be too difficult, though; it's mostly a matter of re-framing the existing code around callbacks.
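A rough sketch of that layering, under the assumption that the config-server fetch can be made callback-driven (AsyncShardRegistry, reloadAsync, fetchAllShardsAsync and ShardEntry are made-up illustrative names, not the real ShardRegistry/Fetcher API):

{code:cpp}
#include <functional>
#include <future>
#include <iostream>
#include <string>
#include <vector>

struct ShardEntry {
    std::string name;
    std::string connectionString;
};

using ReloadCallback = std::function<void(std::vector<ShardEntry>)>;

class AsyncShardRegistry {
public:
    // Async variant for the internal periodic refresh: never blocks; the
    // continuation fires when the config-server fetch completes. This is what
    // a callback-driven Fetcher / async _exhaustiveFindOnConfig would enable.
    void reloadAsync(ReloadCallback onDone) {
        fetchAllShardsAsync([this, onDone](std::vector<ShardEntry> shards) {
            _shards = std::move(shards);  // synchronization omitted for brevity
            onDone(_shards);
        });
    }

    // Sync variant for callers that want a manual refresh: built on the async
    // path, so it blocks the caller's thread, never the networking thread.
    std::vector<ShardEntry> reload() {
        std::promise<std::vector<ShardEntry>> done;
        reloadAsync([&done](std::vector<ShardEntry> shards) {
            done.set_value(std::move(shards));
        });
        return done.get_future().get();
    }

private:
    // Stand-in for an async _exhaustiveFindOnConfig; here it completes inline
    // with canned data instead of issuing a real find against the config servers.
    void fetchAllShardsAsync(ReloadCallback onFetched) {
        onFetched({{"shard0", "rs0/host1:27018"}, {"shard1", "rs1/host2:27018"}});
    }

    std::vector<ShardEntry> _shards;
};

int main() {
    AsyncShardRegistry registry;
    registry.reloadAsync([](std::vector<ShardEntry> shards) {
        std::cout << "async reload saw " << shards.size() << " shards\n";
    });
    std::cout << "sync reload saw " << registry.reload().size() << " shards\n";
}
{code}

The point is the direction of the layering: the synchronous entry point is a thin wrapper over the asynchronous one, so the periodic refresh scheduled on the networking executor never has to block.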
| Comment by Samantha Ritter (Inactive) [ 21/Dec/17 ] |
It's true that the commenting in task_executor.h for the scheduleWork() method doesn't say anything about what kind of work may or may not be scheduled. However, all the work scheduled through the task executor (networking or not) is going to get run through asio's async execution engine, and any blocking work that runs in a callback will clog up the works.

I am doing a POC to see if we can have TaskExecutors share NetworkInterfaceASIO instances, so that they can all share the same connection pool. This forces all the users of those TaskExecutors onto one asio thread, which is how I ran into this now. Before, you're right, the networking call to _exhaustiveFindOnConfig was going through a different TaskExecutor, which would have had its own NetworkInterfaceASIO instance with its own thread, so it wouldn't have deadlocked.

Ok, so I think there are two main ways to address this that would unblock my POC for the Egress Deployment project:
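For context, a toy sketch of the wiring the POC changes (made-up types, not the real mongo::executor classes), showing the before/after ownership of the network interface:

{code:cpp}
#include <iostream>
#include <memory>
#include <string>

// Stand-in for NetworkInterfaceASIO: conceptually owns the connection pool and
// the single thread that runs completion callbacks.
struct NetInterface {
    explicit NetInterface(std::string name) : name(std::move(name)) {}
    std::string name;
};

// Stand-in for a TaskExecutor that routes all of its networking through a
// (possibly shared) network interface.
struct Executor {
    explicit Executor(std::shared_ptr<NetInterface> net) : net(std::move(net)) {}
    std::shared_ptr<NetInterface> net;
};

int main() {
    // Before: the reload executor and the config-server executor each own a
    // private interface, so each has its own thread, and blocking in a reload
    // callback cannot starve the _exhaustiveFindOnConfig traffic.
    Executor reloadBefore(std::make_shared<NetInterface>("reload-net"));
    Executor configBefore(std::make_shared<NetInterface>("config-net"));

    // After (the POC): every executor shares one interface, and therefore one
    // connection pool and one asio thread. A callback that blocks on further
    // networking now blocks the only thread that could complete it.
    auto shared = std::make_shared<NetInterface>("egress-net");
    Executor reloadAfter(shared);
    Executor configAfter(shared);

    std::cout << "before: " << reloadBefore.net->name << " vs "
              << configBefore.net->name << "\n";
    std::cout << "after:  " << reloadAfter.net->name << " == "
              << configAfter.net->name << "\n";
}
{code}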
| Comment by Kaloian Manassiev [ 21/Dec/17 ] |
I believe the ASIO-backed TaskExecutor was the only option available for scheduling tasks to run at some point in the future. Is there an alternative to that? Also, just for the record, the TaskExecutor used for reload should be a completely separate executor and shouldn't be used for networking I/O. Did you find this not to be the case?