[SERVER-32448] ShardRegistry::reload() does blocking work on NetworkInterface thread Created: 21/Dec/17  Updated: 27/Oct/23  Resolved: 04/May/18

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 3.4.14, 3.6.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Samantha Ritter (Inactive) Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Works as Designed Votes: 0
Labels: ASE, ASIO, Sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-32270 Use shared NetworkInterfaceASIO in mo... Closed
Assigned Teams:
Sharding
Operating System: ALL
Sprint: Platforms 2018-01-01, Platforms 2018-01-15
Participants:

 Description   

The ShardRegistry schedules itself to perform _internalReload()s periodically, using the TaskExecutor. These jobs are dispatched to NetworkInterfaceASIO, and will eventually run on its thread.

From _internalReload(), we call reload(), which tries to getAllShards() from the ShardingCatalogClientImpl. getAllShards() makes a Fetcher instance, which launches networking work, and then waits for it to join(). However, when we run this, we are already on NetworkInterfaceASIO's thread. This breaks the contract that callbacks to NetworkInterfaceASIO may not perform blocking work. Worse, when these calls are issued through the same TaskExecutor, the thread will deadlock.



 Comments   
Comment by Kaloian Manassiev [ 04/May/18 ]

The blocking work is done on a completely isolated TaskExecutor, which should not interfere with any other executors, which are used for a real networking work. Given the changes that the Platforms team is doing for ASIO, I am opting to close this ticket as WAD.

Comment by Andy Schwerin [ 24/Dec/17 ]

Take a look at the AsyncResultsMerger. It has a similar task, and a more modern API. May not be directly useful, but worth understanding.

Comment by Samantha Ritter (Inactive) [ 21/Dec/17 ]

After talking to misha.tyulenev about this a little, some callers of ShardRegistry::reload() want to use it synchronously to run manual refreshes. It seems like the best way to approach this would be to offer a synchronous reload() (the existing one) as well as an asynchronous one, to run the internal periodic refreshes, which can be async. To write a fully async reload() method requires an async _exhaustiveFindOnConfig, which requires modifications to the Fetcher so that it can be callback-driven instead of notifying waiters when it is finished. This shouldn't be too difficult to do, though, it's just a matter of re-framing the existing code around callbacks.

Comment by Samantha Ritter (Inactive) [ 21/Dec/17 ]

It's true that the commenting in task_executor.h for the scheduleWork() method doesn't say anything about what kind of work may or may not be scheduled. However, all the work scheduled through the task executor (networking or not) is going to get run through asio's async execution engine. Any blocking work that runs in a callback will clog up the works.

I am doing a POC to see if we can have TaskExecutors share NetworkInterfaceASIO instances, so that they can all share the same connection pool. This forces all the users of the TaskExecutors to share one asio thread, which is how I found this now. Before, you're right, the networking call to _exhaustiveFindOnConfig was going through a different TaskExecutor, which would have had its own NetworkInterfaceASIO instance with its own thread, so it wouldn't have deadlocked.

Ok, so I think there are two main ways to address this that would unblock my POC for the Egress Deployment project:
1. Offer an asynchronous _exhaustiveFindOnConfig method that this path could use instead. The problem is that this mostly-asynchronous workflow tries to do synchronous work (inside _exhaustiveFindOnConfig).
2. Do not use the TaskExecutor for this periodic scheduling, maybe the PeriodicRunner could be used instead? There is no 1:1 alternative to the scheduleWork() method of running future tasks that I know of, but maybe it can just run through the PeriodicRunner the way some other regularly-occurring background jobs do in the codebase.

Comment by Kaloian Manassiev [ 21/Dec/17 ]

I believe the TaskExecutor backed by ASIO was the only option available to schedule tasks at some point in the future. Is there an alternative to that?

Also just for the record, the TaskExecutor which is used for reload should be a completely separate executor and shouldn't be used for networking I/O. Did you find this to not be the case?

Generated at Thu Feb 08 04:30:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.