[SERVER-53761] Determine strategy to make the ServiceEntryPointCommon more asynchronous Created: 13/Jan/21 Updated: 05/Jan/23 |
|
| Status: | Blocked |
| Project: | Core Server |
| Component/s: | Performance |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Andrew Shuvalov (Inactive) | Assignee: | Backlog - Service Architecture |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Assigned Teams: |
Service Arch
|
||||||||||||||||
| Sprint: | Sharding 2021-02-22, Sharding 2021-03-08 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
Background: this is the follow-up to https://jira.mongodb.org/browse/SERVER-53505, which tries to prevent the service entry point from blocking while it waits on the tenant migration critical section blocker. The motivation for the fix is an anticipated outage: a tenant whose database is being migrated to another replica set sends too many clustered reads, all of which are blocked by the migration critical section. The executor serving client reads will then run out of free threads and/or create too many threads to compensate for the unavailable ones. Besides the blocker, we also have a prototype for asynchronous execution in runAsync; however, it also returns a synchronous future, which does not release the thread while it is waiting. In the end, the service command invocation does something like `handleRequest().get()`, like here. No matter how asynchronous the processing is, the `get()` still blocks the thread. This ticket is to discuss what we can do about that. |
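To make the difference concrete, here is a minimal C++17 sketch contrasting a blocking `get()` with a registered continuation. `OneShot`, `handleBlocking` and `handleAsync` are hypothetical names invented for this illustration; they are not MongoDB's Future/SemiFuture types, and `std::future` is used only as a stand-in for any future whose `get()` parks the calling thread.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical one-shot result holder (illustration only, not a MongoDB type).
template <typename T>
class OneShot {
public:
    // Registers a continuation; runs it right away if the value is already set.
    void onReady(std::function<void(T)> cb) {
        std::unique_lock<std::mutex> lk(_mutex);
        if (_value) {
            T v = *_value;
            lk.unlock();
            cb(std::move(v));
            return;
        }
        _callback = std::move(cb);
    }

    // Publishes the value and runs the continuation, if one was registered.
    void set(T value) {
        std::function<void(T)> cb;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            assert(!_value.has_value());  // one-shot: set() may be called once
            _value = value;
            cb = std::move(_callback);
        }
        if (cb) {
            cb(std::move(value));
        }
    }

private:
    std::mutex _mutex;
    std::optional<T> _value;
    std::function<void(T)> _callback;
};

// Blocking style: the calling thread is parked inside get() until the value
// arrives, so it cannot serve any other request in the meantime.
inline int handleBlocking(std::future<int> result) {
    return result.get();
}

// Continuation style: returns immediately; sendReply runs on whichever thread
// later calls set(), so the calling thread is free to pick up other work.
inline void handleAsync(OneShot<int>& result, std::function<void(int)> sendReply) {
    result.onReady(std::move(sendReply));
}
```

In this sketch the continuation runs inline on the thread that calls `set()`; a real implementation would more likely hand it back to an executor.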
| Comments |
| Comment by Andrew Shuvalov (Inactive) [ 07/Nov/22 ] |
|
This presentation explored the topic. I think there is an easy path forward. |
| Comment by Andrew Shuvalov (Inactive) [ 14/Jan/21 ] |
|
We had a discussion with amirsaman.memaripour and matthew.saltz about the problem of blocking code on the request path. At this point we have several instances of blocking (e.g. authentication), which is one of the roadblocks to enabling the new threading model in production. The new threading model is a capped processing thread pool; the old model is a thread per connection, which is definitely not scalable and is hard to control. The main problem with more asynchronous processing is the performance penalty. It is believed that the refactoring already done costs 5% in latency. At the same time, every non-asynchronous block carries the danger of creating a live-lock, where we may run out of threads for new requests. On this point I should comment that, in general, breaking off a continuation and rescheduling it back to the pool should cost about 1 microsecond if the pool is empty and there is no additional lock contention. Any latency on top of this simply means that the pool was not empty. The penalty is especially high if some of the requests already scheduled to the pool will eventually time out or be throttled. This leads to the topic of insufficient user isolation. The final goal of user isolation is:
If those conditions are met, rescheduling a continuation back to the pool does not affect the average latency of the system; it simply moves a quantum of latency from one request to another. When a continuation is rescheduled, whatever request is currently at the front of the queue gets an immediate performance boost. When we implement user isolation properly, we either account for and limit the threads used by each user, in which case we can block if we have to, or we account for how many pending continuations each user already has in the queue. I still see the second variant as the better one, because allocating threads per user is very inflexible.
About converting the code in service_entry_point_common.cpp from Future to SemiFuture: my understanding is that eventually at least some methods will have to be converted, but we should be careful with that. We should reschedule only code that is definitely blocked, and always avoid non-linear continuations for code that neither blocks nor waits on external processing. I don't fully understand how exactly to do that, and perhaps matthew.saltz can shed some light. We will also work on my pending changes for https://jira.mongodb.org/browse/SERVER-53505 offline, and perhaps I will get a better understanding of those futures tricks once that is finished.
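The second variant, accounting for pending continuations per user, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class name `PerUserContinuationQueue`, the per-user cap, and the use of a plain string as the user key are hypothetical and do not come from the MongoDB code base.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical continuation queue that caps pending work per user.
class PerUserContinuationQueue {
public:
    explicit PerUserContinuationQueue(std::size_t maxPendingPerUser)
        : _maxPendingPerUser(maxPendingPerUser) {}

    // Enqueues a continuation unless this user already has too many pending.
    // Returns false when the request should be throttled instead.
    bool trySchedule(const std::string& user, std::function<void()> task) {
        std::lock_guard<std::mutex> lk(_mutex);
        std::size_t& pending = _pending[user];
        if (pending >= _maxPendingPerUser) {
            return false;
        }
        ++pending;
        _queue.emplace_back(user, std::move(task));
        return true;
    }

    // Runs the continuation at the front of the queue on the calling thread.
    // Returns false if the queue was empty.
    bool runOne() {
        std::pair<std::string, std::function<void()>> item;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            if (_queue.empty()) {
                return false;
            }
            item = std::move(_queue.front());
            _queue.pop_front();
            std::size_t& pending = _pending[item.first];
            assert(pending > 0);  // every queued item was counted on entry
            --pending;
        }
        item.second();  // run outside the lock so tasks may reschedule
        return true;
    }

    // Number of continuations this user currently has in the queue.
    std::size_t pendingFor(const std::string& user) {
        std::lock_guard<std::mutex> lk(_mutex);
        auto it = _pending.find(user);
        return it == _pending.end() ? 0 : it->second;
    }

private:
    const std::size_t _maxPendingPerUser;
    std::mutex _mutex;
    std::deque<std::pair<std::string, std::function<void()>>> _queue;
    std::unordered_map<std::string, std::size_t> _pending;
};
```

Pool worker threads would call `runOne()` in a loop. A user who floods the queue is rejected at `trySchedule` while other users can still enqueue work, which is the isolation property discussed above; how a rejected request is then handled (throttled, failed, or retried) is left open here.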
|
| Comment by Andrew Shuvalov (Inactive) [ 13/Jan/21 ] |
|
Before discussing possible actions I would like to find the answers to the following questions:
User isolation questions:
I plan to give a presentation on thread pool usage and related user isolation strategies to discuss how this could be addressed, but first I need to figure out what we already have. |
| Comment by Andrew Shuvalov (Inactive) [ 13/Jan/21 ] |
|
The problems observed with possible refactoring:
|