[SERVER-69028] Collect thread migrations in FTDC Created: 21/Aug/22  Updated: 02/Feb/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Service Arch
Participants:

 Description   

Background

Thread migrations happen when the kernel load balancer moves a scheduled process/thread from one runqueue to another because the inefficiency of the unbalanced "load" is higher than the migration cost. A high frequency of thread migrations indicates that the threading model is suboptimal: one or more cores are left idle while others have a runqueue of jobs ready to run. This can happen in at least two cases: high lock contention and/or improper use of thread pools (too many thread pools, thread pools that are too large or too small, etc.).

Motivation

We know our threading model is suboptimal. We have a thread-per-request model with too many auxiliary thread pools. The approximate roadmap to fix that is:
1. Consolidate thread pools
2. Create a special thread pool for blocking calls (see my presentation for details)
3. Migrate to a more asynchronous model
4. Design load-based admission control to reject requests that cannot be executed soon
5. Design proper token-bucket based user isolation

For all of those tasks we need proper measurements. Benchmarking the code during development is time consuming and not necessary. Using thread migrations as a quick negative signal is easier and more productive. A low frequency of thread migrations is not sufficient to indicate that the threading model is good, but a high frequency is always bad. When this signal is good, others can be used (profiling, lock contention measurements, etc.).

Approximate design

The current core can be detected with code along these lines:

#include <x86intrin.h>  // __rdtscp

// Reads the IA32_TSC_AUX register via RDTSCP; on Linux the kernel stores the
// current CPU number (and NUMA node) there, so a changed value means the
// thread has been migrated.
static unsigned getCoreId() {
    unsigned id;
    __rdtscp(&id);
    return id;
}

The linked references claim that __rdtscp is not a serializing instruction; it only "waits until all previous instructions have executed". Thus we probably should not worry about the performance implications; CC amirsaman.memaripour@mongodb.com to confirm.
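
As a sanity check on the cost question, a throwaway microbenchmark along these lines (illustrative only, not part of the proposed change) can estimate the per-call overhead of __rdtscp:

#include <chrono>
#include <cstdio>
#include <x86intrin.h>

int main() {
    constexpr int kIters = 10'000'000;
    volatile unsigned sink = 0;  // prevent the compiler from removing the calls
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        unsigned aux;
        __rdtscp(&aux);
        sink = aux;
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("~%.1f ns per __rdtscp call\n", double(ns) / kIters);
    return 0;
}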

It is very convenient to use a thread-local variable to track the last observed core: the thread-local variable migrates to the new core together with the thread. The implementation should query the current core sufficiently often and increment a thread-local counter to accumulate the observed migrations. In production we may observe involuntary context switches on the order of 200k per second, which suggests it will be sufficient to query the current core inside a new listener on `_onContendedLock()` and then on `_onUnlock()`. Perhaps it would be cheaper to add a callback `_onContendedUnlock()`, because a migration is unlikely to happen if the current thread was not put to sleep. Remember, a thread migration happens only while the thread is on a runqueue.
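
A minimal sketch of the thread-local tracking described above; the hook names come from this ticket and are not existing server APIs:

#include <cstdint>
#include <x86intrin.h>

thread_local unsigned lastCoreId = ~0u;          // core observed at the previous check
thread_local std::uint64_t localMigrations = 0;  // migrations accumulated by this thread

static unsigned getCoreId() {
    unsigned id;
    __rdtscp(&id);  // IA32_TSC_AUX changes when the thread runs on a different CPU
    return id;
}

// Intended to be called from the proposed listener hooks, e.g. _onContendedLock()
// and _onUnlock() (hypothetical hook names from this ticket).
static void checkForMigration() {
    unsigned current = getCoreId();
    if (lastCoreId != ~0u && current != lastCoreId)
        ++localMigrations;  // thread-local only; never touch a global counter on this path
    lastCoreId = current;
}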

Collection

Collection requirements are:
1. Keep it as simple as possible for now, with future improvements in mind
2. Never increment a global counter on the hot path; that is very expensive. Accumulate in a thread-local and flush periodically
3. The thread-local counter flush can be attached to an RAII decorator on the opCtx; this may increment the global counter, but probably no more than ~3k times/s (and many flushes will be 0)
4. Keep in mind that in the future we would like to collect some stats per user and per op; this is a good candidate

We should accumulate the current migration count in a thread-local and flush it when the opCtx is created and destroyed. A longer-term task will be to bucket this count by command and by user. This will give us insight into which commands and which users are associated with the most thread migrations. It may also be used for better user isolation in the future: the user generating the most thread migrations should be the first to be throttled.

Flushing this counter via the opCtx is easy as long as we use the thread-per-connection model, but that will break later. For a more asynchronous model we will need to flush on ThreadClient destruction, and then when the thread is recycled in the thread pool. This is conditional on observing that this is a useful signal in production.
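
A sketch of the flush path, assuming the thread-local counter from the previous sketch; the FTDC-facing counter and the RAII attachment to the opCtx are illustrative only (in the real server this would likely be an OperationContext decoration):

#include <atomic>
#include <cstdint>

extern thread_local std::uint64_t localMigrations;  // defined in the sketch above

// Single process-wide counter that FTDC would sample periodically.
std::atomic<std::uint64_t> totalThreadMigrations{0};

// RAII helper tied to the opCtx lifetime (hypothetical name).
class MigrationCounterFlusher {
public:
    MigrationCounterFlusher() { flush(); }   // flush on opCtx creation
    ~MigrationCounterFlusher() { flush(); }  // flush on opCtx destruction

private:
    void flush() {
        if (std::uint64_t n = localMigrations) {  // usually 0, so the atomic add is rare
            totalThreadMigrations.fetch_add(n, std::memory_order_relaxed);
            localMigrations = 0;
        }
    }
};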

Roadmap

1. Implement a simple solution (this ticket)
2. Observe the value in stress tests, benchmarks, and help incidents
3. If a good signal is observed, implement the wrapper for blocking calls as described in the presentation and test it with the stress tests where the signal was good
4. Optimize the thread pool code using the wrapper implementation and the tests where the good signal was observed
5. Work on the roadmap for a more asynchronous model



 Comments   
Comment by Bruce Lucas (Inactive) [ 24/Aug/22 ]

Very good, thanks for the confirmation.

Comment by Andrew Shuvalov (Inactive) [ 23/Aug/22 ]

bruce.lucas@mongodb.com yes, not in this ticket. The idea of per-user counters is not to expose them in FTDC but to use them in our future user isolation implementation. As we don't have one, there is no point in adding them now. We need to assemble a list of ~3 different metrics that pinpoint abusive users and use them in user isolation. When we decide to do that, thread migrations will be one of those 3.

The per-op metric is strictly for manual investigations and can be exposed with an additional verbose field in `serverStatus`. We should never have this kind of granularity in default FTDC. So yes, just 1 new metric.

Comment by Bruce Lucas (Inactive) [ 23/Aug/22 ]

Adding a single counter to FTDC sounds reasonable. I would be concerned about adding anything per-command or per-user to FTDC because of the volume of counters it could create.
