Core Server / SERVER-69028

Collect thread migrations in FTDC

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Server Programmability

      Background

      Thread migrations happen when the kernel load balancer moves a scheduled process or thread from one runqueue to another, once the inefficiency of the unbalanced "load" outweighs the migration cost. A high frequency of thread migrations indicates that the threading model is suboptimal: one or more cores are left idle while others have a runqueue of jobs ready to run. This can happen in at least two cases: high lock contention and/or improper use of thread pools (too many thread pools, thread pools that are too large or too small, etc.).
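
      For validation, the kernel already maintains a per-thread migration counter. A minimal sketch for reading it on Linux (assuming a kernel that exposes the "se.nr_migrations" field in /proc, i.e. CONFIG_SCHED_DEBUG) could look like the following; this is only a cross-check for whatever we collect in FTDC, not part of the design:

      #include <fstream>
      #include <string>
      #include <sys/syscall.h>
      #include <unistd.h>

      // Reads the kernel-maintained migration count for the calling thread,
      // or -1 if the field is unavailable.
      long long kernelThreadMigrations() {
          const long tid = syscall(SYS_gettid);
          std::ifstream sched("/proc/self/task/" + std::to_string(tid) + "/sched");
          std::string line;
          while (std::getline(sched, line)) {
              // The line looks like: "se.nr_migrations    :    42"
              if (line.rfind("se.nr_migrations", 0) == 0) {
                  const auto pos = line.find(':');
                  if (pos != std::string::npos)
                      return std::stoll(line.substr(pos + 1));
              }
          }
          return -1;
      }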

      Motivation

      We know our threading model is suboptimal: we have a thread-per-request model with too many auxiliary thread pools. The approximate roadmap to fix that is:
      1. Consolidate thread pools
      2. Create a special thread pool for blocking calls (see my presentation for details)
      3. Migrate to a more asynchronous model
      4. Design load-based admission control to reject requests that cannot be executed soon
      5. Design proper token-bucket based user isolation

      For all of those tasks we need proper measurements. Benchmarking the code during development is time-consuming and not strictly necessary. Using thread migrations as a quick negative signal is easy and more productive: a low frequency of thread migrations is not sufficient to show that the threading model is good, but a high frequency is always bad. When this signal looks good, others can be used (profiling, lock contention measurements, etc.).

      Approximate design

      The current core can be detected with code like the following:

      #include <x86intrin.h>  // for __rdtscp (use <intrin.h> on MSVC)

      // Returns an identifier of the CPU the calling thread is currently running on.
      // RDTSCP reads the TSC and writes the IA32_TSC_AUX MSR into 'id'; on Linux the
      // kernel encodes the CPU number in the low 12 bits of that MSR (NUMA node above).
      static unsigned getCoreId() {
          unsigned id;
          __rdtscp(&id);
          return id;
      }
      
      

      The Intel documentation states that RDTSCP is not a serializing instruction; it only "waits until all previous instructions have executed", and other references confirm this. Thus we probably should not need to worry about the performance implications; CC amirsaman.memaripour@mongodb.com to confirm.

      It is very convenient to use a thread local to track the last observed core: the thread-local variable migrates to the new core together with the thread. The implementation should query the current core sufficiently often and increment a thread-local counter to accumulate the observed migrations. In production we may see involuntary context switches on the order of 200k per second, which suggests it will be sufficient to query the current core inside a new listener on `_onContendedLock()` and again on `_onUnlock()`. It may be cheaper to add a callback `_onContendedUnlock()` instead, because a migration is unlikely to happen if the current thread was not put to sleep. Remember, a thread migration happens only while the thread is on a runqueue.
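
      A minimal sketch of the thread-local tracking, built on the getCoreId() helper above; the listener callbacks from which observeCurrentCore() would be called are the `_onContendedLock()`/`_onUnlock()` hooks mentioned above, and the exact interface is still to be decided:

      #include <cstdint>

      struct ThreadMigrationState {
          unsigned lastCoreId = getCoreId();  // core observed at the previous check
          std::uint64_t migrations = 0;       // migrations accumulated since the last flush
      };
      thread_local ThreadMigrationState threadMigrationState;

      // Called from the lock listener callbacks: re-query the current core and
      // count a migration if it changed since the last observation.
      void observeCurrentCore() {
          const unsigned core = getCoreId();
          if (core != threadMigrationState.lastCoreId) {
              threadMigrationState.lastCoreId = core;
              ++threadMigrationState.migrations;
          }
      }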

      Collection

      Collection requirements are:
      1. Do it as simply as possible for now, keeping future improvements in mind
      2. Never increment a global counter per event; that is very expensive. Accumulate in a thread local and flush periodically
      3. The thread-local counter flush can be attached to a RAII decorator on the opCtx; this may increment the global counter, but probably no more than ~3k times/s (and many flushes will be 0)
      4. Keep in mind that in the future we would like to collect some stats per user and per op; this counter is a good candidate

      We should accumulate the current migration count in a thread local and flush it when the opCtx is created and destroyed. A longer-term task will be to bucket this count per command and per user, which will give us insight into which commands and which users are associated with the most thread migrations. It may also be used for better user isolation in the future: the user creating the most thread migrations should be the first to be throttled.
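
      A minimal sketch of the flush path, assuming the thread-local state above. In the server this would more likely be a Decoration on OperationContext plus an FTDC-exported AtomicWord; std::atomic is used here only to keep the sketch self-contained:

      #include <atomic>
      #include <cstdint>

      std::atomic<std::uint64_t> globalThreadMigrations{0};  // value reported via FTDC

      // RAII helper tied to the opCtx lifetime: flushes the thread-local accumulator
      // into the global counter at most once per operation (~3k times/s), and touches
      // the shared counter only when there is something to flush.
      class ThreadMigrationFlusher {
      public:
          ~ThreadMigrationFlusher() {
              const auto pending = threadMigrationState.migrations;
              if (pending != 0) {
                  threadMigrationState.migrations = 0;
                  globalThreadMigrations.fetch_add(pending, std::memory_order_relaxed);
              }
          }
      };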

      Flushing this counter on opCtx boundaries is easy as long as we use the thread-per-connection model, but it will break later. For the asynchronous model we will need to flush on ThreadClient destruction, and also when the thread is recycled in the thread pool. That work is conditional on observing that this is a useful signal in production.

      Roadmap

      1. Implement a simple solution (this ticket)
      2. Observe the value in stress tests, benchmarks, and incident investigations
      3. If a good signal is observed, implement the wrapper for blocking calls as described in the presentation and test it with the stress tests where the signal was good
      4. Optimize the thread pool code using the wrapper implementation and the tests where the good signal was observed
      5. Work on the roadmap toward a more asynchronous model

            Assignee: Unassigned
            Reporter: Andrew Shuvalov (andrew.shuvalov@mongodb.com)
            Votes: 0
            Watchers: 10