  1. Core Server
  2. SERVER-5905

Add data collection and command to get histogram of query response times



    • Backwards Compatibility:
      Fully Compatible
    • Sprint:
      Integrate+Tuning 16 (06/24/16), Integration 17 (07/15/16)


      We should have instrumentation to characterize the overall workload and response time of a mongod or mongos server. A histogram of response times in microseconds, with bucket boundaries at powers of 2, would be a nice start. Here's a straw man proposal:

      1) For every request from a client, log the time it was received using the least expensive high resolution method. On Windows, this would be QueryPerformanceCounter().
      2) When the response is complete, compute the elapsed time in microseconds. On Windows, this would be another call to QueryPerformanceCounter() and division by a precomputed conversion factor.
      3) Add 1 to the bucket associated with this time interval. Bucket 0 gets all times below 1 microsecond, bucket 1 gets times from 1 to 2 microseconds, bucket 2 gets times from 2 to 4 microseconds, then 4 to 8, etc., with each range including its lower bound and excluding its upper bound. The 31 buckets numbered 1 through 31 would cover times up to 2^31 microseconds (about 2147 seconds), and anything taking longer than that would also go in the last bucket, so 32 buckets in total would cover the time periods we are most interested in.
      4) Every 10 seconds, add the current histogram to a "since started" histogram, write it to a capped collection sized for one week of data, save a snapshot copy, and then zero the current histogram.
      5) Provide $cmd commands to fetch the most recent snapshot and the "since started" histogram.
      6) Give MMS the ability to show the most recent snapshot and the "since started" histogram.
      7) For extra credit, MMS could show a contour plot or some other 3D display of response time history, showing the changing shape of the curve.
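
      Steps 1 through 4 can be sketched roughly as follows. This is only an illustration in Python, not the server implementation: time.perf_counter_ns() stands in for QueryPerformanceCounter(), the names bucket_index and ResponseTimeHistogram are made up for this example, elapsed times are assumed to be whole microseconds with each bucket including its lower bound, the capped-collection write is left out, and the class is not thread-safe.

```python
import time

NUM_BUCKETS = 32  # buckets 0..31, as in the straw man proposal


def bucket_index(elapsed_us: int) -> int:
    """Map an elapsed time in whole microseconds to a log2 bucket.

    0 -> bucket 0, 1 -> bucket 1, 2..3 -> bucket 2, 4..7 -> bucket 3, etc.
    Anything at or beyond 2**31 microseconds is clamped into bucket 31.
    """
    if elapsed_us < 0:
        raise ValueError("elapsed time must be non-negative")
    # For a non-negative int, bit_length() equals floor(log2(n)) + 1 (0 for n == 0).
    return min(elapsed_us.bit_length(), NUM_BUCKETS - 1)


class ResponseTimeHistogram:
    """Hypothetical holder for the current, snapshot and since-started counts."""

    def __init__(self) -> None:
        self.current = [0] * NUM_BUCKETS
        self.snapshot = [0] * NUM_BUCKETS
        self.since_started = [0] * NUM_BUCKETS

    def record(self, elapsed_us: int) -> None:
        # Step 3: add 1 to the bucket for this interval.
        self.current[bucket_index(elapsed_us)] += 1

    def timed(self, handler, *args):
        # Steps 1 and 2: timestamp on receipt, compute elapsed time on completion.
        start_ns = time.perf_counter_ns()
        try:
            return handler(*args)
        finally:
            self.record((time.perf_counter_ns() - start_ns) // 1000)

    def rotate(self) -> list:
        # Step 4, minus the capped-collection write: fold the current counts
        # into the since-started totals, keep them as the snapshot, then zero.
        for i, count in enumerate(self.current):
            self.since_started[i] += count
        self.snapshot = self.current
        self.current = [0] * NUM_BUCKETS
        return self.snapshot
```

      A caller would wrap each request in timed() and invoke rotate() from a 10-second timer. Using bit_length() avoids floating-point log2 and keeps the per-request cost to a few integer operations.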

      Once the baseline functionality is working, we could consider doing this by database, by collection, by request type or by some other criterion. These would be additional instances of the same feature.
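
      As a rough sketch of what "additional instances of the same feature" could look like, the same bucket counts can be kept once per key. The class name KeyedHistograms and the key shapes (a namespace string, or a namespace and request type pair) are assumptions made for this illustration only.

```python
from collections import defaultdict

NUM_BUCKETS = 32  # same bucket layout as the server-wide histogram


def bucket_index(elapsed_us: int) -> int:
    # Same log2 mapping as the server-wide histogram: 0 -> 0, 1 -> 1, 2..3 -> 2, etc.
    return min(elapsed_us.bit_length(), NUM_BUCKETS - 1)


class KeyedHistograms:
    """Hypothetical: one histogram per database, collection or request type."""

    def __init__(self) -> None:
        self._by_key = defaultdict(lambda: [0] * NUM_BUCKETS)

    def record(self, key, elapsed_us: int) -> None:
        self._by_key[key][bucket_index(elapsed_us)] += 1

    def get(self, key) -> list:
        # Return a copy; unknown keys read as an all-zero histogram
        # without creating an entry.
        return list(self._by_key.get(key, [0] * NUM_BUCKETS))


# Example keys are illustrative, not names used by the server.
hists = KeyedHistograms()
hists.record(("test.users", "query"), 150)
hists.record(("test.users", "query"), 200)
hists.record(("test.users", "insert"), 5000)
```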

      There are a lot of things that we could learn by having this information:
      1) If a query was slow at one time but not at another, was there a difference in the number of requests it was competing with in the two cases?
      2) Is a workload doing mostly very fast stuff with a little slow stuff, or is everything slow?
      3) Does a change to something in the system change the mix of response times?
      4) Do response times follow a recognizable pattern, like a bell curve with a visible center, or a skew towards fast responses, or a curve with multiple peaks?
      5) Is anything really fast, or is the minimum response time in the millisecond and above range?
      6) Do we have periods with little visible activity followed by periods when many slow requests complete?
      7) Does the addition of a new application, or a new shard, or a new mongos change the response time pattern?
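
      Questions 2 and 5 above can be read directly off a snapshot. The helpers below are a minimal sketch assuming the 32-bucket layout from the proposal, with each bucket including its lower bound; the function names are invented for this example.

```python
def bucket_bounds_us(i: int, num_buckets: int = 32):
    """Return (low, high) in microseconds; high is None for the open-ended last bucket."""
    low = 0 if i == 0 else 2 ** (i - 1)
    high = None if i == num_buckets - 1 else 2 ** i
    return low, high


def fastest_bucket(counts):
    """Index of the lowest non-empty bucket, or None if nothing was recorded."""
    for i, count in enumerate(counts):
        if count:
            return i
    return None


def fraction_below(counts, limit_us: int) -> float:
    """Fraction of requests in buckets that lie entirely below limit_us."""
    total = sum(counts)
    if total == 0:
        return 0.0
    fast = 0
    for i, count in enumerate(counts):
        high = bucket_bounds_us(i, len(counts))[1]
        if high is not None and high <= limit_us:
            fast += count
    return fast / total


# Made-up snapshot: 90 requests in bucket 8, 10 requests in bucket 15.
counts = [0] * 32
counts[8] = 90
counts[15] = 10
```

      Because only bucket counts are kept, these answers are accurate to within a factor of 2, which is the resolution the proposal asks for.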

      The better we can characterize workloads and our response to them, the better we can diagnose problems and propose solutions. All to the good.





              kevin.albertson Kevin Albertson
              tad Tad Marshall