Loading...

Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Btree, Statistics
Labels:
None

Assigned Teams:

Storage Engines - Foundations
Total Hours with Assigned Team:
1,310.388
Sprint:
None
Story Points:
None

This ticket is to implement an amortized way of collecting statistics about a Btree. There's a very cheap way to check for an occurrence that fires randomly 1/1000 or one in 1/10000 (we can really pick any number, lets say 1/T for one in a thousand in this discussion). The check is a decrement and test against zero of a session field, that's it. With a check that cheap, instrumentation can be put even in a hot path, for example: 1/T times in accessing a btree page, atomically increment a btree counter whether this is in service of an insert/update/remove/etc. and if it is on the rightmost node at a level (to check for tail insert or tail read patterns), record data size and variance, lots of things. To cut down on latency of collection lots of info at one location, we can divide it up. That is, if we fire on the 1/T check, then 1/4 of those times, collect some statistics, 1/4 collect others, etc.

I'm thinking of 10-20 or so such statistics to start. Some issues: One is that adding these as "regular" statistics on a btree has an outsized impact. Currently, to avoid contention, we have a hashed array of stats, sized 23. With 20 stats * 8 bytes per * 23 fanout =~ 3.5k bytes per btree. I love my stats, but having 100K Btrees in some deployments is a third of a gigabyte extra overhead that could otherwise be used for cache, etc. Now that will have a perf impact. But since we're already doing a 1/T check, we don't have to do the fanout. We could have a special section of btree stats that doesn't use the fanout - but I'd advocate something different: having a specialized set of 32bit stats attached to the btree (again, real cheap, let's collect lots). This will make more sense in a minute.

Issue two is that especially with a large number of btrees, mongod is not collecting/exposing btree stats. It's just too much. So even if a customer has < 100 btrees, we have no idea of the shapes of the heavily used ones, and this lowers our ability to help customers. So here's the second trick. Let's have one of the stats represent something like total accesses. Then, we piggyback on the connection dhandle sweep (encapsulating of course), every visit to a btree dhandle, we look at the total and ask - is this in the top 16 that I've seen in this sweep? If it is, we save the btree's chunk of stats, potentially knocking out whatever was in 16th place. At the end of the sweep, we have the top 16 - and we push all of them into new connection statistics (16 best * 20 stats == 320 atomic increments). BTW this need not run every dhandle sweep, perhaps a cadence of every 30 seconds is good.

The end result is that we can look at stats that tell us, for the 16 most heavily accessed btrees, for each btree, are there heavy insertions on right side? read accesses on right? avg key and data size, ratio of reads/writes? In today's world we might see a mix of 80% reads, 20% writes and guess at conclusions. But suppose one big btree is 90% writes, and lots of little btrees are getting 99% reads - that paints a different picture. BTW, maybe we also collect, at a lower frequency an aggregate "all-the-rest" stats for those not in the top 16 - that could be more expensive to calculate, but might provide some coverage for installations with huge numbers of btrees. (A more sophisticated approach could classify bunches of btrees by statistic shapes - could be a followup).

The cheap random checker is pretty simple: put a signed countdown counter on session, say WT_SESSION.hotpath_counter_1000. The checker needs a support function. Together they are:

/*
 * Returns a somewhat random number between 0 and n-1.
 * Not a strong RNG, and delivers poor results if n is large
 * and/or this is called with very high frequency.
 */
static WT_INLINE uint64_t
wt_random_hotpath(uint16_t n)
{
    /* CPU cycle counter, quite fast */
    return (__wt_rdtsc() % n);
}

/*
 * Fires 1/1000 times, randomly.  Not a strong RNG, but
 * useful for doing low overhead amortization.
 */
static WT_INLINE bool
__wt_random_hotpath_1000(WT_SESSION *session)
{
    if (WT_UNLIKELY(--session->hotpath_counter_1000 < 0)) {
        session->hotpath_counter_1000 = __wt_random_hotpath(2000);
        return (true);
    }
    return (false);
}

Some final notes:

If I were to increment a counter, I'll want to increment by the counter period. E.g.:

/* ... in insert path ... */
if (WT_UNLIKELY(__wt_random_hotpath_1000(session))) {
   atomic_increment(&btree->insert_counter, 1000);
}

We're incrementing by 1000, this is helpful to understand ratios between metrics correctly - we may have some counters that happen so rarely (unusual conditions or errors), that we want to increment them always, and by 1.

We'll always use WT_UNLIKELY with this in the hotpath.

Also hotpath_1000 can be called for multiple purposes - for example, it could be used for the btree collection here. But some other - say transaction stat collecting that want to fire 1/1000 times, can use the same function and session field, there shouldn't be any interference.

There's nothing magic about 1000, it could be 10000 or 128, etc. But the number should not be too high (less randomness), and powers of 2 are a little more performant.