I ran into a curious scalability issue when running the wtperf small-btree workload. I was trying to see how I could improve in-cache reads and saw that we lose efficiency as we increase the number of threads. So I began investigating where this loss of efficiency comes from.
I found that as we add more threads we spend increasingly more time in the __wt_random function. This function is not used for searching the btree, but for generating keys in the wtperf benchmark. That is why I call this a "quasi-issue", because real workloads probably won't be affected by what happens inside random(), but if we were ever to evaluate or report scalability with wtperf, this issue could prevent us from seeing real bottlenecks or lack thereof. So I thought I'd mention this to you anyway.
Long story short, the problem with __wt_random was the sharing of variables m_w and m_z by different threads:
static uint32_t m_w = 521288629;
static uint32_t m_z = 362436069;
uint32_t w = m_w, z = m_z;
m_z = z = 36969 * (z & 65535) + (z >> 16);
m_w = w = 18000 * (w & 65535) + (w >> 16);
return (z << 16) + (w & 65535);
I wanted to see what would happen if we used thread-local copies of those values instead. That was easy enough to implement in wtperf: I simply added variables m_w and m_z to a thread-local CONFIG struct. I also wrote a random function that is an exact copy of the one above, except that pointers to the thread-local copies of m_w and m_z are passed as arguments and the corresponding values are then manipulated inside the function:
__wtperf_random(uint32_t *m_w, uint32_t *m_z)
*m_z = 36969 * (*m_z & 65535) + (*m_z >> 16);
*m_w = 18000 * (*m_w & 65535) + (*m_w >> 16);
return (*m_z << 16) + (*m_w & 65535);
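For reference, here is a minimal self-contained sketch of the per-thread variant described above. The names rnd_state, rnd_state_init and wtperf_random_sketch are hypothetical and used only for illustration; in the actual change the two values live in the thread-local CONFIG struct, whose layout is not shown here. The seeds are the same constants the shared generator starts from.

```c
#include <stdint.h>

/*
 * Hypothetical per-thread generator state. Each worker thread owns one
 * instance, so no thread writes to memory that another thread reads.
 */
struct rnd_state {
    uint32_t m_w;
    uint32_t m_z;
};

/* Seed a state with the same constants the shared generator uses. */
static void
rnd_state_init(struct rnd_state *s)
{
    s->m_w = 521288629;
    s->m_z = 362436069;
}

/*
 * Same multiply-with-carry step as the shared version, but it updates the
 * caller's state through pointers instead of touching static variables.
 */
static uint32_t
wtperf_random_sketch(uint32_t *m_w, uint32_t *m_z)
{
    *m_z = 36969 * (*m_z & 65535) + (*m_z >> 16);
    *m_w = 18000 * (*m_w & 65535) + (*m_w >> 16);
    return (*m_z << 16) + (*m_w & 65535);
}
```

Each thread would call rnd_state_init once on its own struct and then call wtperf_random_sketch(&s.m_w, &s.m_z) for every key. Note that if every thread starts from the same seeds, as in this sketch, all threads generate the same key sequence; whether that matters depends on the workload being measured.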
I ran some performance tests and I am seeing a 3% improvement for a workload with 8 threads (2% stdev) and a 5% improvement for a workload with 12 threads (0.5% stdev). The gains will likely get larger as we use more threads, but I have not yet had a chance to test this on a larger machine.