Type: Task
Resolution: Done
Fix Version/s: WT2.4.0
Affects Version/s: None
Component/s: None
Labels: Performance
Sprint: None
Story Points: None

I ran into a curious scalability issue when running the wtperf small-btree workload. I was trying to see how I could improve in-cache reads and saw that we lose efficiency as we increase the number of threads, so I began investigating where this loss of efficiency comes from.

I found that as we add more threads we spend increasingly more time in the __wt_random function. This function is not used for searching the btree, but for generating keys in the wtperf benchmark. That is why I call this a "quasi-issue": real workloads probably won't be affected by what happens inside __wt_random, but if we were ever to evaluate or report scalability with wtperf, this issue could prevent us from seeing real bottlenecks, or the lack thereof. So I thought I'd mention it anyway.

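Long story short, the problem with __wt_random was that the variables m_w and m_z are shared by all threads. Every call writes both of them, so each thread that generates a key modifies memory that every other thread is also modifying: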

uint32_t
__wt_random(void)
{
    static uint32_t m_w = 521288629;
    static uint32_t m_z = 362436069;
    uint32_t w = m_w, z = m_z;

    m_z = z = 36969 * (z & 65535) + (z >> 16);
    m_w = w = 18000 * (w & 65535) + (w >> 16);
    return (z << 16) + (w & 65535);
}

I wanted to see what would happen if we used thread-local copies of those values instead. That was easy enough to implement in wtperf: I simply added variables m_w and m_z to a thread-local CONFIG struct. I also wrote a random function that is an exact copy of the one above, except that pointers to the thread-local copies of m_w and m_z are passed as arguments and the corresponding values are updated inside the function:

static uint32_t
__wtperf_random(uint32_t *m_w, uint32_t *m_z)
{
    *m_z = 36969 * (*m_z & 65535) + (*m_z >> 16);
    *m_w = 18000 * (*m_w & 65535) + (*m_w >> 16);
    return (*m_z << 16) + (*m_w & 65535);
}
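
To make the calling side concrete, here is a minimal sketch of how a worker might own and use its state. The names rng_state_sketch, rng_state_init, random_sketch and worker_draw are hypothetical and only for illustration; the actual change put m_w and m_z in the wtperf CONFIG struct, whose other fields are not shown here. The sketch assumes each thread is handed its own rng_state_sketch and never touches another thread's copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields added to the per-thread struct. */
typedef struct {
    uint32_t m_w;
    uint32_t m_z;
} rng_state_sketch;

/* Seed a state with the same constants the shared version starts from. */
static void
rng_state_init(rng_state_sketch *s)
{
    assert(s != NULL);
    s->m_w = 521288629;
    s->m_z = 362436069;
}

/* Same arithmetic as __wt_random, but operating on caller-owned state. */
static uint32_t
random_sketch(uint32_t *m_w, uint32_t *m_z)
{
    *m_z = 36969 * (*m_z & 65535) + (*m_z >> 16);
    *m_w = 18000 * (*m_w & 65535) + (*m_w >> 16);
    return (*m_z << 16) + (*m_w & 65535);
}

/*
 * What a worker loop might look like: draw n values using only the
 * state it was given, and return the last one drawn.
 */
static uint32_t
worker_draw(rng_state_sketch *s, size_t n)
{
    uint32_t last = 0;

    assert(s != NULL);
    for (size_t i = 0; i < n; i++)
        last = random_sketch(&s->m_w, &s->m_z);
    return last;
}
```

Because every state starts from the same seeds in this sketch, every worker would produce the same key sequence; whether that matters for the benchmark, and how to seed threads differently if it does, is a separate question that this sketch does not address.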

I ran some performance tests and am seeing a 3% improvement for a workload with 8 threads (2% stdev) and a 5% improvement for a workload with 12 threads (0.5% stdev). The gains will likely grow as we use more threads, but I have not yet had a chance to test this on a larger machine.

related to: WT-1224 Fix shared variable collisions in random number generation. (Closed)

Assignee: Alexandra (Sasha) Fedorova
Reporter: Keith Bostic (Inactive)
Votes: 0
Watchers: 1

Created: Sep 13 2014 11:27:43 AM UTC
Updated: Apr 16 2015 08:41:41 PM UTC
Resolved: Apr 09 2015 01:08:33 AM UTC
