- Type: Task
- Resolution: Done
- Affects Version/s: None
- Component/s: None
fedorova says:
I discovered that txn_global->scan_count updates limit performance on read-only workloads.
The updates and reads are performed in txn.c, at lines 110, 113 and 128, in the __wt_txn_refresh function.
Perf output shows that with 2 threads __wt_txn_refresh is not a major contributor to the profile:
62.31% db_bench_wiredt libwiredtiger-1.6.7.so [.] __wt_row_search
5.14% db_bench_wiredt libc-2.17.so [.] vfprintf
4.88% db_bench_wiredt libwiredtiger-1.6.7.so [.] __wt_hazard_clear
4.28% db_bench_wiredt libwiredtiger-1.6.7.so [.] __clsm_search
But with 16 threads, it is:
38.12% db_bench_wiredt libwiredtiger-1.6.7.so [.] __wt_row_search
21.24% db_bench_wiredt libwiredtiger-1.6.7.so [.] __wt_txn_refresh
15.04% db_bench_wiredt libwiredtiger-1.6.7.so [.] __clsm_search
6.61% db_bench_wiredt libwiredtiger-1.6.7.so [.] __wt_eviction_check
2.91% db_bench_wiredt libwiredtiger-1.6.7.so [.] __wt_btcur_search
I used my data-sharing tool to determine that the variable txn_global->scan_count is the problem.
We atomically update the scan_count to prevent txn_global->oldest_id from moving forward as we are scanning the tree. Correct me if I'm wrong, but oldest_id does not change if we have a purely read-only workload. So I'm thinking that perhaps we can add a special-case "hack" for read-only workloads here?
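For context, the pattern I'm referring to is roughly the following (my own simplified illustration, written with GCC atomic builtins rather than the actual WiredTiger atomics macros):

/*
 * Simplified illustration, not the actual WiredTiger code: each reader
 * bumps scan_count before scanning the global transaction state, and the
 * code that advances txn_global->oldest_id waits for scan_count to drain
 * back to zero before moving it.
 */
(void)__sync_add_and_fetch(&txn_global->scan_count, 1);
/* ... scan the running transactions, compute snap_min ... */
(void)__sync_sub_and_fetch(&txn_global->scan_count, 1);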
In particular, __wt_txn_refresh runs the following code after incrementing txn_global->scan_count:
/* The oldest ID cannot change until the scan count goes to zero. */
prev_oldest_id = txn_global->oldest_id;
current_id = snap_min = txn_global->current;
/* For pure read-only workloads, use the last cached snapshot. */
if (get_snapshot &&
    txn->id == max_id &&
    txn->snapshot_count == 0 &&
    txn->snap_min == snap_min &&
    TXNID_LE(prev_oldest_id, snap_min))
This is a "shortcut" for read-only workloads. This shortcut runs after we increment scan_count, so it doesn't help. But perhaps we could, with a little hack, execute this shortcut earlier, before we increment the scan_count.
Perhaps we could do something like this: before incrementing the scan count, attempt to execute that if-statement. Of course, we need to ensure that oldest_id hasn't changed underneath us, so at the end of the if-statement we could read the value of txn_global->oldest_id again and compare it to the value we read earlier, prev_oldest_id. If the value has not changed, the if-statement comparison was valid, and we can return from the function without ever incrementing scan_count. If it has changed, we were wrong to assume this was a read-only workload, so we execute the slow path, i.e., increment scan_count and only then proceed to the if-statement above.
We may need to save and re-check a few other variables as well. And this hack assumes that transaction IDs are incremented monotonically (to avoid the ABA problem), and that a load is an atomic operation (which I think is a safe bet).
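To make that concrete, here is a rough, untested sketch of the fast path I have in mind. The identifiers are the ones from the snippet above; the placement, the re-check, and the early return are my additions (and I'm glossing over the function's return convention and whatever extra variables we decide to re-validate):

/*
 * Untested sketch: try the read-only shortcut before touching
 * txn_global->scan_count.
 */
prev_oldest_id = txn_global->oldest_id;        /* plain load, assumed atomic */
current_id = snap_min = txn_global->current;

if (get_snapshot &&
    txn->id == max_id &&
    txn->snapshot_count == 0 &&
    txn->snap_min == snap_min &&
    TXNID_LE(prev_oldest_id, snap_min) &&
    txn_global->oldest_id == prev_oldest_id) {
        /*
         * The re-read of oldest_id matches prev_oldest_id, so the checks
         * above were evaluated against a stable oldest ID: reuse the
         * cached snapshot (as the existing shortcut does) and return
         * here without ever incrementing scan_count.  A read barrier may
         * be needed so the compiler doesn't optimize the re-read away.
         */
        return;
}

/*
 * Slow path: increment scan_count as we do today and fall through to
 * the existing code, which re-runs the same check under its protection.
 */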
WDYT?
Note that I have not actually tried this hack yet to see whether it breaks anything. I wanted to run it by you first, to see whether you would dismiss it outright as wrong, and whether you have better ideas for fixing this bottleneck.
Let me know if you need more information.
- is related to: WT-794 Eviction preventing the fast path for read-only workloads in __wt_txn_refresh (Closed)