The issue was initially reported as an unusually long load phase in py-tpcc workloads. It appears intermittently: a few inserts in the load phase get stuck for more than a few hours. The related PERF and HELP tickets have more information on the history of the issue.
bruce.lucas has come up with a standalone reproducer, attached to the ticket.
When inserting into a unique index, a thread can get stuck repeatedly searching the history store, i.e. calling __wt_hs_find_upd. We see a very high "history store table reads missed" statistic in these runs, which indicates that these searches through the history store are not returning anything. A callgraph that reflects this situation:
Behaviour with the repro script:
- Observed in 4.4: the insert rate is erratic, and a couple of the threads typically get stuck, apparently indefinitely, with a high rate of missed history store reads and stacks like the above.
- Expected (same as observed in 4.2): the insert rate is steady and each thread completes at about the same time.
- The py-tpcc load phase should not get stuck.
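
One way to confirm that a run is hitting the missed-read pattern described above is to sample the WiredTiger cache statistics twice and compare the growth in history store reads with the growth in missed reads. The sketch below is a minimal illustration, not part of the attached reproducer. The helper `hs_miss_ratio` is hypothetical, the key names are assumed to match the cache section of serverStatus output and should be checked against the build under test, and the sample values are made up.

```python
# Assumed statistic names under serverStatus()["wiredTiger"]["cache"];
# verify them against the server version being tested.
HS_READS = "history store table reads"
HS_MISSED = "history store table reads missed"


def hs_miss_ratio(before, after):
    """Fraction of history store reads that missed between two samples.

    Both arguments are dicts of cumulative counters taken at different
    times. Returns None when no history store reads happened in between.
    """
    reads = after.get(HS_READS, 0) - before.get(HS_READS, 0)
    missed = after.get(HS_MISSED, 0) - before.get(HS_MISSED, 0)
    if reads <= 0:
        return None
    return missed / reads


# Made-up counter values, for illustration only.
before = {HS_READS: 1000, HS_MISSED: 900}
after = {HS_READS: 5000, HS_MISSED: 4800}
print(hs_miss_ratio(before, after))
```

A ratio that stays close to 1 while the insert rate stalls would be consistent with threads repeatedly searching the history store without finding anything.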