In WT-3207 we fixed a situation where a thread could spin on a handle lock during checkpoints (including while holding the schema lock, blocking many other operations).
It appears that there may be a similar (but less common) source of stalls during checkpoints: a recent case showed stalls with the fix for WT-3207 in place.
- In every case there was a failed table drop, followed by the closing of all cursors, and then a stall until the end of the checkpoint.
- The stall coincides with very high CPU utilization and context switch rate, and notably 3 M "pthread mutex shared lock write-lock calls" per second for the duration of the stall.
- Unlike before, "time waiting for the table lock" never budges from 0, so I guess that counter is no longer hooked up in the patch build?
Looking at the code for that counter, one thing that could explain this is a call to __wt_try_writelock in a tight loop. This appears to be a pure CPU loop, i.e. no calls to sched_yield, as we don't see kernel CPU utilization.
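
For illustration only, here is a minimal sketch of the difference between a pure CPU try-lock loop and one that backs off. It uses the POSIX pthread_rwlock_trywrlock call as a stand-in for __wt_try_writelock. The function name lock_with_backoff and the SPIN_LIMIT and YIELD_LIMIT constants are hypothetical, and this is not WiredTiger's actual code. A loop that never leaves the "spin" branch would match the symptoms above: millions of write-lock attempts per second with no sched_yield calls.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical tuning constants for this sketch, not WiredTiger values. */
#define SPIN_LIMIT 1000UL  /* attempts made as a pure CPU spin */
#define YIELD_LIMIT 2000UL /* attempts after which we sleep instead of yielding */

/*
 * lock_with_backoff --
 *     Acquire a write lock by polling a try-lock, backing off as failed
 *     attempts accumulate. Returns 0 once the lock is held, or the first
 *     error other than EBUSY. If attemptsp is non-NULL, it receives the
 *     number of try-lock calls made.
 */
static int
lock_with_backoff(pthread_rwlock_t *lock, unsigned long *attemptsp)
{
    struct timespec pause = {0, 100000}; /* 100 microseconds */
    unsigned long attempts;
    int ret;

    assert(lock != NULL);

    for (attempts = 1;; ++attempts) {
        ret = pthread_rwlock_trywrlock(lock);
        if (ret != EBUSY)
            break; /* acquired the lock, or hit a real error */

        if (attempts < SPIN_LIMIT)
            continue; /* phase 1: spin, staying on the CPU */
        if (attempts < YIELD_LIMIT)
            (void)sched_yield(); /* phase 2: let other threads run */
        else
            (void)nanosleep(&pause, NULL); /* phase 3: sleep between attempts */
    }

    if (attemptsp != NULL)
        *attemptsp = attempts;
    return (ret);
}
```

With a backoff like this, a thread waiting out a long checkpoint would make far fewer lock attempts, and the wait would show up as yields and sleeps rather than as user CPU time.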
Try to reproduce this situation: insert a sleep into checkpoints, run with aggressive sweeping, try a combination of drops, creates and cursor opens. No operation should block for the duration of the checkpoint.