While debugging WT-6729, I noticed behaviour where the eviction queue is full but workers aren't dequeuing pages and trying to evict them.
We frequently see this in test/format when the workload finishes: we shut down all threads and execute rollback to stable. With the changes in WT-6729, we wait until eviction quiesces, which it never does because of the behaviour described above.
This is caused by multiple eviction workers taking on the role of the eviction server. I believe the offending commit is here:
Our assumption is that, since we are the only worker thread in the system, we're free to release the pass lock because there is no other thread that could acquire it and become the server. However, the current thread count is not an accurate reflection of what is happening in the system.
When we stop a thread in a thread group, we signal to that thread to stop by clearing WT_THREAD_ACTIVE. It does not stop immediately; it only stops after the next iteration of whatever it is doing. It could easily be at the beginning of the eviction logic, in which case it will see that the pass lock is free and acquire it to become the server, even though the current thread count is 1.
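To make the interleaving concrete, here is a minimal single-threaded model of it in C. This is only a sketch of the theory above, not WiredTiger code: every name in it (`model_worker`, `model_group`, `MODEL_THREAD_ACTIVE`, `model_shutdown_race`, and so on) is hypothetical, the pass lock is reduced to a boolean, and it assumes the original server keeps acting as the server after it releases the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; none of these are WiredTiger's real structures. */
#define MODEL_THREAD_ACTIVE 0x1u

struct model_worker {
    unsigned flags; /* MODEL_THREAD_ACTIVE is cleared to request a stop */
    bool is_server; /* true while this worker acts as the eviction server */
};

struct model_group {
    int current_threads; /* thread count reported by the thread group */
    bool pass_lock_held; /* stand-in for the pass lock */
};

/* Try-lock: returns true only if the pass lock was free. */
static bool
model_try_pass_lock(struct model_group *g)
{
    if (g->pass_lock_held)
        return false;
    g->pass_lock_held = true;
    return true;
}

/* Start of one eviction iteration: whoever gets the pass lock is the server. */
static void
model_eviction_iteration(struct model_group *g, struct model_worker *w)
{
    if (model_try_pass_lock(g))
        w->is_server = true;
}

/* Replays the suspected interleaving step by step; returns the number of servers. */
static int
model_shutdown_race(void)
{
    struct model_group g = {2, false};
    struct model_worker a = {MODEL_THREAD_ACTIVE, false};
    struct model_worker b = {MODEL_THREAD_ACTIVE, false};

    /* 1. Worker A takes the pass lock and becomes the server. */
    model_eviction_iteration(&g, &a);
    assert(a.is_server && g.pass_lock_held);

    /* 2. The group shrinks: B is told to stop but has not checked its flag yet. */
    b.flags &= ~MODEL_THREAD_ACTIVE;
    g.current_threads = 1;

    /* 3. A sees a thread count of 1 and releases the pass lock (assumed to stay server). */
    if (g.current_threads == 1)
        g.pass_lock_held = false;

    /* 4. B runs one more iteration before it notices the cleared flag. */
    model_eviction_iteration(&g, &b);

    return (int)a.is_server + (int)b.is_server;
}
```

In this model, `model_shutdown_race()` returns 2: both workers believe they are the server, even though the group reports a thread count of 1 at the moment the lock is released.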
If I make this change, I don't see the problem anymore. Obviously, this isn't the fix because we specifically release the lock to avoid a deadlock, but it supports the theory that releasing the lock is what gets us into trouble.
- Check out wt-6729-hs-stop-rts.
- Add an assert like so:
- Schedule an Evergreen patch and run format stress testing.
Figure out a solution and fix the issue.