Affects Version/s: None
Fix Version/s: WT2.2
The following application reliably causes WiredTiger to hang.
The requirements are:
Connection configuration string: "cache_size=1MB"
Create configuration string: "key_format=r,value_format=i"
Create a thread that inserts 1 million items into the table out of order.
Create another thread that does a sequential read through the table.
The interesting part of this configuration is that the default leaf_page_max is 1MB, the same as the cache size. We eventually reach a state where the cache contains:
- A couple of metadata pages that can never be evicted.
- A root page for the tree that can never be evicted.
- Three pages of type WT_PAGE_COL_INT, with sizes of 256, 256 and 192 bytes. All of these are split-merge pages and cannot be chosen for eviction (the maximum level is 2, so they don't meet the threshold).
- One page of type WT_PAGE_COL_VAR, with a size of 1311495 bytes, which on its own exceeds the 1MB (1048576-byte) cache. Since one cursor is reading from this page and another is writing to it, we never get a chance to evict it.
This is backed up by a dump of the connection statistics after the application becomes hung:
Mar 19 16:11:10 0 cache: unmodified pages evicted
Mar 19 16:11:10 0 cache: modified pages evicted
Mar 19 16:11:10 1690764 cache: pages selected for eviction unable to be evicted
Mar 19 16:11:10 1690764 cache: hazard pointer blocked page eviction
Mar 19 16:11:10 153729 cache: eviction server unable to reach eviction goal
I think it's OK to not make progress in this situation, but we should recognize it and return a reasonable error (even if that means panicking the connection). We set the WT_EVICTION_STUCK flag in this case, but it's not clear to me what a reasonable test would be to tell that we are genuinely stuck.
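One possible shape for such a test, as a standalone sketch rather than WiredTiger code: compare successive snapshots of the counters from the statistics dump above, and report "stuck" only after several consecutive eviction passes in which the cache is over its limit, eviction attempts keep failing, and no page is evicted. Every name here (evict_snapshot, stuck_detector, the threshold parameter) is hypothetical and chosen for illustration. None of them exist in the WiredTiger source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the counters shown in the statistics dump. */
struct evict_snapshot {
    uint64_t pages_evicted;  /* unmodified + modified pages evicted */
    uint64_t evict_failures; /* pages selected for eviction but not evicted */
    uint64_t bytes_inuse;    /* bytes currently held in the cache */
    uint64_t cache_size;     /* configured cache size in bytes */
};

/* Hypothetical detector state: previous snapshot plus a no-progress counter. */
struct stuck_detector {
    struct evict_snapshot last;
    unsigned idle_passes; /* consecutive passes with failures and no evictions */
    unsigned threshold;   /* passes without progress before reporting stuck */
};

static void
stuck_detector_init(struct stuck_detector *d, unsigned threshold)
{
    assert(d != NULL && threshold > 0);
    d->last = (struct evict_snapshot){0, 0, 0, 0};
    d->idle_passes = 0;
    d->threshold = threshold;
}

/*
 * Feed one snapshot per eviction pass. Returns true once `threshold`
 * consecutive passes were over the cache limit, recorded new eviction
 * failures, and evicted nothing. Any successful eviction, or dropping
 * back under the limit, resets the counter.
 */
static bool
stuck_detector_update(struct stuck_detector *d, const struct evict_snapshot *now)
{
    assert(d != NULL && now != NULL);

    bool over_limit = now->bytes_inuse > now->cache_size;
    bool progressed = now->pages_evicted > d->last.pages_evicted;
    bool tried = now->evict_failures > d->last.evict_failures;

    if (over_limit && tried && !progressed)
        d->idle_passes++;
    else
        d->idle_passes = 0;

    d->last = *now;
    return d->idle_passes >= d->threshold;
}
```

With the numbers from this report (zero pages evicted, a failure count that keeps climbing, and a single 1311495-byte page in a 1048576-byte cache), a detector like this would trip after threshold passes. The open question remains how to choose the threshold so that a slow but recoverable workload is not misreported as stuck.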