-
Type:
Bug
-
Resolution: Cannot Reproduce
-
Priority:
Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
5
-
Storage - Ra 2020-06-15
While investigating WT-6257 I've seen a lot of cases failing with cache stuck full. The configuration I'm running is:
./t -h WT_TEST -t r -r 2 -W 8 -n 400000 -k 100000 -C "cache_size=100MB"
The failure and cache dump looks like:
[1590803721:854510][30113:0x7ff5bd0cc700], t, eviction-server: __evict_server, 453: Cache stuck for too long, giving up: Connection timed out [1590803721:854633][30113:0x7ff5bd0cc700], t, eviction-server: __wt_verbose_dump_cache, 2602: =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= [1590803721:854644][30113:0x7ff5bd0cc700], t, eviction-server: __wt_verbose_dump_cache, 2603: cache dump [1590803721:854651][30113:0x7ff5bd0cc700], t, eviction-server: __wt_verbose_dump_cache, 2605: cache full: no [1590803721:854658][30113:0x7ff5bd0cc700], t, eviction-server: __wt_verbose_dump_cache, 2607: cache clean check: yes (95.003%) [1590803721:854668][30113:0x7ff5bd0cc700], t, eviction-server: __wt_verbose_dump_cache, 2609: cache dirty check: yes (94.677%) [1590803721:854675][30113:0x7ff5bd0cc700], t, file:__wt0002.wt, eviction-server: __verbose_dump_cache_single, 2481: file:__wt0002.wt(<live>): [1590803721:854711][30113:0x7ff5bd0cc700], t, file:__wt0002.wt, eviction-server: __verbose_dump_cache_single, 2532: internal: 1 pages, 0MB, 0/1 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page [1590803721:854718][30113:0x7ff5bd0cc700], t, file:__wt0002.wt, eviction-server: __verbose_dump_cache_single, 2546: leaf: 2 pages, 0MB, 2/0 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page [1590803721:854723][30113:0x7ff5bd0cc700], t, file:__wt0001.wt, eviction-server: __verbose_dump_cache_single, 2481: file:__wt0001.wt(<live>): [1590803721:854733][30113:0x7ff5bd0cc700], t, file:__wt0001.wt, eviction-server: __verbose_dump_cache_single, 2532: internal: 1 pages, 0MB, 0/1 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page [1590803721:854738][30113:0x7ff5bd0cc700], t, file:__wt0001.wt, eviction-server: __verbose_dump_cache_single, 2534: leaf: 0 pages [1590803721:854742][30113:0x7ff5bd0cc700], t, file:__wt0000.wt, eviction-server: __verbose_dump_cache_single, 2481: file:__wt0000.wt(<live>): [1590803721:854762][30113:0x7ff5bd0cc700], t, file:__wt0000.wt, eviction-server: __verbose_dump_cache_single, 2532: internal: 1 pages, 0MB, 0/1 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page [1590803721:854766][30113:0x7ff5bd0cc700], t, file:__wt0000.wt, eviction-server: __verbose_dump_cache_single, 2546: leaf: 2 pages, 0MB, 2/0 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page [1590803721:854771][30113:0x7ff5bd0cc700], t, file:WiredTigerHS.wt, eviction-server: __verbose_dump_cache_single, 2481: file:WiredTigerHS.wt(<live>): [1590803721:854809][30113:0x7ff5bd0cc700], t, file:WiredTigerHS.wt, eviction-server: __verbose_dump_cache_single, 2532: internal: 1 pages, 0MB, 1/0 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page [1590803721:854819][30113:0x7ff5bd0cc700], t, file:WiredTigerHS.wt, eviction-server: __verbose_dump_cache_single, 2546: leaf: 1 pages, 87MB, 0/1 clean/dirty pages, 0/87 clean/dirty MB, 87MB max page, 87MB max dirty page [1590803721:854827][30113:0x7ff5bd0cc700], t, file:WiredTiger.wt, eviction-server: __verbose_dump_cache_single, 2481: file:WiredTiger.wt(<live>): [1590803721:854835][30113:0x7ff5bd0cc700], t, file:WiredTiger.wt, eviction-server: __verbose_dump_cache_single, 2532: internal: 1 pages, 0MB, 1/0 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page [1590803721:854840][30113:0x7ff5bd0cc700], t, file:WiredTiger.wt, eviction-server: __verbose_dump_cache_single, 2534: leaf: 0 pages [1590803721:854844][30113:0x7ff5bd0cc700], t, eviction-server: __wt_verbose_dump_cache, 2623: cache dump: total found: 94MB vs tracked inuse 95MB [1590803721:854848][30113:0x7ff5bd0cc700], t, eviction-server: __wt_verbose_dump_cache, 2624: total dirty bytes: 87MB [1590803721:854853][30113:0x7ff5bd0cc700], t, eviction-server: __wt_evict_thread_run, 328: cache eviction thread error: Connection timed out
The core issue here is that we've let a history store page grow to 87MB:
[1590803721:854819][30113:0x7ff5bd0cc700], t, file:WiredTigerHS.wt, eviction-server: __verbose_dump_cache_single, 2546: leaf: 1 pages, 87MB, 0/1 clean/dirty pages, 0/87 clean/dirty MB, 87MB max page, 87MB max dirty page
I have added a thread to test/checkpoint that lags the oldest transaction ID, to make it more commonly use the history store, which I suspect is contributing to seeing (and reproducing) the failure. Diff attached.
Last week we were evicting history store pages too aggressively - we addressed that in WT-6306 and WT-6229, but I suspect we've gone too far the other way now, and need to figure out how/when to get exclusive access to history store pages, so we can keep them to a reasonable size.
sulabh.mahajan have you seen any evidence of large history store pages?
- related to
-
WT-6305 Classify the format stress test timeouts
-
- Closed
-