WiredTiger / WT-7190

Limit eviction of non-history store pages when checkpoint is operating on history store

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: WT10.0.1, 4.4.7, 5.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Story Points: 8
    • Sprint: Storage - Ra 2021-03-22, Storage - Ra 2021-04-05, Storage - Ra 2021-04-19, Storage - Ra 2021-05-03

      The test (attached) is a heavy update workload in a PSA replica set (emrc:t, S down), which stresses the history store. It is configured with a 2 GB cache, which would be typical of a machine with ~4 GB of memory.

      • Resident memory usage rises, reaching ~4 GB, at which point I abruptly terminated mongod at B to simulate the crash due to OOM that could occur (it did not hit an actual OOM because it was run on a machine with >>4 GB of memory)
      • During recovery oplog application after B ("opcountersRepl update") we see a similar (actually somewhat worse) pattern, and the node would again hit OOM and crash around C, before recovery completes
      • The increase in resident memory is due to accumulation of pageheap_free_bytes, indicative of memory fragmentation
      • The step-function increases in fragmentation occur during checkpoints, when we allow dirty cache content to rise above 140% of the configured cache size.

      Fragmentation can occur when a large amount of memory is allocated in small regions, such as the update structures associated with dirty content, and is then freed but cannot be re-used for large structures such as pages read from disk. We put mechanisms in place to limit this fragmentation by limiting dirty cache to 20% and update structures to 10% of the cache, and I suspect that by allowing dirty cache content to greatly exceed these limits we are creating excessive memory fragmentation.

      Note that WT-6924 was put in place to eliminate the very large spikes of dirty content (to many times the cache size) that we were seeing before, but I'm not sure why we still allow dirty content to entirely fill the cache (and more) rather than limiting it to 20% as we normally do.

        1. repro1.js (1 kB)
        2. repro.sh (2 kB)
        3. image-2021-04-07-15-49-57-323.png (209 kB)
        4. image-2021-04-07-15-49-30-344.png (118 kB)
        5. image-2021-04-06-15-06-04-825.png (138 kB)
        6. image-2021-03-31-14-29-38-862.png (443 kB)
        7. image-2021-03-09-13-09-29-379.png (415 kB)
        8. image-2021-03-09-13-09-22-280.png (415 kB)
        9. fragmentation.png (192 kB)

            Assignee:
            haseeb.bokhari@mongodb.com Haseeb Bokhari (Inactive)
            Reporter:
            bruce.lucas@mongodb.com Bruce Lucas (Inactive)
            Votes:
            0
            Watchers:
            6

              Created:
              Updated:
              Resolved: