[SERVER-20248] Memory growth in __wt_session_get_btree in __checkpoint_worker under WiredTiger Created: 01/Sep/15  Updated: 11/Jan/16  Resolved: 03/Sep/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: 3.0.6

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: WTmem
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File close_idle_time=300.png     PNG File heap.png     PNG File heap2.png     PNG File mem.png     PNG File ss.png    
Issue Links:
Related
related to SERVER-17456 Mongodb 3.0 wiredTiger storage engine... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

This ticket is related to the latest issue discussed in SERVER-17456. The key features of the test are:

  • relatively large number of collections (16k)
  • 16 threads, each looping through 1k of the 16k collections, writing a record and calling createIndex (on an existing index, not a new one).
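
A rough WiredTiger-level analogue of the handle churn this workload produces, sketched directly against the WT C API (table names, count, and the WT_HOME path are illustrative and error checking is omitted; the real test of course drives mongod, which creates one WT table per collection and per index):

    #include <stdint.h>
    #include <stdio.h>
    #include <wiredtiger.h>

    int main(void) {
        WT_CONNECTION *conn;
        WT_SESSION *session;
        WT_CURSOR *cursor;
        char uri[64];
        int i;

        /* Error checking omitted for brevity; "WT_HOME" must already exist. */
        wiredtiger_open("WT_HOME", NULL, "create", &conn);
        conn->open_session(conn, NULL, NULL, &session);

        for (i = 0; i < 16000; i++) {
            /* One table per "collection"; mongod also creates one per index. */
            snprintf(uri, sizeof(uri), "table:coll%05d", i);
            session->create(session, uri, "key_format=q,value_format=S");
            session->open_cursor(session, uri, NULL, NULL, &cursor);
            cursor->set_key(cursor, (int64_t)1);
            cursor->set_value(cursor, "x");
            cursor->insert(cursor);
            cursor->close(cursor);
        }

        /* Each checkpoint has to visit a data handle for every table. */
        session->checkpoint(session, NULL);
        conn->close(conn, NULL);
        return (0);
    }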

Memory outside the WT cache is observed to grow steadily over a period of hours in a test on 3.0.5:

  • allocated memory outside of WT cache (3rd line) grows steadily.
  • there is a sudden increase corresponding to the known issue with WT journal allocation in 3.0.5.
  • aside from that sudden jump, the growth in non-cache memory is linear over time.
  • the number of cached cursors is also growing, and cached cursors are a known potential consumer of memory, but the shape of that growth curve does not match the linear growth of non-cache memory.

Memory profiling using tcmalloc HEAPPROFILE points to a likely culprit: various allocations within __checkpoint_worker (labeled "A" below) grow steadily and account for about 1.5 GB of non-cache memory by the end of this run:



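The heap profiles here were gathered with tcmalloc's HEAPPROFILE environment variable. For completeness, the in-process equivalent using the public gperftools heap-profiler API looks roughly like this (the output prefix is illustrative; link against -ltcmalloc):

    #include <gperftools/heap-profiler.h>

    int main(void) {
        HeapProfilerStart("/tmp/mongod-heap");  /* dumps go to /tmp/mongod-heap.NNNN.heap */
        /* ... run the workload under test ... */
        HeapProfilerDump("after-checkpoints");  /* force an intermediate dump */
        HeapProfilerStop();
        return 0;
    }
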
 Comments   
Comment by Bruce Lucas (Inactive) [ 03/Sep/15 ]

After a 5+ hour run we can confirm no memory growth on 3.0.6:

  • In the last two rows we see data handles being created and freed by checkpoints.
  • The numbers are consistent with 48k handles existing for the duration of the test for the 48k tables (16k collections + 32k indexes), and an additional 48k handles being created and destroyed on each checkpoint.
  • While memory isn't growing, we're seeing significant constant memory usage outside the cache on this test:
    • virtual memory is about 5 GB vs allocated memory of about 3.5 GB. This suggests significant fragmentation accounting for about 1.5 GB of extra memory.
    • allocated memory outside the cache is about 1.5 GB; I imagine this is largely the 48k data handles?
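
As a back-of-envelope check on that guess: 1.5 GB spread over 48k data handles works out to roughly 1.5e9 / 48,000 ≈ 31 KB per open handle, if the guess is right.
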
Comment by Bruce Lucas (Inactive) [ 02/Sep/15 ]

One-hour run with close_idle_time=300 (five minutes) shows no memory growth:

This supports the theory that the issue is an accumulation of handles created by the checkpoint.
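
For reference, close_idle_time is part of WiredTiger's file_manager configuration and tells the sweep server to close data handles that have been idle for that many seconds. A minimal sketch of passing it when opening WiredTiger directly (the home path is illustrative; the test here set it through mongod, via the engine config string):

    #include <wiredtiger.h>

    int main(void) {
        WT_CONNECTION *conn;

        /* Error checking omitted; close data handles idle for 300 seconds. */
        wiredtiger_open("WT_HOME", NULL,
            "create,file_manager=(close_idle_time=300)", &conn);
        /* ... workload ... */
        conn->close(conn, NULL);
        return (0);
    }

Under mongod the same string should be settable via storage.wiredTiger.engineConfig.configString (the --wiredTigerEngineConfigString option).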

There is still a large amount of memory outside the cache. About 800 MB is log slot buffers (due to the known issue with those in 3.0.5); I assume the rest is simply due to the large number of dhandles held open for the 48k tables (16k collections + 32k indexes).

The next run will be on 3.0.6 to (a) confirm whether the issue still exists there and (b) look at the new data-handle stats.

Comment by Bruce Lucas (Inactive) [ 02/Sep/15 ]

A couple of improvements to the tooling show more clearly that about 3 GB of memory allocated by __conn_dhandle_get within a checkpoint ("A" below), presumably not accounted for in the WT cache, rises linearly over the course of a 3-hour run.

This run had syncdelay set to 5 seconds vs. the default 60 seconds. That did not clearly increase the rate of memory growth, but I think that's because checkpoints were taking a very long time, so the number of checkpoints was about the same in either case.
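
For context, syncdelay controls how often mongod asks WiredTiger to checkpoint; at the WT API level each checkpoint is a single session call, and servicing it means acquiring a data handle for every table, which is where __conn_dhandle_get and __wt_session_get_btree show up in the profile. A minimal sketch of that call, assuming a conn obtained from wiredtiger_open:

    #include <wiredtiger.h>

    /* Force a single checkpoint; internally WT visits every table's data handle. */
    static int
    checkpoint_once(WT_CONNECTION *conn)
    {
        WT_SESSION *session;
        int ret;

        if ((ret = conn->open_session(conn, NULL, NULL, &session)) != 0)
            return (ret);
        ret = session->checkpoint(session, NULL);
        (void)session->close(session, NULL);
        return (ret);
    }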
