[SERVER-17424] WiredTiger uses substantially more memory than accounted for by cache Created: 02/Mar/15  Updated: 29/Oct/18  Resolved: 19/Jun/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: 3.0.3

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 8
Labels: wttt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File cache.png     PNG File cache_overhead=30.png     PNG File noncache.png     File perf-wt-reverse.html.gz     PNG File perf-wt-reverse.png     PNG File repro.png     PNG File simple.png     Text File ss.log     PNG File sysbench.png    
Issue Links:
Depends
is depended on by WT-1901 Issues resolved in WiredTiger 2.5.3 Closed
Duplicate
is duplicated by SERVER-17421 WiredTiger b-tree uses much more memo... Closed
is duplicated by SERVER-18231 Primary is unable to be reached when ... Closed
is duplicated by SERVER-18241 MongoDB 3.0 + WiredTiger process uses... Closed
Related
related to SERVER-16977 Memory increase trend when running ha... Closed
related to SERVER-17456 Mongodb 3.0 wiredTiger storage engine... Closed
related to SERVER-17495 Stand alone mongod throughput dropped... Closed
related to SERVER-17386 Cursor cache causes excessive memory ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Running sysbench shows the following timeline:

  • At A the cache has reached its configured limit (actually 85%, I believe). At B the insert phase of the benchmark ends and the update phase begins.
  • Second row shows allocated bytes as reported by tcmalloc. Per the documentation I believe these are the bytes requested by the application, so the figure does not include any tcmalloc overhead. This reaches approximately 6 GB.
  • Third row shows bytes in cache as reported by WT, which reaches 5 GB, about 20% less than the allocated bytes.
  • Last row shows the difference between allocated bytes and bytes in cache. This grows in proportion to the number of bytes in the cache, at a rate of about 20%, and stops growing at the point where bytes in cache stops growing. This suggests that WT allocates about 20% more bytes than are accounted for in the "bytes currently in cache" statistic when it is doing inserts. (See the sketch below the list for a quick way to spot-check this ratio.)
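
A minimal sketch for spot-checking this ratio directly from serverStatus is below. It assumes the allocated-bytes counter is exposed as tcmalloc.generic.current_allocated_bytes and the WT statistic as wiredTiger.cache["bytes currently in the cache"]; field paths may differ slightly between versions, and current_allocated_bytes also includes non-cache allocations, so the reported overhead is an upper bound on the cache accounting gap.

    mongo --quiet --eval '
        var ss = db.serverStatus({tcmalloc:1});
        var cache = ss.wiredTiger.cache["bytes currently in the cache"];  // WT accounting
        var alloc = ss.tcmalloc.generic.current_allocated_bytes;          // tcmalloc accounting (assumed path)
        print("bytes in cache:  " + cache);
        print("allocated bytes: " + alloc);
        print("overhead:        " + (100 * (alloc - cache) / cache).toFixed(1) + "%");
    '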

Call stack data was obtained by using perf to monitor calls to all tcmalloc entry points. Here's a partially expanded, reversed call tree for a shorter run, up to 449 MB allocated by WT. The timeline shows currently active memory (i.e. accounting for both allocates and frees) charged to each call site over the course of the run. The maximum bytes active over the run for each call site can be read from the "max.MB" column. The full call tree is attached as well.
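
One possible way to gather similar data (not necessarily the exact commands behind the attached profile) is to put uprobes on the tcmalloc entry points in the mongod binary and record call graphs with perf. The symbol names and probe group below are assumptions, and this captures allocation call stacks rather than the net active-bytes accounting shown in the timeline:

    # add uprobes for a few tcmalloc entry points (symbol names assumed)
    for sym in tc_malloc tc_free tc_new tc_delete tc_realloc; do
        sudo perf probe -x "$(which mongod)" "$sym"
    done

    # record call graphs for the probed calls against the running mongod for ~60s
    sudo perf record -e 'probe_mongod:*' -g -p "$(pgrep mongod)" -- sleep 60

    # browse the call tree (caller/callee orientation can be flipped in the report UI)
    sudo perf report -g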



 Comments   
Comment by Adrien Jarthon [ 29/Oct/18 ]

Ok done → SERVER-37795

Comment by Michael Cahill (Inactive) [ 28/Oct/18 ]

bigbourin@gmail.com, that is not normal. Please open a new ticket with details, including diagnostic data.

Comment by Adrien Jarthon [ 27/Oct/18 ]

Hi, is it considered normal that on 3.6.8 I see a memory usage overhead of 90% compared to the configured max cache size?

My configured max cache is 10G:

rs0:PRIMARY> db.serverStatus().version
3.6.8
rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]
10737418240

And my actual usage is 19G RES (21G VIRT):

rs0:PRIMARY> db.serverStatus().tcmalloc.tcmalloc.formattedString
------------------------------------------------
MALLOC:    16556355400 (15789.4 MiB) Bytes in use by application
MALLOC: +   3597598720 ( 3430.9 MiB) Bytes in page heap freelist
MALLOC: +    227350168 (  216.8 MiB) Bytes in central cache freelist
MALLOC: +       884736 (    0.8 MiB) Bytes in transfer cache freelist
MALLOC: +    290449952 (  277.0 MiB) Bytes in thread cache freelists
MALLOC: +     67035392 (   63.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  20739674368 (19778.9 MiB) Actual memory used (physical + swap)
MALLOC: +   1484984320 ( 1416.2 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  22224658688 (21195.1 MiB) Virtual address space used
MALLOC:
MALLOC:         286938              Spans in use
MALLOC:            886              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------

This has increased since 3.4 (another regression) and seems a bit excessive compared to the numbers reported in this ticket (and the others) for 3.0 and 3.2.

Comment by Bruce Lucas (Inactive) [ 19/Jun/15 ]

With the accounting change that went into 3.0.3 the overuse of memory fell from about 18% to about 4%, which is now on the same order as other non-cache uses of memory, and is no longer substantial overuse.

Comment by Bruce Lucas (Inactive) [ 18/Jun/15 ]

Behavior is considerably improved in 3.0.3. Running the repro above on a large machine until the amount of memory in the cache reaches 40 GB:

       cache   alloc   virtual   alloc/   virtual/
       (MB)    (MB)    (MB)      cache    cache
3.0.3  41061   42768   43538     1.042    1.060
3.0.0  40980   48551   49269     1.185    1.202

In 3.0.0, the excess memory consumed was about 19% (tcmalloc-reported allocated memory) or 20% (virtual memory). In 3.0.3 that has been reduced to about 4% and 6% respectively.

The 6% virtual memory overage will include memory in free pools and non-allocated overhead such as stacks that WT cannot be expected to account for.

The 4% allocated memory overage will include some fixed overhead allocated outside of WT, but that should be negligible relative to the 40GB, so presumably this mostly represents the approximate remaining underaccounting of memory by WT, which I believe includes things like internal fragmentation and other allocator overhead.
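
A minimal sketch for reproducing the cache/alloc/virtual columns on a running instance is below, assuming the figures are available as wiredTiger.cache["bytes currently in the cache"], tcmalloc.generic.current_allocated_bytes, and mem.virtual (reported in MB); exact field paths may vary by version:

    mongo --quiet --eval '
        var ss = db.serverStatus({tcmalloc:1});
        var cacheMB = ss.wiredTiger.cache["bytes currently in the cache"] / (1024*1024);
        var allocMB = ss.tcmalloc.generic.current_allocated_bytes / (1024*1024);  // assumed path
        var virtMB  = ss.mem.virtual;  // serverStatus mem figures are in MB
        print("cache(MB)="      + cacheMB.toFixed(0) +
              " alloc(MB)="     + allocMB.toFixed(0) +
              " virtual(MB)="   + virtMB +
              " alloc/cache="   + (allocMB / cacheMB).toFixed(3) +
              " virtual/cache=" + (virtMB / cacheMB).toFixed(3));
    '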

Comment by Ramon Fernandez Marina [ 15/May/15 ]

pauloandre1987@gmail.com, this issue has been scheduled for the current development cycle. Updates will be posted here as they become available, feel free to watch this ticket to be notified of updates.

Regards,
Ramón.

Comment by Paulo [ 15/May/15 ]

Hi,
Any news about this issue?

Paulo

Comment by Matt Parlane [ 27/Apr/15 ]

I've just started the logging now.

I'm not sure why I didn't think of this before, but I just realised that the machine is also running both MMS backup and monitoring daemons. I was thinking about what would cause the increase in open cursors and I realised they were both still running. Apologies for not clicking before, sorry if I've wasted your time.

I've just stopped the daemons, I'll see if memory usage still grows.

Matt

Comment by David Hows [ 27/Apr/15 ]

Hi Matt,

Thanks for uploading that information and sorry for the delay in getting back to you. Here are some numbers (presenting them here in text form because they don't vary much over the course of the data that you collected):

resident memory                            1544 MB
tcmalloc current_allocated_bytes           1403 MB
WT cache: bytes currently in the cache      819 MB
 
WT open cursor count                      21396

The resident memory (1544 MB) will include

  • non-allocated memory such as stacks and code
  • allocated memory (1403 MB), which includes
    • WT cache (819 MB)
    • other non-cache data structures

This ticket tracks the fact that bytes allocated for the WT cache (that is, the WT cache portion of tcmalloc current_allocated_bytes) are about 20-30% higher than the amount that WT accounts for in its "bytes currently in the cache" statistic. However, the difference between tcmalloc current_allocated_bytes (1403 MB) and WT bytes currently in the cache (819 MB) in your case is much larger than that 30%, so I suspect you are actually seeing the effect of other non-cache allocated data structures. In addition the number of cursors is quite large (21k), so you may also be seeing some memory usage due to excessive caching of cursors (tracked by SERVER-17386).
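
The four numbers above can be pulled from a single serverStatus document; a minimal sketch is below, assuming the usual field paths (mem.resident in MB, tcmalloc.generic.current_allocated_bytes, wiredTiger.cache["bytes currently in the cache"], wiredTiger.session["open cursor count"]), which may vary between versions:

    mongo --quiet --eval '
        var ss = db.serverStatus({tcmalloc:1});
        var MB = 1024*1024;
        print("resident memory (MB):        " + ss.mem.resident);
        print("tcmalloc allocated (MB):     " + (ss.tcmalloc.generic.current_allocated_bytes / MB).toFixed(0));  // assumed path
        print("WT bytes in cache (MB):      " + (ss.wiredTiger.cache["bytes currently in the cache"] / MB).toFixed(0));
        print("WT open cursor count:        " + ss.wiredTiger.session["open cursor count"]);  // assumed path
    '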

If you could do a longer serverStatus data collection where we will see a larger variation over time in memory usage, WT cache usage, cursor count, and possibly other counters, we may be able to get a clearer picture of what correlates with the memory usage in your workload. Could you collect serverStatus data as follows?

  • restart mongod
  • then immediately start serverStatus data collection while the cache is still empty, adjusting the sampling frequency to say 30 seconds to account for the much longer period of data collection

    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(30*1000)}" >ss.log &
    

  • let it run for a much longer period of time, if possible long enough to see rss creep up to the 6 GB figure you mentioned above.

Regards,
David

Comment by Matt Parlane [ 13/Apr/15 ]

The secondary is purely an offsite backup, nothing performs reads on it. I've got a script which shuts down the server every night, takes a snapshot of the data directory, and restarts it. The script is probably the reason I didn't pick up the leak till now – I was inadvertently working around it.

The machine is currently up-to-date, so it's applying updates more slowly than it would be if it was having to catch up – when I came in this morning it was sitting at about 6GB RSS, and I restarted it which brought it back to the normal level (whatever that is, I can't remember).

When I started the process, the RSS was at 1.459GB, when it finished it was at 1.506GB. From what I've seen it would have kept creeping up indefinitely. As I type this it's at 1.516GB.

Let me know if I can provide anything else. I can provide access to the server if required.

Comment by David Hows [ 13/Apr/15 ]

Hi Matt,

Sounds like you have something interesting here and I want to follow up.

The graphing utility is not available publicly, but I can share the following snippet, which gathers the required data from your MongoDB instances:

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000)}" >ss.log &

Would you be able to reproduce the load in question and run the snippet above against the secondary that is resyncing? Then attach the resulting ss.log file and I will generate the graphs.

In addition, can you give me a bit of insight into the workload here? Does the syncing secondary have any kind of read load routed to it? And what kind of load is on the primary that would be passed on to the secondary? Is it primarily inserts, updates or deletes that would be part of this workload?

Thanks,
David

Comment by Matt Parlane [ 10/Apr/15 ]

I'm possibly seeing the same issue while a replica set secondary is catching up on oplog updates. Within about 30 minutes of starting, the secondary has consumed almost all the RAM on the system (8GB) and eventually runs into allocation errors.

I'm running 3.0.2 with WiredTiger, with a 1GB cache size (also tried 2, 3 and 4 with same results).

uname -a:
Linux kapiti 3.13.0-39-generic #66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

See here for more info: https://groups.google.com/d/msg/mongodb-dev/q5YFjI16ZpA/eMvIkq115DoJ

Please let me know if I can provide more info – perhaps someone could tell me how to make those cool graphs?

Comment by Nick Judson [ 16/Mar/15 ]

Possible dup: SERVER-17386

Comment by Bruce Lucas (Inactive) [ 02/Mar/15 ]

Here's a simple repro. The issue seems to depend on the random inserts into indexes done by sysbench. I'm not sure whether the issue is restricted to that code path per se, or whether that is simply required to fill the cache while doing inserts (as the sequential inserts into the collection and the _id index don't fill the cache). We make the issue happen quickly by doing random inserts into 50 indexes from 10 threads.

    # parameters
    keys=50
    threads=10
    count=50000
 
    # start mongod
    rm -rf db db.log; mkdir -p db
    mongod --dbpath db --logpath db.log --fork --storageEngine wiredTiger
 
    # setup
    mongo --eval "
        for (var k=0; k<$keys; k++) {
            inx = {}
            inx['k'+k] = 1
            db.c.ensureIndex(inx)
        }
    "
 
    # collect tcmalloc data
    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000)}" >ss.log &
 
    # run test
    (
        for i in $(seq $threads); do
            mongo --eval "
                for (var i=0; i<$count; ) {
                    var bulk = db.c.initializeUnorderedBulkOp();
                    r = Math.random()
                    for (var j=0; j<1000; j++, i++) {
                        doc = {}
                        for (var k=0; k<$keys; k++)
                            doc['k'+k] = r
                        bulk.insert(doc)
                    }
                    bulk.execute();
                    print(i)
                }
            " &
        done
        wait
    )
 
    # stop tcmalloc data collection
    killall mongo

Result is shown in the attached graph (repro.png). As with the sysbench test, we see about 20% or so more bytes allocated than are reported to be in the cache.
