[SERVER-34382] Memory and swap consumption Created: 09/Apr/18  Updated: 27/Oct/23  Resolved: 03/May/18

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: 3.4.10, 3.4.14
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Benoit Bui Assignee: Bruce Lucas (Inactive)
Resolution: Works as Designed Votes: 0
Labels: SWNA
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Capture d’écran 2018-04-18 à 10.45.49.png     PNG File Capture d’écran 2018-05-02 à 09.23.03.png     File clust-users-1-shard10-arb-2.conf     File clust-users-1-shard8-1.conf     File clust-users-2-shard6-1.conf     File clust-users-2-shard9-arb-1.conf    
Issue Links:
Depends
depends on DOCS-11672 Document WT memory usage per data handle Closed
depends on WT-4053 Add statistic for data handle size Closed
Related
Participants:

 Description   

Hi,

On a single server we're hosting 4 different mongod instances from two different sharded clusters:

  • One primary from cluster1
  • One arbiter from cluster1
  • One secondary from cluster2
  • One arbiter from cluster2

The server has the following specs:
8 Cores
64GB RAM
4GB Swap

We have 20 servers with pretty much the same deployment:
  • One primary from one cluster
  • One secondary from the other cluster
  • One arbiter from each cluster

We are currently facing issues with memory and swap consumption.
Usually the primary consumes most of the memory and most of the swap.

We recently restarted the secondary on a server, and it appears the primary uses almost 40GB of RAM.

            total       used       free     shared    buffers     cached
Mem:         64347      40719      23627        452          5       2360
-/+ buffers/cache:      38353      25993
Swap:         4095       1419       2676

We tried limiting wiredTigerCacheSizeGB to 12GB per mongod a few weeks ago, but the issue still stands.
We tried lowering it to 10GB, but we started to have slowness and performance issues, so we went back up to 12GB.
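
For reference, the cache limit an instance is actually running with can be double-checked with something like the following (the port is only an example, and the serverStatus field name is assumed to be unchanged on 3.4):

    # print the configured WiredTiger cache size, in bytes, for one instance
    mongo --port 27205 --quiet --eval 'print(db.serverStatus().wiredTiger.cache["maximum bytes configured"])'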

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mongodb   5088  0.2  0.0 1021216 33016 ?       Ssl  Mar29  46:20 /opt/mongo/clust-users-1-shard10-arb-2 --config /etc/mongodb.d/clust-users-1-shard10-arb-2.conf
mongodb  24194  0.3  0.0 1025444 34460 ?       Ssl  Mar27  57:03 /opt/mongo/clust-users-2-shard9-arb-1 --config /etc/mongodb.d/clust-users-2-shard9-arb-1.conf
mongodb  12199 85.4 33.0 23322576 21776176 ?   Ssl  15:08  55:47 /opt/mongo/clust-users-2-shard6-1 --config /etc/mongodb.d/clust-users-2-shard6-1.conf
mongodb   9296  156 55.5 43653800 36571228 ?   Ssl  Mar20 46232:29 /opt/mongo/clust-users-1-shard8-1 --config /etc/mongodb.d/clust-users-1-shard8-1.conf

It can be seen that clust-users-1-shard8-1 (primary) uses 55.5% of the memory.
clust-users-2-shard6-1 (secondary) already uses 33% of the memory despite having been restarted less than a day ago and never having been a primary.

We do not understand how and why a single instance can use more than half the total memory of the server despite wiredTigerCacheSizeGB being set.

You'll find attached the configuration files for each instance, including the arbiters.

Thanks,
Benoit



 Comments   
Comment by Bruce Lucas (Inactive) [ 03/May/18 ]

Hi Benoit,

As indicated above, it is expected that WiredTiger will use about 30 kB of memory per open table (index or collection) that is not accounted for as part of the cache, so if you have a large number of tables you will need to account for this memory requirement, either by reducing the cache size correspondingly or by increasing the amount of memory on the machine.
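
As a rough way to gauge this overhead on a running instance, something like the following should work (it assumes the data-handle section of serverStatus exposes this counter on 3.4, and uses the approximate 30 kB per-table figure above):

    # estimate memory held by open data handles, outside the cache
    mongo --port 27205 --quiet --eval 'var dh = db.serverStatus().wiredTiger["data-handle"]["connection data handles currently active"]; print(dh + " active data handles, roughly " + Math.round(dh * 30 / 1024) + " MB of overhead outside the cache")'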

I've opened DOCS-11672 to document this requirement, and WT-4053 to ensure that the metrics we collect will make it easier to diagnose this memory consumption.

Thanks for reporting this; I'll close this ticket now because this is expected behavior.

Bruce

Comment by Benoit Bui [ 03/May/18 ]

Hi Team,

The issue is occurring more and more often and has a real impact on production.

Here's the swap consumption over the last 48 hours on our server with 4GB of swap.

 

Memory consumption doesn't seem to have fluctuated much over this period; only swap moved that much.

On the graph, when swap goes up to 100% it's because we restart the mongo instances.

 

Note that the swap usage gap isn't due to a restart; this happens during normal run time.

As usual, I've uploaded the `diagnostic.data` directory from the server onto the secure portal.

 

Thanks again.

 

Regards,

Benoit

Comment by Ian Whalen (Inactive) [ 30/Apr/18 ]

kelsey.schubert, can we assign this to you? It doesn't look like there's anything for the Storage team to do on this in its DWS state.

Comment by Bruce Lucas (Inactive) [ 25/Apr/18 ]

Hi Benoit,

We can confirm that a memory overhead of about 30 kB per open table is expected, and that this is not accounted for as part of the cache because it is not table data per se, but rather bookkeeping overhead. So for use cases such as yours, where you have a very large number of collections and indexes, you will need to take this into account when evaluating memory requirements.
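
As an illustration with hypothetical numbers: a node with 300,000 open tables would carry roughly 300,000 × 30 kB ≈ 9 GB of bookkeeping memory in addition to the configured cache.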

Bruce

Comment by Benoit Bui [ 25/Apr/18 ]

Hi,

I've added new diagnostic.data archives to the secure portal. We have a server (ss2-bl9) that consumed all its swap this week, up to the point where a mongod was killed by the oom-killer.
And regularly, more than 50% of swap is consumed.

I have a hard time understanding why it is swapping and why one mongod uses up to 60% of the memory (a bit more than 40GB in our case) despite the WiredTiger cache size being small.

Thanks
Benoit

Comment by Bruce Lucas (Inactive) [ 19/Apr/18 ]

Hi Benoit,

Thanks for the clarification, and understood. During startup a portion of the files are opened (specifically, the indexes) to do some start-up checks. The amount of memory used can then vary later on as files are opened and closed, depending on application access patterns. We're working now to verify that the magnitude of memory used is as expected.

Thanks,
Bruce

Comment by Benoit Bui [ 19/Apr/18 ]

Hi Bruce,

I'll just add some info here in case I did not say it, or at least did not say it clearly.
The issue we have with memory and swap doesn't occur at startup but quite randomly.
In the previous screen capture I sent, it happened at 7:25AM CEST and then at 8AM CEST.

I'll let you investigate the data you have.

Thanks,
Benoit

Comment by Bruce Lucas (Inactive) [ 18/Apr/18 ]

Thanks for the additional information, Benoit. From the ps and gdb data you uploaded previously we can see that, as you suspected, the increase in memory occurred at the point where WiredTiger opens a large number of files during startup. As you know, there is a separate WiredTiger file for each table, where a table is an index or a collection. There is some per-table overhead in WiredTiger, but we are still checking whether the amount of memory overhead per table that you are experiencing is expected.
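
Since every collection and index has its own file, a quick way to see how many tables a node has on disk is to count the .wt files under its dbpath, along these lines (the path is only an example):

    # one .wt file per WiredTiger table (collection or index)
    find /data/clust-users-1-shard8-1 -name '*.wt' | wc -l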

Bruce

Comment by Benoit Bui [ 18/Apr/18 ]

Hi,

We experienced the same issue during the night.

This shows the gap in swap usage. I've added the logs from the nodes on the server, and also the $dbpath/diagnostic.data directory, to the secure portal.

Thanks,
Benoit

Comment by Benoit Bui [ 17/Apr/18 ]

Hi Bruce,

I've uploaded the files you requested.
It's from another server and other nodes, but they show similar behaviour (both clusters do).

About the 10 minutes without any logs, we always attributed this behaviour to the listing of database files.
There's around 80GB to 200GB of data depending on the node, and around 10k to 20k databases (or at least chunks) per node.
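
For reference, the per-node counts can be tallied with something along these lines (standard mongo shell helpers; the port is only an example, and this will be slow with this many databases):

    # count databases, collections and indexes on one node
    mongo --port 27205 --quiet --eval 'var dbs = 0, colls = 0, idxs = 0; db.getMongo().getDBNames().forEach(function(n) { dbs++; var d = db.getSiblingDB(n); d.getCollectionNames().forEach(function(c) { colls++; idxs += d.getCollection(c).getIndexes().length; }); }); print(dbs + " databases, " + colls + " collections, " + idxs + " indexes")'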

Thanks,
Benoit

Comment by Bruce Lucas (Inactive) [ 16/Apr/18 ]

Hi Benoit,

Thanks for the log files. They show no unusual messages, but there is an unusual gap of about 10 minutes with no messages. I suspect the size of the excess may be related to the size of this gap. However, during the gap a number of things may be happening that don't leave a log trace. Also, this is occurring before our internal data collection has started, so it does not provide useful information about this problem.

2018-04-09T23:30:48.209+0200 I CONTROL  [main] ***** SERVER RESTARTED *****
2018-04-09T23:30:48.217+0200 I CONTROL  [initandlisten] MongoDB starting : pid=20199 port=27205 dbpath=/data/clust-users-1-shard5-1 64-bit host=ss1-bl5
2018-04-09T23:30:48.217+0200 I CONTROL  [initandlisten] db version v3.4.10
...
2018-04-09T23:30:48.220+0200 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=12288M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2018-04-09T23:39:25.884+0200 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2018-04-09T23:39:25.884+0200 I STORAGE  [initandlisten] The size storer reports that the oplog contains 15974184 records totaling to 10828896618 bytes
2018-04-09T23:39:25.884+0200 I STORAGE  [initandlisten] Sampling from the oplog between Apr  9 09:31:08:c8 and Apr  9 23:29:06:2f to determine where to place markers for truncation
2018-04-09T23:39:25.884+0200 I STORAGE  [initandlisten] Taking 1008 samples and assuming that each section of oplog contains approximately 158393 records totaling to 107374587 bytes
2018-04-09T23:39:26.553+0200 I STORAGE  [initandlisten] Placing a marker at optime Apr  9 09:49:29:4c
2018-04-09T23:39:26.553+0200 I STORAGE  [initandlisten] Placing a marker at optime Apr  9 09:59:49:85
...
2018-04-09T23:39:26.553+0200 I STORAGE  [initandlisten] Placing a marker at optime Apr  9 23:01:30:a
2018-04-09T23:39:26.553+0200 I STORAGE  [initandlisten] Placing a marker at optime Apr  9 23:15:40:27
 
   === 10-minute gap here ===
 
 
2018-04-09T23:49:11.249+0200 I CONTROL  [initandlisten] 
2018-04-09T23:49:11.249+0200 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2018-04-09T23:49:11.249+0200 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2018-04-09T23:49:11.249+0200 I CONTROL  [initandlisten] 
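
Incidentally, gaps like this can be located mechanically. A minimal sketch (assuming GNU date and a log file named mongod.log) that flags adjacent log lines more than 60 seconds apart:

    # flag adjacent mongod log lines whose timestamps are more than 60 seconds apart
    prev=""
    while read -r line; do
        ts=$(date -d "${line%% *}" +%s 2>/dev/null) || continue
        if [ -n "$prev" ] && [ $((ts - prev)) -gt 60 ]; then
            echo "gap of $((ts - prev))s before: $line"
        fi
        prev=$ts
    done < mongod.log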

However if my belief that the memory increase is occurring during this long gap is correct, we may be able to collect some external data using ps and gdb to help understand what is happening. Would you be able to restart one of the affected mongod processes, and then immediately start some data collection as follows:

    MONGOD_PID=... # pid of the newly restarted mongod
    # record memory usage (VSZ/RSS) once per second
    while true; do date; ps u $MONGOD_PID; sleep 1; done >ps.txt &
    # record stack traces of all threads every 5 seconds
    while true; do date; gdb -p $MONGOD_PID -batch -ex 'thread apply all bt'; sleep 5; done >gdb.txt &

This will capture, in parallel, memory usage information once per second, and stack traces every 5 seconds.

Then when mongod has reached the point where it is ready to accept connections, if it follows the same pattern as before it will be using excessive memory. At this point, please upload:

  • an archive of the contents of $dbpath/diagnostic.data (an example command follows this list)
  • mongod log covering the restart
  • ps.txt and gdb.txt collected as above
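
For example, the diagnostic.data archive could be produced with something like the following (the dbpath is a placeholder; adjust per instance):

    # archive the diagnostic.data directory from one instance's dbpath
    tar czf diagnostic-data-$(hostname)-$(date +%Y%m%d).tar.gz -C /data/clust-users-1-shard8-1 diagnostic.data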

Thanks,
Bruce

Comment by Benoit Bui [ 16/Apr/18 ]

Hi Bruce,

I've added the logs covering the span you asked on the portal.

Thanks,
Benoit

Comment by Bruce Lucas (Inactive) [ 16/Apr/18 ]

Hi Benoit,

The data in clust-users-1-shard5-1 and clust-users-2-shard3-1 both show a very large excess of 10-20 GB of memory usage. In both cases the increase happened very soon after startup on 04-09 (UTC).

Can you please also upload the mongod log files for those nodes covering the time between that restart and now? This may provide information that we need to understand the reason for the large increase in memory use.

Thanks,
Bruce

Comment by Benoit Bui [ 16/Apr/18 ]

Hi Bruce,

Thanks for your answer and the secure upload portal.

We had a server on which we had the issue in the last 7 days and which had a node restarted in the meantime.
I've uploaded archives of the 4 nodes hosted on the server.

Thanks,
Benoit

Comment by Bruce Lucas (Inactive) [ 12/Apr/18 ]

Hi Benoit,

How long after restarting a node does it take to develop this problem? I'm interested in seeing the history over time of memory usage by the mongod processes; we keep about a week or so of data in diagnostic.data. So:

  • If you have a node that you've restarted within the last week and that is showing the problem, can you please upload the data from that node.
  • If not, would it be possible for you to restart a node (ideally the whole machine) and then, when the problem develops, upload the data from that node.

By "data from that node" I mean, for both mongod processes running on the node

  • an archive of the diagnostic.data directory in dbpath of each process
  • the mongod log for each process

I've generated a secure upload portal for this data.

Thanks,
Bruce

Generated at Thu Feb 08 04:36:29 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.