[SERVER-61150] Primay Config Server cost too much memory and leads to occur OOM Created: 31/Oct/21  Updated: 29/Nov/21  Resolved: 29/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: A A Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: Bug, configsvr, memory-leak
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

I can't reproduce that perfectly. 

Participants:

 Description   

HI there,I have a group of mongoDB shard cluster in a docker clusters, with 2 mongos members, 1 config server (total: 3 replica members) and 2 shards(each shard also has 3 replica members.). Each member is running in one docker container. Unclear to know why, my primary config server was killed by OOM someday.

I have to increase much more memory for the config server to prevent it from OOM again, even I have given a wiredTiger cache limit like below in the config file. At first time, the config server running ok with 2G memory, but now, it's 8G and has been used almost 99%. And each of the two shard hold only 28G data at most.

# wiredTiger memory limit in  config file
wiredTigerCacheSizeGB = 1
---
# docker stats $dockerId
CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O             PIDS
d-f2uxxxp41y        3.86%               7.943 GiB / 8 GiB   99.29%              0 B / 0 B           39.52 GB / 1.539 TB   0

Here is what I can find in mongod log and db.serverStatus output. And I know the OS may take over all the memory which wiredTiger can't allocate, but I want to know how the OS could use so much and for what purpuse. So where I can figure it out? I hope any "comrade" here can offer some help:

 

# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  11684  1308 ?        Ss    2020   0:00 /bin/bash /entrypoint.sh
root       160  0.0  0.0  20612   804 ?        Ss    2020  12:33 crond
root       161 10.8  2.8 15042880 7611028 ?    SLl   2020 93735:18 /opt/jmiss_mongo/mongod --config /var/jmiss_mongo/config/configsvr.conf
---
# mongostat
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn                        set repl                time
    *0     9     *0     *0       0    11|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0   896b   68.7k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:41.036
    *0     2      2     *0      11    16|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  12.7k   80.1k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:42.036
    *0    *0     *0     *0       0    11|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  1.21k   70.4k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:43.036
    *0    *0     *0     *0       0     9|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0   793b   69.6k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:44.036
    *0    *0     *0     *0       1    11|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  1.15k   68.2k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:45.035
    *0    *0     *0     *0       5    11|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  5.71k   72.2k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:46.036
    *0    *0     *0     *0       1    11|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  1.15k   68.5k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:47.035
    *0    *0     *0     *0       0     7|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  1.17k   69.4k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:48.036
    *0    *0     *0     *0       1    13|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  1.27k   71.0k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:49.034
    *0    *0     *0     *0       0     6|0  0.0% 73.3%       1 14.3G 7.24G 0|0 1|0   920b   66.7k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:50.036

 

 

mgset-172xxx_configsvr:PRIMARY> show dbs
admin   0.000GB
config  0.149GB
local   2.596GB
mgset-172xxx_configsvr:PRIMARY> db.serverStatus().mem
{
  "bits" : 64,
  "resident" : 7432,
  "virtual" : 14690,
  "supported" : true,
  "mapped" : 0,
  "mappedWithJournal" : 0
}
mgset-172xxx_configsvr:PRIMARY> db.serverStatus().tcmalloc
{
  "generic" : {
      "current_allocated_bytes" : NumberLong("12324 143616"),
      "heap_size" : NumberLong("13836734464")
  },
  "tcmalloc" : {
      "pageheap_free_bytes" : 154 181632,
      "pageheap_unmapped_bytes" : 1024 569344,
      "max_total_thread_cache_bytes" : NumberLong(1073 741824),
      "current_total_thread_cache_bytes" : 24 271352,
      "total_free_bytes" : 333 839872,
      "central_cache_free_bytes" : 306 076200,
      "transfer_cache_free_bytes" : 3 492320,
      "thread_cache_free_bytes" : 24 271352,
      "aggressive_memory_decommit" : 0,
      "pageheap_committed_bytes" : NumberLong("12812 165120"),
      "pageheap_scavenge_count" : 20 856825,
      "pageheap_commit_count" : 27 858250,
      "pageheap_total_commit_bytes" : NumberLong("21789588 615168"),
      "pageheap_decommit_count" : 20 856825,
      "pageheap_total_decommit_bytes" : NumberLong("21776776 450048"),
      "pageheap_reserve_count" : 6709,
      "pageheap_total_reserve_bytes" : NumberLong("13836 734464"),
      "spinlock_total_delay_ns" : NumberLong("665771 200416"),
 
      ------------------------------------------------
      MALLOC:    12290579360 (11721.2 MiB) Bytes in use by application
      MALLOC: +    187674624 (  179.0 MiB) Bytes in page heap freelist
      MALLOC: +    306094552 (  291.9 MiB) Bytes in central cache freelist
      MALLOC: +      3133024 (    3.0 MiB) Bytes in transfer cache freelist
      MALLOC: +     24618024 (   23.5 MiB) Bytes in thread cache freelists
      MALLOC: +    112812288 (  107.6 MiB) Bytes in malloc metadata
      MALLOC:   ------------
      MALLOC: =  12924911872 (12326.2 MiB) Actual memory used (physical + swap)
      MALLOC: +   1024634880 (  977.2 MiB) Bytes released to OS (aka unmapped)
      MALLOC:   ------------
      MALLOC: =  13949546752 (13303.3 MiB) Virtual address space used
      MALLOC:
      MALLOC:        1696320              Spans in use
      MALLOC:             96              Thread heaps in use
      MALLOC:           4096              Tcmalloc page size
      ------------------------------------------------
      Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
      Bytes released to the OS take up virtual address space but no physical memory.
 
  }
}
mgset-172xxx_configsvr:PRIMARY> db.serverStatus().wiredTiger.cache
{
  "application threads page read from disk to cache count" : 23400,
  "application threads page read from disk to cache time (usecs)" : 3238111,
  "application threads page write from cache to disk count" : 44687084,
  "application threads page write from cache to disk time (usecs)" : 1003631736,
  "bytes belonging to page images in the cache" : 261364892,
  "bytes belonging to the cache overflow table in the cache" : 182,
  "bytes currently in the cache" : 773199299,
  "bytes not belonging to page images in the cache" : 511834406,
  "bytes read into cache" : 619883360,
  "bytes written from cache" : 723664179500,
  "cache overflow cursor application thread wait time (usecs)" : 0,
  "cache overflow cursor internal thread wait time (usecs)" : 0,
  "cache overflow score" : 0,
  "cache overflow table entries" : 0,
  "cache overflow table insert calls" : 0,
  "cache overflow table remove calls" : 0,
  "checkpoint blocked page eviction" : 6,
  "eviction calls to get a page" : 4827751,
  "eviction calls to get a page found queue empty" : 4789295,
  "eviction calls to get a page found queue empty after locking" : 1098,
  "eviction currently operating in aggressive mode" : 0,
  "eviction empty score" : 0,
  "eviction passes of a file" : 38655,
  "eviction server candidate queue empty when topping up" : 1661,
  "eviction server candidate queue not empty when topping up" : 1890,
  "eviction server evicting pages" : 0,
  "eviction server slept, because we did not make progress with eviction" : 1709793,
  "eviction server unable to reach eviction goal" : 0,
  "eviction state" : 32,
  "eviction walk target pages histogram - 0-9" : 37762,
  "eviction walk target pages histogram - 10-31" : 598,
  "eviction walk target pages histogram - 128 and higher" : 0,
  "eviction walk target pages histogram - 32-63" : 162,
  "eviction walk target pages histogram - 64-128" : 133,
  "eviction walks abandoned" : 3314,
  "eviction walks gave up because they restarted their walk twice" : 28244,
  "eviction walks gave up because they saw too many pages and found no candidates" : 1131,
  "eviction walks gave up because they saw too many pages and found too few candidates" : 181,
  "eviction walks reached end of tree" : 62763,
  "eviction walks started from root of tree" : 34971,
  "eviction walks started from saved location in tree" : 3684,
  "eviction worker thread active" : 4,
  "eviction worker thread created" : 0,
  "eviction worker thread evicting pages" : 46955,
  "eviction worker thread removed" : 0,
  "eviction worker thread stable number" : 0,
  "failed eviction of pages that exceeded the in-memory maximum count" : 11,
  "failed eviction of pages that exceeded the in-memory maximum time (usecs)" : 44,
  "files with active eviction walks" : 0,
  "files with new eviction walks started" : 34519,
  "force re-tuning of eviction workers once in a while" : 0,
  "hazard pointer blocked page eviction" : 361,
  "hazard pointer check calls" : 49572,
  "hazard pointer check entries walked" : 85708,
  "hazard pointer maximum array length" : 2,
  "in-memory page passed criteria to be split" : 4086,
  "in-memory page splits" : 1977,
  "internal pages evicted" : 2561,
  "internal pages split during eviction" : 26,
  "leaf pages split during eviction" : 2463,
  "maximum bytes configured" : 1073741824,
  "maximum page size at eviction" : 0,
  "modified pages evicted" : 32533,
  "modified pages evicted by application threads" : 0,
  "operations timed out waiting for space in cache" : 0,
  "overflow pages read into cache" : 0,
  "page split during eviction deepened the tree" : 1,
  "page written requiring cache overflow records" : 0,
  "pages currently held in the cache" : 8975,
  "pages evicted because they exceeded the in-memory maximum count" : 1995,
  "pages evicted because they exceeded the in-memory maximum time (usecs)" : 1163695,
  "pages evicted because they had chains of deleted items count" : 285,
  "pages evicted because they had chains of deleted items time (usecs)" : 25953,
  "pages evicted by application threads" : 0,
  "pages queued for eviction" : 37601,
  "pages queued for urgent eviction" : 9950,
  "pages queued for urgent eviction during walk" : 824,
  "pages read into cache" : 23433,
  "pages read into cache after truncate" : 7634,
  "pages read into cache after truncate in prepare state" : 0,
  "pages read into cache requiring cache overflow entries" : 0,
  "pages read into cache requiring cache overflow for checkpoint" : 0,
  "pages read into cache skipping older cache overflow entries" : 0,
  "pages read into cache with skipped cache overflow entries needed later" : 0,
  "pages read into cache with skipped cache overflow entries needed later by checkpoint" : 0,
  "pages requested from the cache" : 2622634339,
  "pages seen by eviction walk" : 356810,
  "pages selected for eviction unable to be evicted" : 478,
  "pages walked for eviction" : 9543938,
  "pages written from cache" : 44692387,
  "pages written requiring in-memory restoration" : 235,
  "percentage overhead" : 8,
  "tracked bytes belonging to internal pages in the cache" : 2325120,
  "tracked bytes belonging to leaf pages in the cache" : 770874179,
  "tracked dirty bytes in the cache" : 4434563,
  "tracked dirty pages in the cache" : 5,
  "unmodified pages evicted" : 14605
}

Can anyone give some help?



 Comments   
Comment by Edwin Zhou [ 29/Nov/21 ]

Hi 289291934a@gmail.com,

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Best,
Edwin

Comment by Edwin Zhou [ 18/Nov/21 ]

Hi 289291934a@gmail.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory and upload them to the support uploader location linked above?

Best,
Edwin

Comment by Edwin Zhou [ 04/Nov/21 ]

Hi 289291934a@gmail.com,

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Best,
Edwin

Generated at Thu Feb 08 05:51:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.