SERVER-61150 (Core Server)

Primary config server uses too much memory and gets killed by OOM

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: 4.0.3
    • Component/s: None
    • ALL

      I can't reproduce that perfectly. 

      Hi there, I have a MongoDB sharded cluster running in Docker, with 2 mongos members, 1 config server replica set (3 members) and 2 shards (each shard is also a 3-member replica set). Each member runs in its own Docker container. For reasons I don't understand, my primary config server was killed by the OOM killer one day.

      I had to give the config server much more memory to keep it from being OOM-killed again, even though I set a WiredTiger cache limit in the config file, as shown below. At first the config server ran fine with 2G of memory, but now it has 8G and is using almost 99% of it. Each of the two shards holds only 28G of data at most.

      # wiredTiger memory limit in  config file
      wiredTigerCacheSizeGB = 1
      ---
      # docker stats $dockerId
      CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O             PIDS
      d-f2uxxxp41y        3.86%               7.943 GiB / 8 GiB   99.29%              0 B / 0 B           39.52 GB / 1.539 TB   0
      

      Here is what I found in the mongod log and the db.serverStatus() output. I know the OS may use whatever memory WiredTiger does not allocate, but I want to know how it could use so much and for what purpose. Where can I look to figure that out? I hope someone here can offer some help:

       

      # ps aux
      USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
      root         1  0.0  0.0  11684  1308 ?        Ss    2020   0:00 /bin/bash /entrypoint.sh
      root       160  0.0  0.0  20612   804 ?        Ss    2020  12:33 crond
      root       161 10.8  2.8 15042880 7611028 ?    SLl   2020 93735:18 /opt/jmiss_mongo/mongod --config /var/jmiss_mongo/config/configsvr.conf
      ---
      # mongostat
      insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn                        set repl                time
          *0     9     *0     *0       0    11|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0   896b   68.7k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:41.036
          *0     2      2     *0      11    16|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  12.7k   80.1k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:42.036
          *0    *0     *0     *0       0    11|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  1.21k   70.4k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:43.036
          *0    *0     *0     *0       0     9|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0   793b   69.6k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:44.036
          *0    *0     *0     *0       1    11|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  1.15k   68.2k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:45.035
          *0    *0     *0     *0       5    11|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  5.71k   72.2k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:46.036
          *0    *0     *0     *0       1    11|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  1.15k   68.5k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:47.035
          *0    *0     *0     *0       0     7|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  1.17k   69.4k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:48.036
          *0    *0     *0     *0       1    13|0  0.4% 73.3%       0 14.3G 7.24G 0|0 1|0  1.27k   71.0k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:49.034
          *0    *0     *0     *0       0     6|0  0.0% 73.3%       1 14.3G 7.24G 0|0 1|0   920b   66.7k   30 mgset-1725952967_configsvr  PRI Oct 28 08:16:50.036
      
      

       

       

      mgset-172xxx_configsvr:PRIMARY> show dbs
      admin   0.000GB
      config  0.149GB
      local   2.596GB
      mgset-172xxx_configsvr:PRIMARY> db.serverStatus().mem
      {
        "bits" : 64,
        "resident" : 7432,
        "virtual" : 14690,
        "supported" : true,
        "mapped" : 0,
        "mappedWithJournal" : 0
      }
      mgset-172xxx_configsvr:PRIMARY> db.serverStatus().tcmalloc
      {
        "generic" : {
            "current_allocated_bytes" : NumberLong("12324143616"),
            "heap_size" : NumberLong("13836734464")
        },
        "tcmalloc" : {
            "pageheap_free_bytes" : 154181632,
            "pageheap_unmapped_bytes" : 1024569344,
            "max_total_thread_cache_bytes" : NumberLong(1073741824),
            "current_total_thread_cache_bytes" : 24271352,
            "total_free_bytes" : 333839872,
            "central_cache_free_bytes" : 306076200,
            "transfer_cache_free_bytes" : 3492320,
            "thread_cache_free_bytes" : 24271352,
            "aggressive_memory_decommit" : 0,
            "pageheap_committed_bytes" : NumberLong("12812165120"),
            "pageheap_scavenge_count" : 20856825,
            "pageheap_commit_count" : 27858250,
            "pageheap_total_commit_bytes" : NumberLong("21789588615168"),
            "pageheap_decommit_count" : 20856825,
            "pageheap_total_decommit_bytes" : NumberLong("21776776450048"),
            "pageheap_reserve_count" : 6709,
            "pageheap_total_reserve_bytes" : NumberLong("13836734464"),
            "spinlock_total_delay_ns" : NumberLong("665771200416"),
      
            ------------------------------------------------
            MALLOC:    12290579360 (11721.2 MiB) Bytes in use by application
            MALLOC: +    187674624 (  179.0 MiB) Bytes in page heap freelist
            MALLOC: +    306094552 (  291.9 MiB) Bytes in central cache freelist
            MALLOC: +      3133024 (    3.0 MiB) Bytes in transfer cache freelist
            MALLOC: +     24618024 (   23.5 MiB) Bytes in thread cache freelists
            MALLOC: +    112812288 (  107.6 MiB) Bytes in malloc metadata
            MALLOC:   ------------
            MALLOC: =  12924911872 (12326.2 MiB) Actual memory used (physical + swap)
            MALLOC: +   1024634880 (  977.2 MiB) Bytes released to OS (aka unmapped)
            MALLOC:   ------------
            MALLOC: =  13949546752 (13303.3 MiB) Virtual address space used
            MALLOC:
            MALLOC:        1696320              Spans in use
            MALLOC:             96              Thread heaps in use
            MALLOC:           4096              Tcmalloc page size
            ------------------------------------------------
            Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
            Bytes released to the OS take up virtual address space but no physical memory.
      
        }
      }
      mgset-172xxx_configsvr:PRIMARY> db.serverStatus().wiredTiger.cache
      {
        "application threads page read from disk to cache count" : 23400,
        "application threads page read from disk to cache time (usecs)" : 3238111,
        "application threads page write from cache to disk count" : 44687084,
        "application threads page write from cache to disk time (usecs)" : 1003631736,
        "bytes belonging to page images in the cache" : 261364892,
        "bytes belonging to the cache overflow table in the cache" : 182,
        "bytes currently in the cache" : 773199299,
        "bytes not belonging to page images in the cache" : 511834406,
        "bytes read into cache" : 619883360,
        "bytes written from cache" : 723664179500,
        "cache overflow cursor application thread wait time (usecs)" : 0,
        "cache overflow cursor internal thread wait time (usecs)" : 0,
        "cache overflow score" : 0,
        "cache overflow table entries" : 0,
        "cache overflow table insert calls" : 0,
        "cache overflow table remove calls" : 0,
        "checkpoint blocked page eviction" : 6,
        "eviction calls to get a page" : 4827751,
        "eviction calls to get a page found queue empty" : 4789295,
        "eviction calls to get a page found queue empty after locking" : 1098,
        "eviction currently operating in aggressive mode" : 0,
        "eviction empty score" : 0,
        "eviction passes of a file" : 38655,
        "eviction server candidate queue empty when topping up" : 1661,
        "eviction server candidate queue not empty when topping up" : 1890,
        "eviction server evicting pages" : 0,
        "eviction server slept, because we did not make progress with eviction" : 1709793,
        "eviction server unable to reach eviction goal" : 0,
        "eviction state" : 32,
        "eviction walk target pages histogram - 0-9" : 37762,
        "eviction walk target pages histogram - 10-31" : 598,
        "eviction walk target pages histogram - 128 and higher" : 0,
        "eviction walk target pages histogram - 32-63" : 162,
        "eviction walk target pages histogram - 64-128" : 133,
        "eviction walks abandoned" : 3314,
        "eviction walks gave up because they restarted their walk twice" : 28244,
        "eviction walks gave up because they saw too many pages and found no candidates" : 1131,
        "eviction walks gave up because they saw too many pages and found too few candidates" : 181,
        "eviction walks reached end of tree" : 62763,
        "eviction walks started from root of tree" : 34971,
        "eviction walks started from saved location in tree" : 3684,
        "eviction worker thread active" : 4,
        "eviction worker thread created" : 0,
        "eviction worker thread evicting pages" : 46955,
        "eviction worker thread removed" : 0,
        "eviction worker thread stable number" : 0,
        "failed eviction of pages that exceeded the in-memory maximum count" : 11,
        "failed eviction of pages that exceeded the in-memory maximum time (usecs)" : 44,
        "files with active eviction walks" : 0,
        "files with new eviction walks started" : 34519,
        "force re-tuning of eviction workers once in a while" : 0,
        "hazard pointer blocked page eviction" : 361,
        "hazard pointer check calls" : 49572,
        "hazard pointer check entries walked" : 85708,
        "hazard pointer maximum array length" : 2,
        "in-memory page passed criteria to be split" : 4086,
        "in-memory page splits" : 1977,
        "internal pages evicted" : 2561,
        "internal pages split during eviction" : 26,
        "leaf pages split during eviction" : 2463,
        "maximum bytes configured" : 1073741824,
        "maximum page size at eviction" : 0,
        "modified pages evicted" : 32533,
        "modified pages evicted by application threads" : 0,
        "operations timed out waiting for space in cache" : 0,
        "overflow pages read into cache" : 0,
        "page split during eviction deepened the tree" : 1,
        "page written requiring cache overflow records" : 0,
        "pages currently held in the cache" : 8975,
        "pages evicted because they exceeded the in-memory maximum count" : 1995,
        "pages evicted because they exceeded the in-memory maximum time (usecs)" : 1163695,
        "pages evicted because they had chains of deleted items count" : 285,
        "pages evicted because they had chains of deleted items time (usecs)" : 25953,
        "pages evicted by application threads" : 0,
        "pages queued for eviction" : 37601,
        "pages queued for urgent eviction" : 9950,
        "pages queued for urgent eviction during walk" : 824,
        "pages read into cache" : 23433,
        "pages read into cache after truncate" : 7634,
        "pages read into cache after truncate in prepare state" : 0,
        "pages read into cache requiring cache overflow entries" : 0,
        "pages read into cache requiring cache overflow for checkpoint" : 0,
        "pages read into cache skipping older cache overflow entries" : 0,
        "pages read into cache with skipped cache overflow entries needed later" : 0,
        "pages read into cache with skipped cache overflow entries needed later by checkpoint" : 0,
        "pages requested from the cache" : 2622634339,
        "pages seen by eviction walk" : 356810,
        "pages selected for eviction unable to be evicted" : 478,
        "pages walked for eviction" : 9543938,
        "pages written from cache" : 44692387,
        "pages written requiring in-memory restoration" : 235,
        "percentage overhead" : 8,
        "tracked bytes belonging to internal pages in the cache" : 2325120,
        "tracked bytes belonging to leaf pages in the cache" : 770874179,
        "tracked dirty bytes in the cache" : 4434563,
        "tracked dirty pages in the cache" : 5,
        "unmodified pages evicted" : 14605
      }
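
      To make the numbers above easier to compare, here is a minimal Python sketch that splits the tcmalloc heap into "WiredTiger cache" and "everything else". The function name memory_breakdown and the labels it returns are made up for illustration, and the input values are copied from the serverStatus output above. It assumes that WiredTiger allocates its cache through tcmalloc, so the cache bytes are counted inside current_allocated_bytes.

```python
GIB = 1024 ** 3


def memory_breakdown(status):
    """Split tcmalloc heap usage into WiredTiger cache and everything else.

    `status` is a dict shaped like the db.serverStatus() document; only the
    fields read below are required. (Hypothetical helper, not a MongoDB API.)
    """
    generic = status["tcmalloc"]["generic"]
    tc = status["tcmalloc"]["tcmalloc"]
    cache = status["wiredTiger"]["cache"]

    allocated = generic["current_allocated_bytes"]
    cache_used = cache["bytes currently in the cache"]
    return {
        "allocated_by_mongod": allocated,
        "wiredtiger_cache_used": cache_used,
        "wiredtiger_cache_limit": cache["maximum bytes configured"],
        # Assumption: the cache is allocated via tcmalloc, so the remainder
        # is memory mongod holds outside the WiredTiger cache.
        "allocated_outside_cache": allocated - cache_used,
        # Freed by mongod but still held in tcmalloc free lists.
        "free_but_held_by_tcmalloc": tc["pageheap_free_bytes"] + tc["total_free_bytes"],
        # Already handed back to the OS (virtual address space only).
        "returned_to_os": tc["pageheap_unmapped_bytes"],
    }


# Values copied from the serverStatus output in this report.
reported = {
    "tcmalloc": {
        "generic": {
            "current_allocated_bytes": 12324143616,
            "heap_size": 13836734464,
        },
        "tcmalloc": {
            "pageheap_free_bytes": 154181632,
            "pageheap_unmapped_bytes": 1024569344,
            "total_free_bytes": 333839872,
        },
    },
    "wiredTiger": {
        "cache": {
            "bytes currently in the cache": 773199299,
            "maximum bytes configured": 1073741824,
        }
    },
}

for name, value in memory_breakdown(reported).items():
    print(f"{name:28s} {value / GIB:6.2f} GiB")
```

      With these inputs the WiredTiger cache is below its 1 GiB limit, while roughly 11.5 GB of the memory allocated by mongod sits outside the cache. That suggests the cache limit is being honored and the growth is elsewhere in the process, although this breakdown alone does not say where.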
      

      Can anyone give some help?

            Assignee:
            edwin.zhou@mongodb.com Edwin Zhou
            Reporter:
            289291934a@gmail.com A A