-
Type: Bug
-
Resolution: Done
-
Priority: Major - P3
-
Affects Version/s: 4.0.3
-
Component/s: None
-
ALL
Hi there. I run a MongoDB sharded cluster in Docker with 2 mongos instances, 1 config server replica set (3 members), and 2 shards (each shard is also a 3-member replica set). Each member runs in its own Docker container. For reasons I have not been able to determine, my primary config server was killed by the OOM killer one day.
To keep it from being OOM-killed again I have had to give the config server much more memory, even though the config file sets a wiredTiger cache limit, as shown below. At first the config server ran fine with 2G of memory; now it has 8G and almost 99% of that is in use. Each of the two shards holds at most 28G of data.
# wiredTiger memory limit in config file
wiredTigerCacheSizeGB = 1
---
# docker stats $dockerId
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d-f2uxxxp41y 3.86% 7.943 GiB / 8 GiB 99.29% 0 B / 0 B 39.52 GB / 1.539 TB 0
Below is what I can find in the mongod log and the db.serverStatus() output. I know the OS may use whatever memory is not allocated to wiredTiger, but I want to know how it could use so much, and for what purpose. Where can I look to figure this out? I hope someone here can offer some help:
# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 11684 1308 ? Ss 2020 0:00 /bin/bash /entrypoint.sh
root 160 0.0 0.0 20612 804 ? Ss 2020 12:33 crond
root 161 10.8 2.8 15042880 7611028 ? SLl 2020 93735:18 /opt/jmiss_mongo/mongod --config /var/jmiss_mongo/config/configsvr.conf
---
# mongostat
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 9 *0 *0 0 11|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 896b 68.7k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:41.036
*0 2 2 *0 11 16|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 12.7k 80.1k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:42.036
*0 *0 *0 *0 0 11|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 1.21k 70.4k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:43.036
*0 *0 *0 *0 0 9|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 793b 69.6k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:44.036
*0 *0 *0 *0 1 11|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 1.15k 68.2k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:45.035
*0 *0 *0 *0 5 11|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 5.71k 72.2k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:46.036
*0 *0 *0 *0 1 11|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 1.15k 68.5k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:47.035
*0 *0 *0 *0 0 7|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 1.17k 69.4k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:48.036
*0 *0 *0 *0 1 13|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 1.27k 71.0k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:49.034
*0 *0 *0 *0 0 6|0 0.0% 73.3% 1 14.3G 7.24G 0|0 1|0 920b 66.7k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:50.036
mgset-172xxx_configsvr:PRIMARY> show dbs
admin 0.000GB
config 0.149GB
local 2.596GB
mgset-172xxx_configsvr:PRIMARY> db.serverStatus().mem
{
"bits" : 64,
"resident" : 7432,
"virtual" : 14690,
"supported" : true,
"mapped" : 0,
"mappedWithJournal" : 0
}
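The resident-memory readings above come from three tools that use different units, so it helps to put them on one scale before comparing them with the 8 GiB container limit. The following is a minimal sketch in plain JavaScript (run here with Node.js), assuming `ps` reports RSS in KiB and `db.serverStatus().mem` reports MiB; the helper and variable names are hypothetical, and the numbers are copied from the output above.

```javascript
// Hypothetical unit helpers, not part of any MongoDB API.
function kibToGib(kib) { return kib / (1024 * 1024); }
function mibToGib(mib) { return mib / 1024; }

// Values copied from the ps and serverStatus output above.
var psRssKib = 7611028;   // RSS column of `ps aux` for the mongod process
var residentMib = 7432;   // db.serverStatus().mem.resident
var virtualMib = 14690;   // db.serverStatus().mem.virtual

var memReport = {
  psRssGib: kibToGib(psRssKib),
  residentGib: mibToGib(residentMib),
  virtualGib: mibToGib(virtualMib)
};

// Print each reading rounded to two decimals.
Object.keys(memReport).forEach(function (key) {
  console.log(key + ": " + memReport[key].toFixed(2) + " GiB");
});
```

Under these assumptions the `ps` and `serverStatus` resident figures agree at about 7.26 GiB, close to the 7.24G that mongostat shows in its `res` column, so all three tools describe the same mongod process footprint.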
mgset-172xxx_configsvr:PRIMARY> db.serverStatus().tcmalloc
{
"generic" : {
"current_allocated_bytes" : NumberLong("12324143616"),
"heap_size" : NumberLong("13836734464")
},
"tcmalloc" : {
"pageheap_free_bytes" : 154181632,
"pageheap_unmapped_bytes" : 1024569344,
"max_total_thread_cache_bytes" : NumberLong(1073741824),
"current_total_thread_cache_bytes" : 24271352,
"total_free_bytes" : 333839872,
"central_cache_free_bytes" : 306076200,
"transfer_cache_free_bytes" : 3492320,
"thread_cache_free_bytes" : 24271352,
"aggressive_memory_decommit" : 0,
"pageheap_committed_bytes" : NumberLong("12812165120"),
"pageheap_scavenge_count" : 20856825,
"pageheap_commit_count" : 27858250,
"pageheap_total_commit_bytes" : NumberLong("21789588615168"),
"pageheap_decommit_count" : 20856825,
"pageheap_total_decommit_bytes" : NumberLong("21776776450048"),
"pageheap_reserve_count" : 6709,
"pageheap_total_reserve_bytes" : NumberLong("13836734464"),
"spinlock_total_delay_ns" : NumberLong("665771200416"),
------------------------------------------------
MALLOC: 12290579360 (11721.2 MiB) Bytes in use by application
MALLOC: + 187674624 ( 179.0 MiB) Bytes in page heap freelist
MALLOC: + 306094552 ( 291.9 MiB) Bytes in central cache freelist
MALLOC: + 3133024 ( 3.0 MiB) Bytes in transfer cache freelist
MALLOC: + 24618024 ( 23.5 MiB) Bytes in thread cache freelists
MALLOC: + 112812288 ( 107.6 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 12924911872 (12326.2 MiB) Actual memory used (physical + swap)
MALLOC: + 1024634880 ( 977.2 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 13949546752 (13303.3 MiB) Virtual address space used
MALLOC:
MALLOC: 1696320 Spans in use
MALLOC: 96 Thread heaps in use
MALLOC: 4096 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
}
}
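The tcmalloc counters above can be split into memory that mongod is actively using and memory that tcmalloc holds in its free lists. The sketch below is plain JavaScript (run here with Node.js) with a hypothetical helper, `tcmallocBreakdown`. It rests on an assumed reading of the counters: `heap_size` minus `pageheap_unmapped_bytes` gives the committed heap, and subtracting the free lists from that gives the bytes in use. The figures in this output satisfy both relations exactly.

```javascript
// Values copied from db.serverStatus().tcmalloc above
// (any spaces inside the digit groups removed).
var tcmallocStats = {
  currentAllocatedBytes: 12324143616, // generic.current_allocated_bytes
  heapSize: 13836734464,              // generic.heap_size
  pageheapFreeBytes: 154181632,       // tcmalloc.pageheap_free_bytes
  pageheapUnmappedBytes: 1024569344,  // tcmalloc.pageheap_unmapped_bytes
  totalFreeBytes: 333839872           // tcmalloc.total_free_bytes
};

// Hypothetical helper: split the heap into committed, cached-free and in-use parts.
function tcmallocBreakdown(t) {
  // Heap that is still mapped, i.e. not released back to the OS.
  var committed = t.heapSize - t.pageheapUnmappedBytes;
  // Memory already freed by mongod but still held in tcmalloc free lists.
  var cachedFree = t.pageheapFreeBytes + t.totalFreeBytes;
  return {
    committedBytes: committed,
    cachedFreeBytes: cachedFree,
    inUseBytes: committed - cachedFree
  };
}

var GIB = 1024 * 1024 * 1024;
var heapBreakdown = tcmallocBreakdown(tcmallocStats);
console.log("committed:   " + (heapBreakdown.committedBytes / GIB).toFixed(2) + " GiB");
console.log("cached free: " + (heapBreakdown.cachedFreeBytes / GIB).toFixed(2) + " GiB");
console.log("in use:      " + (heapBreakdown.inUseBytes / GIB).toFixed(2) + " GiB");
```

With these numbers the free lists account for less than 0.5 GiB, so nearly all of the committed heap is memory the application still holds rather than memory tcmalloc has failed to return.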
mgset-172xxx_configsvr:PRIMARY> db.serverStatus().wiredTiger.cache
{
"application threads page read from disk to cache count" : 23400,
"application threads page read from disk to cache time (usecs)" : 3238111,
"application threads page write from cache to disk count" : 44687084,
"application threads page write from cache to disk time (usecs)" : 1003631736,
"bytes belonging to page images in the cache" : 261364892,
"bytes belonging to the cache overflow table in the cache" : 182,
"bytes currently in the cache" : 773199299,
"bytes not belonging to page images in the cache" : 511834406,
"bytes read into cache" : 619883360,
"bytes written from cache" : 723664179500,
"cache overflow cursor application thread wait time (usecs)" : 0,
"cache overflow cursor internal thread wait time (usecs)" : 0,
"cache overflow score" : 0,
"cache overflow table entries" : 0,
"cache overflow table insert calls" : 0,
"cache overflow table remove calls" : 0,
"checkpoint blocked page eviction" : 6,
"eviction calls to get a page" : 4827751,
"eviction calls to get a page found queue empty" : 4789295,
"eviction calls to get a page found queue empty after locking" : 1098,
"eviction currently operating in aggressive mode" : 0,
"eviction empty score" : 0,
"eviction passes of a file" : 38655,
"eviction server candidate queue empty when topping up" : 1661,
"eviction server candidate queue not empty when topping up" : 1890,
"eviction server evicting pages" : 0,
"eviction server slept, because we did not make progress with eviction" : 1709793,
"eviction server unable to reach eviction goal" : 0,
"eviction state" : 32,
"eviction walk target pages histogram - 0-9" : 37762,
"eviction walk target pages histogram - 10-31" : 598,
"eviction walk target pages histogram - 128 and higher" : 0,
"eviction walk target pages histogram - 32-63" : 162,
"eviction walk target pages histogram - 64-128" : 133,
"eviction walks abandoned" : 3314,
"eviction walks gave up because they restarted their walk twice" : 28244,
"eviction walks gave up because they saw too many pages and found no candidates" : 1131,
"eviction walks gave up because they saw too many pages and found too few candidates" : 181,
"eviction walks reached end of tree" : 62763,
"eviction walks started from root of tree" : 34971,
"eviction walks started from saved location in tree" : 3684,
"eviction worker thread active" : 4,
"eviction worker thread created" : 0,
"eviction worker thread evicting pages" : 46955,
"eviction worker thread removed" : 0,
"eviction worker thread stable number" : 0,
"failed eviction of pages that exceeded the in-memory maximum count" : 11,
"failed eviction of pages that exceeded the in-memory maximum time (usecs)" : 44,
"files with active eviction walks" : 0,
"files with new eviction walks started" : 34519,
"force re-tuning of eviction workers once in a while" : 0,
"hazard pointer blocked page eviction" : 361,
"hazard pointer check calls" : 49572,
"hazard pointer check entries walked" : 85708,
"hazard pointer maximum array length" : 2,
"in-memory page passed criteria to be split" : 4086,
"in-memory page splits" : 1977,
"internal pages evicted" : 2561,
"internal pages split during eviction" : 26,
"leaf pages split during eviction" : 2463,
"maximum bytes configured" : 1073741824,
"maximum page size at eviction" : 0,
"modified pages evicted" : 32533,
"modified pages evicted by application threads" : 0,
"operations timed out waiting for space in cache" : 0,
"overflow pages read into cache" : 0,
"page split during eviction deepened the tree" : 1,
"page written requiring cache overflow records" : 0,
"pages currently held in the cache" : 8975,
"pages evicted because they exceeded the in-memory maximum count" : 1995,
"pages evicted because they exceeded the in-memory maximum time (usecs)" : 1163695,
"pages evicted because they had chains of deleted items count" : 285,
"pages evicted because they had chains of deleted items time (usecs)" : 25953,
"pages evicted by application threads" : 0,
"pages queued for eviction" : 37601,
"pages queued for urgent eviction" : 9950,
"pages queued for urgent eviction during walk" : 824,
"pages read into cache" : 23433,
"pages read into cache after truncate" : 7634,
"pages read into cache after truncate in prepare state" : 0,
"pages read into cache requiring cache overflow entries" : 0,
"pages read into cache requiring cache overflow for checkpoint" : 0,
"pages read into cache skipping older cache overflow entries" : 0,
"pages read into cache with skipped cache overflow entries needed later" : 0,
"pages read into cache with skipped cache overflow entries needed later by checkpoint" : 0,
"pages requested from the cache" : 2622634339,
"pages seen by eviction walk" : 356810,
"pages selected for eviction unable to be evicted" : 478,
"pages walked for eviction" : 9543938,
"pages written from cache" : 44692387,
"pages written requiring in-memory restoration" : 235,
"percentage overhead" : 8,
"tracked bytes belonging to internal pages in the cache" : 2325120,
"tracked bytes belonging to leaf pages in the cache" : 770874179,
"tracked dirty bytes in the cache" : 4434563,
"tracked dirty pages in the cache" : 5,
"unmodified pages evicted" : 14605
}
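To see whether the wiredTiger cache itself explains the memory growth, its fill level can be compared with the total that tcmalloc reports as allocated. The sketch below is plain JavaScript (run here with Node.js) with a hypothetical helper, `cacheSummary`. It assumes that `generic.current_allocated_bytes` from `db.serverStatus().tcmalloc` (12324143616 in this ticket) includes the cache contents, so the difference approximates memory allocated outside the cache.

```javascript
// Values copied from db.serverStatus().wiredTiger.cache above.
var wtCache = {
  maxBytes: 1073741824,    // "maximum bytes configured"
  bytesInCache: 773199299, // "bytes currently in the cache"
  dirtyBytes: 4434563      // "tracked dirty bytes in the cache"
};

// Copied from db.serverStatus().tcmalloc.generic.current_allocated_bytes.
var allocatedBytes = 12324143616;

// Hypothetical helper: cache fill levels plus the allocation not explained by the cache.
function cacheSummary(cache, allocated) {
  return {
    usedPct: 100 * cache.bytesInCache / cache.maxBytes,
    dirtyPct: 100 * cache.dirtyBytes / cache.maxBytes,
    outsideCacheBytes: allocated - cache.bytesInCache
  };
}

var cacheReport = cacheSummary(wtCache, allocatedBytes);
console.log("cache used:    " + cacheReport.usedPct.toFixed(1) + " %");
console.log("cache dirty:   " + cacheReport.dirtyPct.toFixed(1) + " %");
console.log("outside cache: " +
  (cacheReport.outsideCacheBytes / (1024 * 1024 * 1024)).toFixed(2) + " GiB");
```

Under that assumption the cache sits at roughly 72% of its 1 GiB limit, while more than 10 GiB of allocations lie outside it. This points the investigation away from the cache setting, although it does not identify which part of mongod holds the remaining memory.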
Can anyone give some help?