- Type: Bug
- Resolution: Done
- Priority: Major - P3
- None
- Affects Version/s: 4.0.3
- Component/s: None
- ALL
Hi there, I have a MongoDB sharded cluster running in Docker, with 2 mongos members, 1 config server replica set (3 members in total) and 2 shards (each shard is also a 3-member replica set). Each member runs in its own Docker container. For reasons I can't determine, my primary config server was killed by the OOM killer one day.
I had to give the config server much more memory to keep it from being OOM-killed again, even though I have set a WiredTiger cache limit in the config file, as shown below. At first the config server ran fine with 2G of memory, but now it has 8G and almost 99% of that is in use. Each of the two shards holds only 28G of data at most.
# wiredTiger memory limit in config file
wiredTigerCacheSizeGB = 1
---
# docker stats $dockerId
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d-f2uxxxp41y 3.86% 7.943 GiB / 8 GiB 99.29% 0 B / 0 B 39.52 GB / 1.539 TB 0
Here is what I can find in the mongod log and the db.serverStatus() output. I know the OS may take over all the memory that WiredTiger can't allocate, but I want to know how the OS could use so much, and for what purpose. Where can I find that out? I hope any "comrade" here can offer some help:
# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 11684 1308 ? Ss 2020 0:00 /bin/bash /entrypoint.sh
root 160 0.0 0.0 20612 804 ? Ss 2020 12:33 crond
root 161 10.8 2.8 15042880 7611028 ? SLl 2020 93735:18 /opt/jmiss_mongo/mongod --config /var/jmiss_mongo/config/configsvr.conf
---
# mongostat
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 9 *0 *0 0 11|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 896b 68.7k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:41.036
*0 2 2 *0 11 16|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 12.7k 80.1k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:42.036
*0 *0 *0 *0 0 11|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 1.21k 70.4k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:43.036
*0 *0 *0 *0 0 9|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 793b 69.6k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:44.036
*0 *0 *0 *0 1 11|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 1.15k 68.2k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:45.035
*0 *0 *0 *0 5 11|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 5.71k 72.2k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:46.036
*0 *0 *0 *0 1 11|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 1.15k 68.5k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:47.035
*0 *0 *0 *0 0 7|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 1.17k 69.4k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:48.036
*0 *0 *0 *0 1 13|0 0.4% 73.3% 0 14.3G 7.24G 0|0 1|0 1.27k 71.0k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:49.034
*0 *0 *0 *0 0 6|0 0.0% 73.3% 1 14.3G 7.24G 0|0 1|0 920b 66.7k 30 mgset-1725952967_configsvr PRI Oct 28 08:16:50.036
mgset-172xxx_configsvr:PRIMARY> show dbs
admin 0.000GB
config 0.149GB
local 2.596GB
mgset-172xxx_configsvr:PRIMARY> db.serverStatus().mem
{
  "bits" : 64,
  "resident" : 7432,
  "virtual" : 14690,
  "supported" : true,
  "mapped" : 0,
  "mappedWithJournal" : 0
}
mgset-172xxx_configsvr:PRIMARY> db.serverStatus().tcmalloc
{
  "generic" : {
    "current_allocated_bytes" : NumberLong("12324143616"),
    "heap_size" : NumberLong("13836734464")
  },
  "tcmalloc" : {
    "pageheap_free_bytes" : 154181632,
    "pageheap_unmapped_bytes" : 1024569344,
    "max_total_thread_cache_bytes" : NumberLong(1073741824),
    "current_total_thread_cache_bytes" : 24271352,
    "total_free_bytes" : 333839872,
    "central_cache_free_bytes" : 306076200,
    "transfer_cache_free_bytes" : 3492320,
    "thread_cache_free_bytes" : 24271352,
    "aggressive_memory_decommit" : 0,
    "pageheap_committed_bytes" : NumberLong("12812165120"),
    "pageheap_scavenge_count" : 20856825,
    "pageheap_commit_count" : 27858250,
    "pageheap_total_commit_bytes" : NumberLong("21789588615168"),
    "pageheap_decommit_count" : 20856825,
    "pageheap_total_decommit_bytes" : NumberLong("21776776450048"),
    "pageheap_reserve_count" : 6709,
    "pageheap_total_reserve_bytes" : NumberLong("13836734464"),
    "spinlock_total_delay_ns" : NumberLong("665771200416"),
------------------------------------------------
MALLOC:    12290579360 (11721.2 MiB) Bytes in use by application
MALLOC: +    187674624 (  179.0 MiB) Bytes in page heap freelist
MALLOC: +    306094552 (  291.9 MiB) Bytes in central cache freelist
MALLOC: +      3133024 (    3.0 MiB) Bytes in transfer cache freelist
MALLOC: +     24618024 (   23.5 MiB) Bytes in thread cache freelists
MALLOC: +    112812288 (  107.6 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  12924911872 (12326.2 MiB) Actual memory used (physical + swap)
MALLOC: +   1024634880 (  977.2 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  13949546752 (13303.3 MiB) Virtual address space used
MALLOC:
MALLOC:        1696320 Spans in use
MALLOC:             96 Thread heaps in use
MALLOC:           4096 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
  }
}
mgset-172xxx_configsvr:PRIMARY> db.serverStatus().wiredTiger.cache
{
  "application threads page read from disk to cache count" : 23400,
  "application threads page read from disk to cache time (usecs)" : 3238111,
  "application threads page write from cache to disk count" : 44687084,
  "application threads page write from cache to disk time (usecs)" : 1003631736,
  "bytes belonging to page images in the cache" : 261364892,
  "bytes belonging to the cache overflow table in the cache" : 182,
  "bytes currently in the cache" : 773199299,
  "bytes not belonging to page images in the cache" : 511834406,
  "bytes read into cache" : 619883360,
  "bytes written from cache" : 723664179500,
  "cache overflow cursor application thread wait time (usecs)" : 0,
  "cache overflow cursor internal thread wait time (usecs)" : 0,
  "cache overflow score" : 0,
  "cache overflow table entries" : 0,
  "cache overflow table insert calls" : 0,
  "cache overflow table remove calls" : 0,
  "checkpoint blocked page eviction" : 6,
  "eviction calls to get a page" : 4827751,
  "eviction calls to get a page found queue empty" : 4789295,
  "eviction calls to get a page found queue empty after locking" : 1098,
  "eviction currently operating in aggressive mode" : 0,
  "eviction empty score" : 0,
  "eviction passes of a file" : 38655,
  "eviction server candidate queue empty when topping up" : 1661,
  "eviction server candidate queue not empty when topping up" : 1890,
  "eviction server evicting pages" : 0,
  "eviction server slept, because we did not make progress with eviction" : 1709793,
  "eviction server unable to reach eviction goal" : 0,
  "eviction state" : 32,
  "eviction walk target pages histogram - 0-9" : 37762,
  "eviction walk target pages histogram - 10-31" : 598,
  "eviction walk target pages histogram - 128 and higher" : 0,
  "eviction walk target pages histogram - 32-63" : 162,
  "eviction walk target pages histogram - 64-128" : 133,
  "eviction walks abandoned" : 3314,
  "eviction walks gave up because they restarted their walk twice" : 28244,
  "eviction walks gave up because they saw too many pages and found no candidates" : 1131,
  "eviction walks gave up because they saw too many pages and found too few candidates" : 181,
  "eviction walks reached end of tree" : 62763,
  "eviction walks started from root of tree" : 34971,
  "eviction walks started from saved location in tree" : 3684,
  "eviction worker thread active" : 4,
  "eviction worker thread created" : 0,
  "eviction worker thread evicting pages" : 46955,
  "eviction worker thread removed" : 0,
  "eviction worker thread stable number" : 0,
  "failed eviction of pages that exceeded the in-memory maximum count" : 11,
  "failed eviction of pages that exceeded the in-memory maximum time (usecs)" : 44,
  "files with active eviction walks" : 0,
  "files with new eviction walks started" : 34519,
  "force re-tuning of eviction workers once in a while" : 0,
  "hazard pointer blocked page eviction" : 361,
  "hazard pointer check calls" : 49572,
  "hazard pointer check entries walked" : 85708,
  "hazard pointer maximum array length" : 2,
  "in-memory page passed criteria to be split" : 4086,
  "in-memory page splits" : 1977,
  "internal pages evicted" : 2561,
  "internal pages split during eviction" : 26,
  "leaf pages split during eviction" : 2463,
  "maximum bytes configured" : 1073741824,
  "maximum page size at eviction" : 0,
  "modified pages evicted" : 32533,
  "modified pages evicted by application threads" : 0,
  "operations timed out waiting for space in cache" : 0,
  "overflow pages read into cache" : 0,
  "page split during eviction deepened the tree" : 1,
  "page written requiring cache overflow records" : 0,
  "pages currently held in the cache" : 8975,
  "pages evicted because they exceeded the in-memory maximum count" : 1995,
  "pages evicted because they exceeded the in-memory maximum time (usecs)" : 1163695,
  "pages evicted because they had chains of deleted items count" : 285,
  "pages evicted because they had chains of deleted items time (usecs)" : 25953,
  "pages evicted by application threads" : 0,
  "pages queued for eviction" : 37601,
  "pages queued for urgent eviction" : 9950,
  "pages queued for urgent eviction during walk" : 824,
  "pages read into cache" : 23433,
  "pages read into cache after truncate" : 7634,
  "pages read into cache after truncate in prepare state" : 0,
  "pages read into cache requiring cache overflow entries" : 0,
  "pages read into cache requiring cache overflow for checkpoint" : 0,
  "pages read into cache skipping older cache overflow entries" : 0,
  "pages read into cache with skipped cache overflow entries needed later" : 0,
  "pages read into cache with skipped cache overflow entries needed later by checkpoint" : 0,
  "pages requested from the cache" : 2622634339,
  "pages seen by eviction walk" : 356810,
  "pages selected for eviction unable to be evicted" : 478,
  "pages walked for eviction" : 9543938,
  "pages written from cache" : 44692387,
  "pages written requiring in-memory restoration" : 235,
  "percentage overhead" : 8,
  "tracked bytes belonging to internal pages in the cache" : 2325120,
  "tracked bytes belonging to leaf pages in the cache" : 770874179,
  "tracked dirty bytes in the cache" : 4434563,
  "tracked dirty pages in the cache" : 5,
  "unmodified pages evicted" : 14605
}
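To put the figures side by side, here is a rough Python sketch of the arithmetic, using numbers copied from the serverStatus output in this report. The variable names and the `to_mib` helper are only illustrative (they are not part of any MongoDB tool), and the script only compares the reported counters; it does not say where the extra memory goes.

```python
MIB = 1024 * 1024

# Figures copied from the serverStatus output in this report.
resident_mib = 7432                    # mem.resident, reported in MiB
tcmalloc_allocated = 12_324_143_616    # tcmalloc generic.current_allocated_bytes
wt_cache_bytes = 773_199_299           # "bytes currently in the cache"
wt_cache_max = 1_073_741_824           # "maximum bytes configured" (1 GiB)


def to_mib(num_bytes):
    """Convert a byte count to MiB (illustrative helper)."""
    return num_bytes / MIB


# How full the WiredTiger cache is relative to its configured limit.
cache_fill_pct = 100 * wt_cache_bytes / wt_cache_max

# Bytes the allocator reports as in use that are not WiredTiger cache content.
outside_cache_bytes = tcmalloc_allocated - wt_cache_bytes

# Resident memory of mongod that the WiredTiger cache does not account for.
resident_outside_cache_mib = resident_mib - to_mib(wt_cache_bytes)

print(f"WiredTiger cache fill: {cache_fill_pct:.1f}% of configured maximum")
print(f"tcmalloc allocated outside the cache: {to_mib(outside_cache_bytes):.1f} MiB")
print(f"resident memory outside the cache: {resident_outside_cache_mib:.1f} MiB")
```

So the cache itself stays under its 1 GiB limit, while most of what the allocator reports as in use lies outside the WiredTiger cache. That gap is what I am trying to explain.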
Can anyone give some help?