Type: Bug
Resolution: Gone away
Priority: Major - P3
Affects Version/s: 8.0.17
Component/s: None
Operating System: ALL
Setup: a sharded cluster with 24 shards spread across 9 VMs (sh.status attached), running mongodb-org-server 8.0.17:
be09-be rxxxx:/# dpkg -l | grep mongodb-org
ii mongodb-org-mongos 8.0.17 amd64 MongoDB sharded cluster query router
ii mongodb-org-server 8.0.17
There is a repeated pattern where a secondary that is otherwise consuming stable levels of memory starts consuming memory linearly until it exhausts the available VM memory (see the attached screenshot plotting the RSS of an affected mongod process).
The time this increase starts aligns with daily VM snapshots taken on the underlying infrastructure.
Note: this occurs only on secondaries and is random (i.e. it does not happen every day, nor always to the same secondaries).
We have ruled out the application, as the RSS is much higher than the WT cache (db.serverStatus output attached).
BE09
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2603361 mongodb 20 0 12.4g 8.5g 76672 S 27.7 13.7 4d+23h /usr/bin/mongod --config /etc/mongod-cpe_shard19.conf
2606513 mongodb 20 0 9968136 3.9g 75136 S 32.7 6.2 5d+2h /usr/bin/mongod --config /etc/mongod-cpe_shard24.conf
db.serverStatus().wiredTiger.cache["bytes currently in the cache"] = 1020866997
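To make the gap concrete, a small sketch using the figures reported above (the 8.5g RES from top for pid 2603361 and the WT cache bytes from serverStatus); the arithmetic is illustrative only:

```python
# Quantify the gap between process RSS and the WT cache size,
# using the numbers from the top output and serverStatus above.
wt_cache_bytes = 1020866997            # "bytes currently in the cache"
rss_bytes = int(8.5 * 1024**3)         # 8.5g RES reported by top (pid 2603361)

wt_cache_gib = wt_cache_bytes / 1024**3
ratio = rss_bytes / wt_cache_bytes
print(f"WT cache: {wt_cache_gib:.2f} GiB, RSS/WT-cache ratio: {ratio:.1f}x")
# → WT cache: 0.95 GiB, RSS/WT-cache ratio: 8.9x
```

So the resident set is roughly 9x the WT cache, which is why the cache itself is excluded as the source of the growth.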
The correlation we found with this sudden memory increase is an abnormal rise in db.serverStatus().wiredTiger.cursor['open cursor count'].
This value is abnormally high on the secondary while memory is in the linear increase phase, and it continues to grow together with used memory.
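The check we apply can be sketched as follows. This is a hypothetical monitoring helper, not part of MongoDB; the sample values are illustrative, and in practice each sample would be taken periodically from db.serverStatus() and /proc/&lt;pid&gt;/status:

```python
# Hypothetical sketch: flag a node whose open cursor count and RSS both
# rise monotonically across consecutive samples (the pattern described above).
def leak_suspected(samples, min_points=3):
    """samples: list of (open_cursor_count, rss_bytes) tuples, oldest first."""
    if len(samples) < min_points:
        return False
    cursors = [s[0] for s in samples]
    rss = [s[1] for s in samples]
    rising = lambda xs: all(a < b for a, b in zip(xs, xs[1:]))
    return rising(cursors) and rising(rss)

# Illustrative samples taken a few minutes apart (not real cluster data).
steady = [(120, 4_000_000_000), (118, 4_010_000_000), (121, 4_005_000_000)]
leaking = [(120, 4_000_000_000), (950, 5_200_000_000), (2400, 6_800_000_000)]
print(leak_suspected(steady), leak_suspected(leaking))  # → False True
```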
This might not be correlated, but looking at this mongod process's logs (log timestamps are UTC+8 vs the chart in UTC), it seems that around the time the memory increase starts there were replica set primary changes (likely caused by the VM snapshots).
mongod-cpe_shard19.log.1:{"t":{"$date":"2026-01-29T00:09:27.357+08:00"},"s":"I", "c":"NETWORK", "id":6006301, "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Replica set primary server change detected","attr":{"replicaSet":"lrs19","topologyType":"ReplicaSetWithPrimary","primary":"be09-internal:37039","durationMillis":167232791}}
mongod-cpe_shard19.log.1:{"t":{"$date":"2026-01-29T00:12:55.429+08:00"},"s":"I", "c":"NETWORK", "id":6006301, "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Replica set primary server change detected","attr":{"replicaSet":"lrs19","topologyType":"ReplicaSetWithPrimary","primary":"be08-internal:37039","durationMillis":208072}}
mongod-cpe_shard19.log.1:{"t":{"$date":"2026-01-29T00:13:17.200+08:00"},"s":"I", "c":"NETWORK", "id":6006301, "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Replica set primary server change detected","attr":{"replicaSet":"lrs19","topologyType":"ReplicaSetNoPrimary","primary":"Unknown","durationMillis":21771}}
mongod-cpe_shard19.log.1:{"t":{"$date":"2026-01-29T00:13:18.448+08:00"},"s":"I", "c":"NETWORK", "id":6006301, "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Replica set primary server change detected","attr":{"replicaSet":"lrs19","topologyType":"ReplicaSetWithPrimary","primary":"be07-internal:37039","durationMillis":1247}}
Attached:
- sh.status;
- top on all servers, sorted by memory;
- db.serverStatus and db.currentOp(true) from the affected mongod node while in the memory increase phase;
- Grafana chart showing the process RSS memory increase;
- pmap output of the affected process (pid 2603361);
- mongod-cpe_shard19.conf.
Is related to:
- SERVER-110832 OplogWriter should not hold on to session through rollbacks (Closed)