Type: Bug
Resolution: Gone away
Priority: Major - P3
Affects Version/s: 8.0.17
Component/s: None
Operating System: ALL
Setup: a sharded cluster with 24 shards spread across 9 VMs (sh.status attached), running mongodb-org-server 8.0.17:
be09-be rxxxx:/# dpkg -l | grep mongodb-org
ii mongodb-org-mongos 8.0.17 amd64 MongoDB sharded cluster query router
ii mongodb-org-server 8.0.17
There is a repeated pattern where a secondary that is otherwise consuming stable levels of memory starts consuming memory linearly until it exhausts the available VM memory (see the attached screenshot plotting the RSS of an affected mongod process).
The time this increase starts aligns with daily VM snapshots taken on the underlying infrastructure.
Note: this occurs only on secondaries and is random (i.e. it does not happen every day, nor always to the same secondaries).
We have ruled out the application, as the RSS is much higher than the WT cache (db.serverStatus output attached).
BE09
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2603361 mongodb 20 0 12.4g 8.5g 76672 S 27.7 13.7 4d+23h /usr/bin/mongod --config /etc/mongod-cpe_shard19.conf
2606513 mongodb 20 0 9968136 3.9g 75136 S 32.7 6.2 5d+2h /usr/bin/mongod --config /etc/mongod-cpe_shard24.conf
db.serverStatus().wiredTiger.cache["bytes currently in the cache"] = 1020866997
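To make the gap concrete, a small sketch using the figures reported above (the 8.5g RES from top for pid 2603361 and the WT cache bytes from serverStatus); the arithmetic is illustrative only:

```python
# Quantify the gap between process RSS and the WT cache size,
# using the numbers from the top output and serverStatus above.
wt_cache_bytes = 1020866997            # "bytes currently in the cache"
rss_bytes = int(8.5 * 1024**3)         # 8.5g RES reported by top (pid 2603361)

wt_cache_gib = wt_cache_bytes / 1024**3
ratio = rss_bytes / wt_cache_bytes
print(f"WT cache: {wt_cache_gib:.2f} GiB, RSS/WT-cache ratio: {ratio:.1f}x")
# → WT cache: 0.95 GiB, RSS/WT-cache ratio: 8.9x
```

So the resident set is roughly 9x the WT cache, which is why the cache itself is excluded as the source of the growth.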
The correlation we found with this sudden memory increase is an abnormal rise in db.serverStatus().wiredTiger.cursor['open cursor count'].
This value is abnormally high on the secondary while memory is in the linear increase phase, and it continues to grow together with used memory.
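The check we apply can be sketched as follows. This is a hypothetical monitoring helper, not part of MongoDB; the sample values are illustrative, and in practice each sample would be taken periodically from db.serverStatus() and /proc/&lt;pid&gt;/status:

```python
# Hypothetical sketch: flag a node whose open cursor count and RSS both
# rise monotonically across consecutive samples (the pattern described above).
def leak_suspected(samples, min_points=3):
    """samples: list of (open_cursor_count, rss_bytes) tuples, oldest first."""
    if len(samples) < min_points:
        return False
    cursors = [s[0] for s in samples]
    rss = [s[1] for s in samples]
    rising = lambda xs: all(a < b for a, b in zip(xs, xs[1:]))
    return rising(cursors) and rising(rss)

# Illustrative samples taken a few minutes apart (not real cluster data).
steady = [(120, 4_000_000_000), (118, 4_010_000_000), (121, 4_005_000_000)]
leaking = [(120, 4_000_000_000), (950, 5_200_000_000), (2400, 6_800_000_000)]
print(leak_suspected(steady), leak_suspected(leaking))  # → False True
```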
This might not be correlated, but looking at this mongod process's logs (log timestamps are UTC+8 vs the chart in UTC), it seems that around the time the memory increase starts there were replica set primary changes (likely caused by the VM snapshots).
mongod-cpe_shard19.log.1:{"t":{"$date":"2026-01-29T00:09:27.357+08:00"},"s":"I", "c":"NETWORK", "id":6006301, "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Replica set primary server change detected","attr":{"replicaSet":"lrs19","topologyType":"ReplicaSetWithPrimary","primary":"be09-internal:37039","durationMillis":167232791}}
mongod-cpe_shard19.log.1:{"t":{"$date":"2026-01-29T00:12:55.429+08:00"},"s":"I", "c":"NETWORK", "id":6006301, "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Replica set primary server change detected","attr":{"replicaSet":"lrs19","topologyType":"ReplicaSetWithPrimary","primary":"be08-internal:37039","durationMillis":208072}}
mongod-cpe_shard19.log.1:{"t":{"$date":"2026-01-29T00:13:17.200+08:00"},"s":"I", "c":"NETWORK", "id":6006301, "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Replica set primary server change detected","attr":{"replicaSet":"lrs19","topologyType":"ReplicaSetNoPrimary","primary":"Unknown","durationMillis":21771}}
mongod-cpe_shard19.log.1:{"t":{"$date":"2026-01-29T00:13:18.448+08:00"},"s":"I", "c":"NETWORK", "id":6006301, "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Replica set primary server change detected","attr":{"replicaSet":"lrs19","topologyType":"ReplicaSetWithPrimary","primary":"be07-internal:37039","durationMillis":1247}}
Attached:
- sh.status;
- top on all servers, sorted by memory;
- db.serverStatus and db.currentOp(true) from the affected mongod node while in the memory increase phase;
- Grafana chart showing the process RSS memory increase;
- pmap output of the affected process (pid 2603361);
- mongod-cpe_shard19.conf.
Is related to:
- SERVER-110832 OplogWriter should not hold on to session through rollbacks (Closed)