Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: 5.0.14
Component/s: None
Environment: Ubuntu 18.04.6 LTS
XFS
Kernel - 5.4.0-1088-aws #96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Transparent Huge Pages disabled
AWS m6i.4xlarge
SSD gp3, 450 GB
Hello,
A couple of days ago, one of the secondaries on one of our shards started consuming excessive memory (this is not directly connected to the problem itself; it was only the trigger).
We had to reboot the instance, which caused its load to move to another secondary. That one froze due to high memory consumption as well, and we rebooted it too.
We fixed the problem causing the high memory consumption, and the instances started up.
After they started, connections to the replica set were very slow. When I tried to connect to the replica set from Studio 3T, the connection hung on the authentication step for a minute or so before going through. Normally a connection is established within a second.
We tried restarting the mongod processes, which didn't help. Then, on a blind guess, we shut down one of the secondaries and restarted the processes again, and the shard went (almost) back to normal. So we are now running on only two replica set members.
Since this incident, some delete operations freeze in a sharded collection (one where we perform cross-shard transactions), but only on that shard.
Here's the full lifecycle of documents in that collection:
- we open a transaction in which we change a document in another sharded collection (no problem with this one) and insert a document into the collection in question
- in another process we read a batch (5000 items) of documents from the collection in question
- we process that batch
- and delete the batch by _id using deleteMany
Sometimes the delete operation freezes on the server side, but only on the shard where we had the incident.
We noticed that even when the operation freezes, some documents still get deleted.
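For anyone trying to reproduce this, the lifecycle above can be sketched roughly as follows. This is only an illustration, not our production code: it assumes `client` is a pymongo `MongoClient` and `coll` a pymongo `Collection`, and the collection names `other_coll` and `queue_coll`, the `touched` field, and the `process` callback are hypothetical placeholders.

```python
from typing import Any, Callable

BATCH_SIZE = 5000  # batch size mentioned in the report


def produce(client: Any, db_name: str, other_id: Any, new_doc: dict) -> None:
    """Step 1: in one transaction, change a document in another collection
    and insert a document into the collection in question."""
    db = client[db_name]
    with client.start_session() as session:
        with session.start_transaction():
            # "other_coll", "queue_coll" and "touched" are hypothetical names.
            db["other_coll"].update_one(
                {"_id": other_id}, {"$set": {"touched": True}}, session=session
            )
            db["queue_coll"].insert_one(new_doc, session=session)


def consume(coll: Any, process: Callable[[dict], None],
            batch_size: int = BATCH_SIZE) -> int:
    """Steps 2-4: read a batch, process it, delete it by _id.
    Returns the number of documents the server reports as deleted."""
    batch = list(coll.find({}).limit(batch_size))
    if not batch:
        return 0
    for doc in batch:
        process(doc)
    ids = [doc["_id"] for doc in batch]
    # This is the call that sometimes never returns on the affected shard.
    result = coll.delete_many({"_id": {"$in": ids}})
    return result.deleted_count
```

Comparing the returned `deleted_count` (when the call does return) against the batch size is one way to confirm the partial deletes described above.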
So, to recap, we have the following problems:
- We're running on only two members of the replica set (if we turn on the third member, connecting to the replica set becomes very slow)
- Some delete operations in the collection described above freeze indefinitely
We found nothing in the logs that points to the root cause.
I'm attaching:
- an example of a frozen delete operation from db.currentOp()
- diagnostic.data
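In case it helps with triage, here is a minimal sketch of how long-running deletes like the attached one could be listed. It assumes `client` is a pymongo `MongoClient` with privileges to run `$currentOp` on the `admin` database; the 60-second threshold and the namespace argument are arbitrary, hypothetical values, not something taken from our setup.

```python
from typing import Any, List, Optional


def stuck_delete_pipeline(min_secs: int = 60,
                          ns: Optional[str] = None) -> List[dict]:
    """Build an aggregation pipeline for $currentOp that keeps only active
    delete ("remove") operations running for at least min_secs seconds."""
    match: dict = {
        "active": True,
        "op": "remove",
        "secs_running": {"$gte": min_secs},
    }
    if ns is not None:
        match["ns"] = ns  # e.g. "mydb.queue_coll" (hypothetical namespace)
    return [{"$currentOp": {"allUsers": True}}, {"$match": match}]


def find_stuck_deletes(client: Any, min_secs: int = 60,
                       ns: Optional[str] = None) -> List[dict]:
    """Run the pipeline against the admin database and return matching ops."""
    return list(client.admin.aggregate(stuck_delete_pipeline(min_secs, ns)))
```

Running this directly against the affected shard's primary rather than through mongos should show whether the operation is stuck on that shard itself.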
Our cluster configuration:
- sharded cluster with 10 shards
- three replica set members in each shard
- about 600 GB of data in storage size per shard