Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 5.0.14
Component/s: None
Labels:
None
Environment:
Ubuntu 18.04.6 LTS
XFS
Kernel - 5.4.0-1088-aws #96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Transparent Huge Pages disabled
AWS m6i.4xlarge
SSD GP3 450 Gb

Operating System:
ALL
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Hello,

A couple of days ago, one of the secondaries on one of our shards started consuming excessive memory (this is not directly related to the problem itself; it was only the trigger).
We had to reboot that instance, which caused its load to move to another secondary. That secondary then froze due to high memory consumption, and we rebooted it as well.
We fixed the cause of the high memory consumption, and the instances started up again.

After they started, connections to the replica set were very slow. When I tried to connect to the replica set from Studio 3T, the connection hung at the authentication step for about a minute before it succeeded. Normally a connection is established within a second.

We tried restarting the mongod processes, but that didn't help. Then we shut down one of the secondaries (a blind guess) and restarted the processes again, and the shard went (almost) back to normal. So we are now running on only two replica set members.

Since this incident, some delete operations freeze in a sharded collection (where we perform cross-shard transactions), but only on that shard.

Here's the full lifecycle of documents in that collection:

we open a transaction in which we update a document in another sharded collection (no problems with that one) and insert a document into the collection in question
in another process, we read a batch (5000 items) of documents from the collection in question
we process that batch
we delete the batch by _id using deleteMany
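
A minimal sketch of the lifecycle above, in Python. It assumes PyMongo-style objects (a `MongoClient` and `Collection` instances passed in by the caller); the function names `produce` and `consume_batch`, the `handle` callback, and all parameter names are hypothetical and only illustrate the steps, not the actual application code.

```python
def produce(client, other_coll, queue_coll, other_filter, other_update, queue_doc):
    """Step 1: update a document in another collection and insert into the
    collection in question, both inside one transaction.

    Assumes `client` is a pymongo MongoClient and the collections are
    pymongo Collection objects (hypothetical names).
    """
    with client.start_session() as session:
        # The transaction commits when the block exits normally
        # and aborts if an exception is raised.
        with session.start_transaction():
            other_coll.update_one(other_filter, other_update, session=session)
            queue_coll.insert_one(queue_doc, session=session)


def consume_batch(queue_coll, handle, batch_size=5000):
    """Steps 2-4: read a batch, process it, then delete it by _id.

    Returns the number of documents the server reports as deleted.
    """
    batch = list(queue_coll.find({}).limit(batch_size))
    if not batch:
        return 0

    for doc in batch:
        handle(doc)  # application-specific processing (hypothetical callback)

    ids = [doc["_id"] for doc in batch]
    # Equivalent of deleteMany({_id: {$in: ids}}) in the shell.
    result = queue_coll.delete_many({"_id": {"$in": ids}})
    return result.deleted_count
```

Comparing the returned `deleted_count` with the batch size is one way to spot a delete that removed only part of a batch.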

Sometimes the delete operation freezes on the server side, but only on the shard where we had the incident.

We noticed that even when the operation freezes, some documents still get deleted.

So, to recap, we have the following problems:

We're running on only two members of the replica set (if we turn on the third member, connecting to the replica set becomes very slow)
Some delete operations in the aforementioned collection freeze indefinitely

We found nothing in the logs that points to the root cause.

I'm attaching:

an example of a frozen delete operation from db.currentOp()
diagnostic.data
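
For anyone triaging similar output, a small helper like the one below can pick long-running operations out of a `currentOp` result. This is a sketch under assumptions: it expects the `inprog` array already loaded as a list of dicts and relies only on the `active`, `ns`, and `secs_running` fields; the function name `long_running_ops` and the threshold are hypothetical.

```python
def long_running_ops(inprog, namespace, min_seconds=60):
    """Return active operations on `namespace` that have been running
    for at least `min_seconds`.

    `inprog` is assumed to be the `inprog` array of a currentOp result,
    parsed into a list of dicts. Entries without `secs_running` are skipped.
    """
    return [
        op
        for op in inprog
        if op.get("active")
        and op.get("ns") == namespace
        and op.get("secs_running", 0) >= min_seconds
    ]
```

The `opid` of each returned entry can then be examined further or passed to `db.killOp()` if the operation has to be terminated.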

Our cluster configuration:

sharded cluster with 10 shards
three replica set members in each shard
about 600 GB of data (storage size) per shard


current_op_inactive_transaction.json
6 kB
Mar 06 2025 11:22:01 PM UTC
currentOp_frozen_operation_example.json
3 kB
Mar 05 2025 02:36:59 PM UTC
diagnostic.data.tar.gz
54.06 MB
Mar 05 2025 02:37:15 PM UTC

Assignee: Chris Kelly
Reporter: Vladimir Belyakov
Participants: Chris Kelly, Vladimir Belyakov
Votes: 0
Watchers: 5

Created: Mar 05 2025 02:38:30 PM UTC
Updated: Mar 06 2025 11:45:02 PM UTC
Resolved: Mar 06 2025 11:45:02 PM UTC
