[SERVER-70568] Latency spikes without any additional load Created: 14/Oct/22  Updated: 31/Oct/22  Resolved: 31/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sergey Zagursky Assignee: Yuan Fang
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File metrics.2022-10-13T08-12-17Z-00000    
Operating System: ALL
Participants:

 Description   

We run a MongoDB 4.4.10 sharded cluster with 10 shards, each in a PSS topology. Two or three times a day we hit a huge latency spike that cannot be explained by the queries the cluster is serving at the time. During a spike, CPU load on the affected shard primary plummets to near zero, and we observe no increase in IO load either. Everything running on this MongoDB instance slows down. Diagnostic tools such as `atop` and `telegraf` also suffer from the latency increase, and `atop` loses a segment of time during the spike.

We run our cluster on AWS EC2 instances, but the CloudWatch instance health metrics show nothing unusual. We recently doubled the instance size across the cluster, from r5d.4xlarge to r5d.8xlarge, but the problem persists and its frequency has not decreased. Our other MongoDB clusters are performing well.

I attached an example FTDC log covering one latency spike. The spike occurred at 09:03-09:12 UTC; you can locate it by the spike in connection count.



 Comments   
Comment by Yuan Fang [ 31/Oct/22 ]

Hi sz,

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Yuan

Comment by Yuan Fang [ 17/Oct/22 ]

Hi sz,

Thank you for reporting this issue. After discussing with the Server Triage team, I don't think this issue is SERVER-31417, because we did not observe tcmalloc de-committing significant amounts of memory during the latency spikes. Unfortunately, I haven't been able to spot a clear cause or find any performance bottlenecks, but it would be worth checking the secondary nodes and the mongod.log. Additionally, it's not clear from the data so far that tcmallocReleaseRate=0 would help, and we do not typically recommend setting tcmallocReleaseRate=0.

With all that being said, we still need additional information to diagnose the problem.

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node (including both the primary and secondary nodes) in the replica set, would you please archive (tar or zip) and upload to that link:

  • Full mongod.log and diagnostic.data covering the latency spikes, with tcmallocReleaseRate=1 (the default).
  • Full mongod.log and diagnostic.data for a period when you believe the latency spikes should have occurred but did not, with tcmallocReleaseRate=0.
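
The archiving step requested above could be scripted along these lines. This is a minimal sketch, not an official MongoDB tool: the function name `archive_node_diagnostics` and the paths in the usage comment are assumptions, so substitute the log path and dbPath from your own configuration and run it once per node.

```python
import tarfile
from pathlib import Path


def archive_node_diagnostics(log_file, diagnostic_dir, out_path):
    """Bundle one node's mongod.log and diagnostic.data into a .tar.gz.

    All three arguments are paths supplied by the caller; nothing here
    discovers them from the mongod configuration.
    """
    log_file = Path(log_file)
    diagnostic_dir = Path(diagnostic_dir)
    out_path = Path(out_path)

    with tarfile.open(out_path, "w:gz") as tar:
        # Store entries under short relative names so the archive does not
        # carry the host's absolute directory layout.
        tar.add(log_file, arcname=log_file.name)
        # Adding a directory is recursive by default.
        tar.add(diagnostic_dir, arcname=diagnostic_dir.name)
    return out_path


# Usage (hypothetical paths and archive name):
#   archive_node_diagnostics(
#       "/var/log/mongodb/mongod.log",
#       "/var/lib/mongodb/diagnostic.data",
#       "shard01-primary-diagnostics.tar.gz",
#   )
```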

Regards,
Yuan

Comment by Sergey Zagursky [ 17/Oct/22 ]

As of now, after 48 hours, I don't see any problems with `tcmallocReleaseRate` set to 0. What long-term problems could this setting cause?

Comment by Sergey Zagursky [ 17/Oct/22 ]

There is a relevant issue: https://jira.mongodb.org/browse/SERVER-31417

Comment by Sergey Zagursky [ 17/Oct/22 ]

I have additional input on our problem. It seems that setting `tcmallocReleaseRate` to 0 effectively stops our latency spikes from recurring. I therefore suspect we have a pathological workload that makes tcmalloc behave this way. What could it be?
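
For reference, the runtime change described in this comment can be sketched with PyMongo's generic `command` helper. `setParameter` and `getParameter` are standard server commands; the helper name `set_tcmalloc_release_rate` and the host name in the usage comment are assumptions. A runtime `setParameter` affects only the mongod you connect to and does not survive a restart unless the parameter is also set at startup.

```python
def set_tcmalloc_release_rate(client, rate):
    """Set tcmallocReleaseRate on the mongod behind `client`.

    `client` is expected to be a pymongo.MongoClient connected directly to
    one mongod (not through mongos). Returns the value read back from the
    server after the change.
    """
    admin = client.admin
    # setParameter changes the value on the running process only.
    admin.command({"setParameter": 1, "tcmallocReleaseRate": float(rate)})
    # Read the parameter back to confirm the server accepted it.
    reply = admin.command({"getParameter": 1, "tcmallocReleaseRate": 1})
    return reply["tcmallocReleaseRate"]


# Usage (hypothetical host name):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://shard01-primary.example.net:27017")
#   set_tcmalloc_release_rate(client, 0)
```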
