[SERVER-70568] Latency spikes without any additional load Created: 14/Oct/22 Updated: 31/Oct/22 Resolved: 31/Oct/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Sergey Zagursky | Assignee: | Yuan Fang |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
We have a MongoDB 4.4.10 sharded cluster with 10 shards in a PSS topology. From time to time (2 or 3 times a day) we encounter a huge latency spike that can't be explained by the queries our cluster is serving at the time. At that moment, CPU load on the affected shard's primary plummets to near zero. We observe no increase in IO load either. Everything running on this MongoDB instance starts to perform slowly. Diagnostic tools such as `atop` and `telegraf` also suffer from increased latency, and `atop` loses a segment of time during the spike. We run our cluster on AWS EC2 instances, but the CloudWatch instance health metrics show nothing unusual. We recently doubled the instance size in our cluster from r5d.4xlarge to r5d.8xlarge, but the problem persists and its frequency hasn't decreased. We have other MongoDB clusters, and they are performing well. I attached an example FTDC log covering a latency spike. The exact time is 09:03-09:12 UTC; you can find it by the connection count spike. |
| Comments |
| Comment by Yuan Fang [ 31/Oct/22 ] |
|
Hi sz, We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards, |
| Comment by Yuan Fang [ 17/Oct/22 ] |
|
Hi sz, Thank you for reporting this issue. After discussing with the Server Triage team, I don't think this issue is SERVER-31417, because no significant amount of tcmalloc memory de-committing was observed during the latency spikes. Unfortunately, I haven't been able to spot a clear cause and didn't find any performance bottlenecks, but it would be worth checking the secondary node and the mongodb.log. Additionally, it's not clear from the data so far that tcmallocReleaseRate=0 would help, and we do not typically recommend setting tcmallocReleaseRate=0. With all that being said, we still need additional information to diagnose the problem. I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time. For each node (including both the primary and secondary nodes) in the replica set, would you please archive (tar or zip) and upload to that link:
Regards, |
| Comment by Sergey Zagursky [ 17/Oct/22 ] |
|
As of now, after 48 hours, I don't see any problems with `tcmallocReleaseRate` set to 0. What long-term problems could this setting cause? |
| Comment by Sergey Zagursky [ 17/Oct/22 ] |
|
There is a relevant issue: https://jira.mongodb.org/browse/SERVER-31417 |
| Comment by Sergey Zagursky [ 17/Oct/22 ] |
|
I have additional input on our problem. It seems that setting `tcmallocReleaseRate` to 0 effectively stops our latency spikes from happening again. Therefore I suspect we have a pathological workload that makes tcmalloc behave this way. What could it be? |
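
For readers who want to reproduce the workaround discussed in this thread, the following is a minimal sketch of how the `tcmallocReleaseRate` server parameter might be changed at runtime through the `setParameter` admin command. The helper names (`build_release_rate_command`, `apply_release_rate`) and the non-negative check are assumptions made for illustration; they are not part of MongoDB or of this ticket. Note that the assignee above does not typically recommend a value of 0, so treat this as a diagnostic experiment rather than a fix.

```python
from typing import Any, Callable, Dict, Mapping


def build_release_rate_command(rate: float) -> Dict[str, Any]:
    """Build a setParameter command document for tcmallocReleaseRate.

    Hypothetical helper. The non-negative check is an assumption of this
    sketch; 0 means tcmalloc never returns freed memory to the OS.
    """
    if rate < 0:
        raise ValueError("tcmallocReleaseRate must be non-negative")
    # The command name must be the first key of the command document;
    # Python dicts preserve insertion order.
    return {"setParameter": 1, "tcmallocReleaseRate": float(rate)}


def apply_release_rate(
    run_admin_command: Callable[[Mapping[str, Any]], Mapping[str, Any]],
    rate: float,
) -> Mapping[str, Any]:
    """Send the command through a caller-supplied function.

    `run_admin_command` is a hypothetical hook: any callable that runs a
    command against the `admin` database and returns the server reply.
    """
    return run_admin_command(build_release_rate_command(rate))


if __name__ == "__main__":
    # Stand-in runner so the sketch executes without a live mongod.
    def fake_runner(cmd: Mapping[str, Any]) -> Mapping[str, Any]:
        print("would send:", dict(cmd))
        return {"ok": 1.0}

    apply_release_rate(fake_runner, 0)
```

With PyMongo, `client.admin.command` could be passed as `run_admin_command`, where `client` is a connection made directly to the shard member being tuned. A value set this way is not persisted across restarts, so it would also need to go into the startup configuration if the experiment is meant to survive one.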