[SERVER-34610] One slow query should not block the whole replica set Created: 23/Apr/18  Updated: 24/Apr/18  Resolved: 24/Apr/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Emilien Kenler [X] Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2018-04-23 at 13.52.45.png     Zip Archive diagnostic.data.zip     Zip Archive mongodb_logs.zip     Zip Archive server_logs.zip    
Operating System: ALL
Participants:

 Description   

Last Friday, we encountered an issue around 07:20 UTC (16:20 JST).
The CPU usage of one of the member of our replica set jumped to 100%, and the queries were not evenly distributed between our servers, causing our application to be unavailable.

We would like to have assistance from the MongoDB team to help:

  • finding the root cause of the CPU spike;
  • finding why queries are not evenly distributed between the secondaries;
  • preventing this situation to happen again.

At the time of incident, the primary server was mongodb001-0305.prd.jp1.aws.kkvesper.net, the affected secondary was mongodb001-005a.prd.jp1.aws.kkvesper.net and mongodb001-0c50.prd.jp1.aws.kkvesper.net was an unaffected secondary.

We are using MongoDB 3.6.2 with mongoid 3.1.6. The replica set has 3 members connected to the application and one hidden secondary used for backups.
I'm attaching the logs of all servers and the content of the diagnostic.data folder for the time around the incident.

We believe the high CPU was caused by a query using the wrong index (see mongodb_parsed_005a_2018042007.log at 2018-04-20T07:18:19.692Z).
The query used a compound index instead of the `_id` index, causing the query to examine over 4M keys.

The CPU usage started to increase soon after the query started, and went down only when we manually restarted the MongoDB server. While the query took 5 minutes, we can see that the high CPU usage lasted more than that.

Using mlogvis, it seems a lot of queries were still running when the server has been restarted and that most of the slow queries are those ones.

With mloginfo, I can see there are more queries run on the affected secondaries than on the other servers. Is it an issue with mongoid or the server? I would expect all secondaries to receive the same amount of queries.

Please provide guidance to analyse this issue and prevent it to happen again.

Thank you.



 Comments   
Comment by Kelsey Schubert [ 24/Apr/18 ]

Hi MiLk

Thanks for your report. Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-users group.

Kind regards,
Kelsey

Generated at Thu Feb 08 04:37:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.