[SERVER-39658] Sudden CPU spike in secondary instances with no apparent cause Created: 19/Feb/19  Updated: 06/Dec/22  Resolved: 20/Feb/19

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Santiago Ciciliani Assignee: Backlog - Triage Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2019-02-19-04-44-10-051.png     PNG File image-2019-02-19-04-45-04-009.png     PNG File image-2019-02-19-04-52-58-249.png     JPEG File mongodb-24-query-compare.JPG    
Assigned Teams:
Server Triage
Participants:

 Description   

Out of nowhere, CPU usage on the two secondary instances of a three-node cluster, which normally run with very little CPU load, spiked to 100% and remained high.

Version: 2.4.14

Things we did so far:

  • Rebuilt all three databases from replicas.
  • Ran reIndex().

Things we discovered:

Now that the load is higher and the average response time is more than 100 ms higher than usual, we see that some queries that should be using an index are not using one:

Tue Feb 19 12:47:27.946 [conn11958] query locator.pages query: { query: { _depth: { $gte: 1, $lte: 1 }, _site: "sample.com" }, orderby: { _id: 1 } } ntoreturn:0 ntoskip:0 nscanned:2106803 keyUpdates:0 numYields: 11 locks(micros) r:5514254 nreturned:30 reslen:130866 2951ms
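As a rough illustration of why this log line points at a poor plan, the sketch below (not part of the original report) pulls nscanned, nreturned and the duration out of the line and computes how many entries were scanned per document returned. The helper name scan_stats and the regular expressions are assumptions made for this example; they are only meant to match the 2.4-style log line quoted above, not every log format.

```python
import re

# The slow-query log line quoted in the description, joined onto one line.
LOG_LINE = (
    'Tue Feb 19 12:47:27.946 [conn11958] query locator.pages query: '
    '{ query: { _depth: { $gte: 1, $lte: 1 }, _site: "sample.com" }, '
    'orderby: { _id: 1 } } ntoreturn:0 ntoskip:0 nscanned:2106803 '
    'keyUpdates:0 numYields: 11 locks(micros) r:5514254 nreturned:30 '
    'reslen:130866 2951ms'
)


def scan_stats(line):
    """Hypothetical helper: extract scan counters from a 2.4-style log line."""
    nscanned = int(re.search(r'\bnscanned:\s*(\d+)', line).group(1))
    nreturned = int(re.search(r'\bnreturned:\s*(\d+)', line).group(1))
    millis = int(re.search(r'(\d+)ms\s*$', line).group(1))
    return {
        'nscanned': nscanned,
        'nreturned': nreturned,
        'millis': millis,
        # Entries scanned per document returned; guard against nreturned == 0.
        'scan_ratio': nscanned / max(nreturned, 1),
    }


stats = scan_stats(LOG_LINE)
print(stats)
```

For the line above, more than two million entries were scanned to return 30 documents. A ratio this far above 1 is consistent with the query not using a selective index on _site and _depth.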

 

Screenshots

 

Secondary Node 1 (the other secondary follows the same pattern)

Primary Node (mostly idle except for nightly batch loads)



 Comments   
Comment by Santiago Ciciliani [ 20/Feb/19 ]

Hi Eric, thanks for your response. I am aware that MongoDB 2.4 is EoL and we are in the process of analyzing upgrade options.

In the meantime, do you know if there is a workaround we could apply to correct the plan choice and get the avg load back to normal?

I'm puzzled by the fact that the server restored from the snapshot is choosing the right plan considering that the snapshot was taken after the issue started happening.

Thanks

Comment by Eric Sedor [ 20/Feb/19 ]

Hi sctrilogy,

We believe you've found the likely reason for the CPU use, which is a common symptom of a sudden poor plan choice. Because of how the MongoDB query planner works, it is possible for index choice to change. In addition, several bugs involving poor plan choice have been corrected since MongoDB 2.4. A recent major improvement was SERVER-20139 in MongoDB 3.0.

Unfortunately MongoDB 2.4 reached end of life in March of 2016 and the SERVER project is for bugs or feature suggestions for supported versions of the MongoDB server.

For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group.

Thank you,
Eric

Comment by Santiago Ciciliani [ 19/Feb/19 ]

Further update.

We cloned the database from AWS AMI images and ran the same query through the profiler. It turns out that the newly restored cluster is using the index, while current prod is doing a sequential scan.

Any hint on how to fix this?

 
