[SERVER-71246] Excessive CPU consumption on a newly upgraded 5.0 instance Created: 10/Nov/22  Updated: 11/Nov/22  Resolved: 10/Nov/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vladimir Beliakov Assignee: Yuan Fang
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 18.04.6 LTS
XFS
Kernel - 5.4.0-1088-aws #96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Transparent Huge Pages (THP) disabled
AWS m5.2xlarge
SSD gp3 450 GB


Attachments: PNG File Screen Shot 2022-11-10 at 4.46.12 PM.png     Zip Archive diagnostics.data.zip    
Operating System: ALL
Participants:

 Description   

We're in the process of upgrading our MongoDB cluster from 4.4.15 to 5.0.10.
According to the upgrade documentation, we stopped the balancer and then upgraded the config replica set.
Then we started upgrading the shards. After upgrading 3 out of 4 replicas, we were ready to move the primaries onto the already upgraded MongoDB instances.
Everything was OK for 9 of the 10 shards.
However, for the biggest shard (it contains most of the unsharded data) we ran into excessive CPU consumption. We tried restarting the mongod process, but to no avail.
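
For reference, the balancer and primary switch-over steps looked roughly like this (a mongosh sketch, not an exact transcript; the first part runs against mongos, the second on the shard's current primary):

// Run against mongos before touching any shard, per the upgrade docs.
sh.stopBalancer()
sh.getBalancerState()   // expect false once the balancer is stopped

// Run on the shard's current 4.4 primary once 3 of its 4 members are on 5.0,
// so that an already upgraded member gets elected primary.
rs.stepDown()
rs.status().members.filter(m => m.stateStr === "PRIMARY")   // confirm which member took over
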
Going over the logs didn't give us much insight regarding why the process was consuming so much CPU.
There were many "Slow query" log records, which makes sense. There were also many "WriteConflict" log records like this one:

{"t":

{"$date":"2022-11-07T09:08:24.320+00:00"}

,"s":"W", "c":"COMMAND", "id":23802, "ctx":"conn3798","msg":"Plan executor error during findAndModify","attr":{"error":

{"code":112,"codeName":"WriteConflict","errmsg":"WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction."}

,"stats":{"stage":"UPDATE","nReturned":0,"executionTimeMillisEstimate":1,"works":14,"advanced":0,"needTime":12,"needYield":1,"saveState":1,"restoreState":1,"failed":true,"isEOF":1,"nMatched":0,"nWouldModify":0,"nWouldUpsert":0,"inputStage":{"stage":"LIMIT","nReturned":1,"executionTimeMillisEstimate":1,"works":13,"advanced":1,"needTime":12,"needYield":0,"saveState":2,"restoreState":1,"isEOF":1,"limitAmount":1,"inputStage":{"stage":"FETCH","nReturned":1,"executionTimeMillisEstimate":1,"works":13,"advanced":1,"needTime":12,"needYield":0,"saveState":2,"restoreState":1,"isEOF":0,"docsExamined":1,"alreadyHasObj":0,"inputStage":{"stage":"SORT_MERGE","nReturned":1,"executionTimeMillisEstimate":1,"works":13,"advanced":1,"needTime":12,"needYield":0,"saveState":2,"restoreState":1,"isEOF":0,"sortPattern":{"

{REDACTED}":1,"{REDACTED}

":1},"dupsTested":1,"dupsDropped":0,"inputStages":[{"stage":"IXSCAN","nReturned":1,"executionTimeMillisEstimate":0,"works":1,"advanced":1,"needTime":0,"needYield":0,"saveState":2,"restoreState":1,"isEOF":0,"keyPattern":
{"REDACTED"}

,"indexName":"
{REDACTED}

","isMultiKey":false,"multiKeyPaths":{"

{REDACTED}

":[],"
{REDACTED}

":[],"{REDACTED}

":[]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"

{REDACTED}":["[0, 0]"],"{REDACTED}

":["[MinKey, MaxKey]"],"
{REDACTED}

":["[new Date(-9223372036854775808), new Date(1667812104306))"]},"keysExamined":1,"seeks":1,"dupsTested":0,"dupsDropped":0},{"stage":"IXSCAN","nReturned":0,"executionTimeMillisEstimate":1,"works":11,"advanced":0,"needTime":10,"needYield":0,"saveState":2,"restoreState":1,"isEOF":1,"keyPattern":{"

{REDACTED}

":1,"
{REDACTED}

":1,"{REDACTED}

":1},"indexName":"

{REDACTED}","isMultiKey":false,"multiKeyPaths":{"{REDACTED}

":[],"

{REDACTED}":[],"{REDACTED}

":[]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"
{REDACTED}

":["[1, 1]"],"{REDACTED}

":["[MinKey, MaxKey]"],"

{REDACTED}":["[new Date(-9223372036854775808), new Date(1667812104306))"]},"keysExamined":11,"seeks":11,"dupsTested":0,"dupsDropped":0}]}}}},"cmd":{"findAndModify":"{REDACTED}

","query":{"

{REDACTED}":{"$in":[0,1]},"{REDACTED}

":{"$lt":
{"$date":"2022-11-07T09:08:24.306Z"}

}},"sort":{"
{REDACTED}

":1,"{REDACTED}

":1},"update":{"$set":{"

{REDACTED}":1,"{REDACTED}

":
{"$date":"2022-11-07T09:19:54.305Z"}

,"
{REDACTED}

":{"$binary":{"base64":"hW7CcBc5Uki6JVK/hvU5ew==","subType":"3"}},"{REDACTED}

":{"$date":"2022-11-07T09:08:24.305Z"}},"$unset":{"

{REDACTED}":1},"$inc":{"{REDACTED}

":1}},"runtimeConstants":{"localNow":

{"$date":"2022-11-07T09:08:24.305Z"}

,"clusterTime":{"$timestamp":

{"t":1667812104,"i":2118}

}},"writeConcern":{"w":1,"wtimeout":0,"provenance":"implicitDefault"}}}}

 

We urgently need help, because this is a showstopper for our MongoDB cluster optimization.

Please let me know if any additional information is required.



 Comments   
Comment by Vladimir Beliakov [ 11/Nov/22 ]

Hi Yuan Fang,

Thank you for looking into this!
The spike of updates you mentioned is probably related to the instance becoming primary. Since that instance belongs to the biggest shard, which holds a lot of data and usually receives a lot of writes, I assume the update spike comes from that.
Surprisingly, we don't see the same behavior on the other 9 shards. They usually receive fewer writes, but we observed no increase in CPU consumption on them.

Comment by Yuan Fang [ 10/Nov/22 ]

Hi vladimirred456@gmail.com,

Thank you for your report. With regard to your question:

Going over the logs didn't give us much insight regarding why the process was consuming so much CPU.

Looking at the diagnostic data you provided, from point A (the point of the upgrade to 5.0) through point B I can see spikes in updates, which appear correlated with spikes in write conflicts. The write conflicts and updates dropped to 0 after point B, and the CPU utilization went down along with them. Therefore, the high CPU utilization after the upgrade is likely caused by the increase in update operations.

As for what causes the write conflicts, I don't have a clear answer; usually they are due to multiple clients trying to update the same document at the same time. I would check the oplog for more information if the write conflicts occur again.
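
For example, a rough way to look at the most recent update entries for a suspect namespace on the shard primary would be something like this (the namespace and time window are placeholders):

// Show the 20 most recent update entries for "mydb.mycoll" over the last 5 minutes.
db.getSiblingDB("local").oplog.rs.find(
  { op: "u", ns: "mydb.mycoll", wall: { $gte: new Date(Date.now() - 5 * 60 * 1000) } }
).sort({ $natural: -1 }).limit(20)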

For this issue, we'd like to encourage you to start by asking our community for help by posting on the MongoDB Developer Community Forums.

If the discussion there leads you to suspect a bug in the MongoDB server, then we'd want to investigate it as a possible bug here in the SERVER project.
 
Regards,
Yuan

Comment by Vladimir Beliakov [ 10/Nov/22 ]

Sorry, I forgot to mention that the incident (when we switched the primary to the 5.0-upgraded mongod instance) happened from 2022-11-10 09:03:00 to 2022-11-10 09:07:00.
