Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
Environment: Ubuntu 18.04.6 LTS
    XFS
    Kernel: 5.4.0-1088-aws #96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
    Transparent Huge Pages disabled
    AWS m5.2xlarge
    SSD GP3 450 GB
Operating System: ALL
We're in the process of upgrading our MongoDB cluster from 4.4.15 to 5.0.10.
According to the upgrade documentation, we stopped the balancer and then upgraded the config server replica set.
Then we started upgrading the shards. After upgrading 3 out of 4 replicas, we were ready to move the primary onto the already upgraded MongoDB instances.
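For reference, those steps looked roughly like this (a minimal mongosh sketch; the step-down timeout and the version check are illustrative, not the exact commands we ran):

// Against mongos: stop the balancer and confirm it is off before touching the shards.
sh.stopBalancer()
sh.getBalancerState()   // expected: false

// On each shard's primary (still on 4.4.15 at this point), step down so that
// an already upgraded 5.0.10 member is elected primary.
rs.stepDown(120)        // seconds the stepped-down member stays ineligible for election

// Confirm the binary version of the member we are connected to.
db.version()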
Everything was OK for 9 of the 10 shards.
However, for the biggest shard (it contains most of the unsharded data) we ran into excessive CPU consumption. We tried restarting the mongod process, but to no avail.
Going over the logs didn't give us much insight regarding why the process was consuming so much CPU.
There were many "Slow query" log records, which makes sense. There were also many "WriteConflict" log records like this one:
{"t":
{"$date":"2022-11-07T09:08:24.320+00:00"},"s":"W", "c":"COMMAND", "id":23802, "ctx":"conn3798","msg":"Plan executor error during findAndModify","attr":{"error":
{"code":112,"codeName":"WriteConflict","errmsg":"WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction."},"stats":{"stage":"UPDATE","nReturned":0,"executionTimeMillisEstimate":1,"works":14,"advanced":0,"needTime":12,"needYield":1,"saveState":1,"restoreState":1,"failed":true,"isEOF":1,"nMatched":0,"nWouldModify":0,"nWouldUpsert":0,"inputStage":{"stage":"LIMIT","nReturned":1,"executionTimeMillisEstimate":1,"works":13,"advanced":1,"needTime":12,"needYield":0,"saveState":2,"restoreState":1,"isEOF":1,"limitAmount":1,"inputStage":{"stage":"FETCH","nReturned":1,"executionTimeMillisEstimate":1,"works":13,"advanced":1,"needTime":12,"needYield":0,"saveState":2,"restoreState":1,"isEOF":0,"docsExamined":1,"alreadyHasObj":0,"inputStage":{"stage":"SORT_MERGE","nReturned":1,"executionTimeMillisEstimate":1,"works":13,"advanced":1,"needTime":12,"needYield":0,"saveState":2,"restoreState":1,"isEOF":0,"sortPattern":{"
{REDACTED}":1,"{REDACTED}":1},"dupsTested":1,"dupsDropped":0,"inputStages":[{"stage":"IXSCAN","nReturned":1,"executionTimeMillisEstimate":0,"works":1,"advanced":1,"needTime":0,"needYield":0,"saveState":2,"restoreState":1,"isEOF":0,"keyPattern":
{"REDACTED"}
,"indexName":"
{REDACTED}
","isMultiKey":false,"multiKeyPaths":{"
{REDACTED}":[],"
{REDACTED}
":[],"{REDACTED}
":[]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"
{REDACTED}":["[0, 0]"],"{REDACTED}":["[MinKey, MaxKey]"],"
{REDACTED}
":["[new Date(-9223372036854775808), new Date(1667812104306))"]},"keysExamined":1,"seeks":1,"dupsTested":0,"dupsDropped":0},{"stage":"IXSCAN","nReturned":0,"executionTimeMillisEstimate":1,"works":11,"advanced":0,"needTime":10,"needYield":0,"saveState":2,"restoreState":1,"isEOF":1,"keyPattern":{"
{REDACTED}":1,"
{REDACTED}
":1,"{REDACTED}
":1},"indexName":"
{REDACTED}","isMultiKey":false,"multiKeyPaths":{"{REDACTED}":[],"
{REDACTED}":[],"{REDACTED}":[]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"
{REDACTED}
":["[1, 1]"],"{REDACTED}
":["[MinKey, MaxKey]"],"
{REDACTED}":["[new Date(-9223372036854775808), new Date(1667812104306))"]},"keysExamined":11,"seeks":11,"dupsTested":0,"dupsDropped":0}]}}}},"cmd":{"findAndModify":"{REDACTED}","query":{"
{REDACTED}":{"$in":[0,1]},"{REDACTED}":{"$lt":
{"$date":"2022-11-07T09:08:24.306Z"}
}},"sort":{"
{REDACTED}
":1,"{REDACTED}
":1},"update":{"$set":{"
{REDACTED}":1,"{REDACTED}":
{"$date":"2022-11-07T09:19:54.305Z"}
,"
{REDACTED}
":{"$binary":{"base64":"hW7CcBc5Uki6JVK/hvU5ew==","subType":"3"}},"{REDACTED}
":{"$date":"2022-11-07T09:08:24.305Z"}},"$unset":{"
{REDACTED}":1},"$inc":{"{REDACTED}":1}},"runtimeConstants":{"localNow":
{"$date":"2022-11-07T09:08:24.305Z"},"clusterTime":{"$timestamp":
{"t":1667812104,"i":2118}}},"writeConcern":{"w":1,"wtimeout":0,"provenance":"implicitDefault"}}}}
We urgently need help, because this is a showstopper for our MongoDB cluster optimization.
Please let me know if any additional information is required.
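For example, we can capture the in-flight operations on the affected shard while the CPU is pegged, along the lines of the following (a minimal mongosh sketch; the 1-second threshold is arbitrary):

// List active operations running longer than 1 second; the output should show
// whether the load is dominated by findAndModify calls retrying on write conflicts.
db.currentOp({ active: true, secs_running: { $gt: 1 } }).inprog.forEach(op => {
  print(op.opid, op.secs_running, op.op, op.ns)
})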