MongoDB Slow Update Queries: Intermittent 6-30 Second Latency Despite Optimal Query Plans

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor - P4
    • Affects Version/s: 4.4.13
    • Component/s: None
    • Operating System: ALL
    • Steps to Reproduce: i can't reproduce it

        1. Environment
      • MongoDB Version: 4.4.13
      • Storage Engine: WiredTiger
      • Topology: Sharded cluster

        2. Problem Statement

      We are experiencing intermittent slow UPDATE queries in production. Single-document UPDATE operations that use the proper index (IXSCAN { assetId: 1 }, keysExamined=1, docsExamined=1) occasionally take 6-30 seconds during specific time windows, while most updates complete in milliseconds.

        3. Key Evidence from Production Logs
          3.1 Slow Query Log (6.5-second UPDATE at 12:20:03)

      ```json
      {"t":{"$date":"[sanitized_timestamp]"},"s":"I","c":"WRITE","id":51803,"ctx":"[sanitized_connection_id]","msg":"Slow query",
       "attr":{"type":"update","ns":"[sanitized_database].[collection_x]_asset",
        "command":{"q":{"assetId":[sanitized_value]},
         "u":{"$set":{"ips":[
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value],"addr":"[sanitized_ip]"},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"lastSeen":[sanitized_value],"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value]},
          {"firstSeen":[sanitized_value],"lastSeen":[sanitized_value],"addr":"[sanitized_ip]","mac":"[sanitized_mac]"},
          {"firstSeen":[sanitized_value],"lastSeen":[sanitized_value],"addr":"[sanitized_ip]","mac":"[sanitized_mac]"},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]}],
         "belongHost":[sanitized_value],"branchId":[sanitized_value],"businessName":[],"assetId":[sanitized_value],
         "date":{"$date":"[sanitized_timestamp]"},
         "device":{"classify1Id":[sanitized_value],"classifyId":[sanitized_value],"magnitude":[sanitized_value],"name":"[sanitized_hostname]","hostname":"[sanitized_hostname]"},
         "branchName":"[sanitized_value]","businessId":[]}},
         "multi":false,"upsert":true},
        "planSummary":"IXSCAN { assetId: 1 }",
        "keysExamined":1,"docsExamined":1,"nMatched":1,"nModified":1,"keysInserted":1,"keysDeleted":1,"numYields":1,
        "queryHash":"66E90436","planCacheKey":"2DCAD8F4",
        "locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":4}},"ReplicationStateTransition":{"acquireCount":{"w":4}},"Global":{"acquireCount":{"r":1,"w":3}},"Database":{"acquireCount":{"r":1,"w":3}},"Collection":{"acquireCount":{"w":3}},"Mutex":{"acquireCount":{"r":5}}},
        "flowControl":{"acquireCount":2,"timeAcquiringMicros":1},"storage":{},"durationMillis":6558}}
      ```

      *Critical*: the "storage" section of this entry is completely empty; no storage-layer reads or waits were recorded for this 6.5-second update.
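This can be checked mechanically. A minimal sketch that reproduces only the two relevant fields from the log entry above (the rest of the entry is elided):

```python
import json

# The slow-query entry reports durationMillis=6558 but an empty storage
# section, i.e. no storage-layer activity was recorded for the update.
entry = json.loads('{"attr": {"storage": {}, "durationMillis": 6558}}')
attr = entry["attr"]

storage_is_empty = attr["storage"] == {}
print(f"durationMillis={attr['durationMillis']}, storage empty={storage_is_empty}")
```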

          3.2 Profile Data (UPDATE stage: 2485ms, storage timeReadingMicros: 76μs)

      ```javascript
      { "op" : "update", "ns" : "[sanitized_database].[collection_x]_asset",
        "command" : { "q" : { "assetId" : [sanitized_value] },
          "u" : { "$set" : {
            "device" : { "classify1Id" : [sanitized_value], "classifyId" : [sanitized_value], "magnitude" : [sanitized_value], "name" : "[sanitized_value]", "hostname" : "[sanitized_hostname]" },
            "branchName" : "[sanitized_value]", "businessId" : [ ], "belongHost" : [sanitized_value],
            "assetId" : [sanitized_value], "branchId" : [sanitized_value], "businessName" : [ ],
            "ips" : [ { "addr" : "[sanitized_ip]", "mac" : "[sanitized_mac]", "firstSeen" : [sanitized_value], "lastSeen" : [sanitized_value] },
                      { "lastSeen" : [sanitized_value], "addr" : "[sanitized_ip]", "mac" : "[sanitized_mac]", "firstSeen" : [sanitized_value] } ],
            "date" : ISODate("[sanitized_timestamp]") } },
          "multi" : false, "upsert" : true },
        "keysExamined" : 1, "docsExamined" : 1, "nMatched" : 1, "nModified" : 1,
        "keysInserted" : 1, "keysDeleted" : 1, "numYield" : 1,
        "queryHash" : "66E90436", "planCacheKey" : "2DCAD8F4",
        "locks" : { "ParallelBatchWriterMode" : { "acquireCount" : { "r" : NumberLong(4) } },
          "ReplicationStateTransition" : { "acquireCount" : { "w" : NumberLong(5) } },
          "Global" : { "acquireCount" : { "r" : NumberLong(2), "w" : NumberLong(3) } },
          "Database" : { "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(3) } },
          "Collection" : { "acquireCount" : { "w" : NumberLong(3) } },
          "Mutex" : { "acquireCount" : { "r" : NumberLong(5) } } },
        "flowControl" : { "acquireCount" : NumberLong(2) },
        "storage" : { "data" : { "bytesRead" : NumberLong(53186), "timeReadingMicros" : NumberLong(76) } },
        "millis" : 2484, "planSummary" : "IXSCAN { assetId: 1 }",
        "execStats" : { "stage" : "UPDATE", "nReturned" : 0, "executionTimeMillisEstimate" : 2485,
          "works" : 2, "advanced" : 0, "needTime" : 1, "needYield" : 0, "saveState" : 1, "restoreState" : 1,
          "isEOF" : 1, "nMatched" : 1, "nWouldModify" : 1, "wouldInsert" : false,
          "inputStage" : { "stage" : "FETCH", "nReturned" : 1, "executionTimeMillisEstimate" : 0,
            "works" : 1, "advanced" : 1, "needTime" : 0, "needYield" : 0, "saveState" : 2, "restoreState" : 2,
            "isEOF" : 0, "docsExamined" : 1, "alreadyHasObj" : 0,
            "inputStage" : { "stage" : "IXSCAN", "nReturned" : 1, "executionTimeMillisEstimate" : 0,
              "works" : 1, "advanced" : 1, "needTime" : 0, "needYield" : 0, "saveState" : 2, "restoreState" : 2,
              "isEOF" : 0, "keyPattern" : { "assetId" : 1 }, "indexName" : "assetId_1",
              "isMultiKey" : false, "multiKeyPaths" : { "assetId" : [ ] }, "isUnique" : true,
              "isSparse" : false, "isPartial" : false, "indexVersion" : 2, "direction" : "forward",
              "indexBounds" : { "assetId" : [ "[[sanitized_value].0, [sanitized_value].0]" ] },
              "keysExamined" : 1, "seeks" : 1, "dupsTested" : 0, "dupsDropped" : 0 } } },
        "ts" : ISODate("[sanitized_timestamp]"), "client" : "[sanitized_ip]",
        "allUsers" : [ { "user" : "_system", "db" : "local" } ], "user" : "_system@local" }
      ```

       

      *Critical Finding*: UPDATE stage takes 2485ms but storage timeReadingMicros is only 76 microseconds (0.076ms).
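The gap can be quantified with a short script. The figures below are copied verbatim from the profiler document above; everything else is elided:

```python
# Quantify how much of the profiled UPDATE's latency storage can explain.
profile = {
    "millis": 2484,  # total operation latency reported by the profiler
    "execStats": {"stage": "UPDATE", "executionTimeMillisEstimate": 2485},
    "storage": {"data": {"bytesRead": 53186, "timeReadingMicros": 76}},
}

storage_ms = profile["storage"]["data"]["timeReadingMicros"] / 1000.0  # 0.076 ms
total_ms = profile["millis"]
unaccounted_ms = total_ms - storage_ms
unaccounted_frac = unaccounted_ms / total_ms

print(f"storage read time: {storage_ms:.3f} ms")
print(f"time not explained by storage: {unaccounted_ms:.3f} ms ({unaccounted_frac:.2%})")
```

In other words, well over 99.9% of the operation's latency is spent somewhere above the storage engine.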

          3.3 mongostat Data (updates stop completely for 5-6 seconds, then burst)

      ```
      insert query update delete getmore command dirty  used flushes vsize   res  qrw  arw net_in net_out conn     set repl                time
          *0   227   4589     *0    4307  2609|0  3.2% 79.8%       0 42.2G 12.1G  0|2  1|1  17.9m   30.4m  117 shard-1  PRI Jan 20 12:19:52.405
          *0   238   4719     *0    4515  2640|0  3.0% 79.8%       0 42.2G 12.1G  0|0  2|2  17.4m   23.4m  117 shard-1  PRI Jan 20 12:19:53.403
          *0   251   3705     *0    4675  2455|0  3.1% 79.9%       0 42.2G 12.1G  0|0  1|0  12.6m   26.2m  115 shard-1  PRI Jan 20 12:19:54.400
          *0   200   3599     *0    4763  2783|0  3.2% 80.0%       0 42.2G 12.1G  0|2  1|0  12.8m   16.8m  114 shard-1  PRI Jan 20 12:19:55.402
          *0   227   3740     *0    4832  2860|0  3.3% 80.0%       0 42.2G 12.1G  0|0  1|0  12.2m   18.9m  111 shard-1  PRI Jan 20 12:19:56.403
          *0   227   2557     *0    2887  1659|0  1.7% 80.0%       1 42.2G 12.2G  0|0 1|15  8.77m   10.3m  113 shard-1  PRI Jan 20 12:19:57.400
          *0   279      2     *0       0    12|0  0.4% 80.0%       0 42.2G 12.2G  0|0 1|18   382k    993k  115 shard-1  PRI Jan 20 12:19:58.402
          *0   259     *0     *0      10    13|0  0.4% 80.0%       0 42.2G 12.2G  0|0 1|19   362k   2.87m  116 shard-1  PRI Jan 20 12:19:59.404
          *0   255     *0     *0       7     4|0  0.4% 80.0%       0 42.2G 12.2G  0|0 2|19   350k   2.89m  116 shard-1  PRI Jan 20 12:20:00.404
          *0   224      1     *0      13    10|0  0.4% 80.0%       0 42.2G 12.2G  0|0 1|20   317k   8.52m  117 shard-1  PRI Jan 20 12:20:01.403
          *0   260     *0     *0      16     6|0  0.4% 80.0%       0 42.2G 12.2G  0|0 1|20   363k   7.27m  117 shard-1  PRI Jan 20 12:20:02.401
          *0   223     *0     *0       4    26|0  0.4% 79.9%       0 42.2G 12.2G  0|0 1|20   308k   2.60m  117 shard-1  PRI Jan 20 12:20:03.404
          *0   211   5357     *0    2550  1207|0  1.0% 80.1%       0 42.2G 12.2G  0|0  2|8  14.6m   21.5m  118 shard-1  PRI Jan 20 12:20:04.403
      ```

      *Critical Findings*:

      • Update throughput dropped from ~3,700/sec to zero
      • Updates stopped completely for ~5 seconds (12:19:59-12:20:03)
      • Throughput then burst to 5,357/sec, roughly 30-45% above the pre-stall rate
      • Dirty cache stayed in the normal range (0.4-3.3%), so there is no sign of cache pressure
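The stall-then-burst pattern can be made precise with a small script over the `update` column transcribed from the mongostat excerpt above (the 1% stall threshold is our own choice, not a MongoDB metric):

```python
# Update ops/sec per second, 12:19:52 through 12:20:04
# (mongostat prints *0 for an exact zero).
updates = [4589, 4719, 3705, 3599, 3740, 2557, 2, 0, 0, 1, 0, 0, 5357]

baseline = sum(updates[:5]) / 5                     # pre-stall average
stalled = [u for u in updates if u < baseline * 0.01]  # seconds effectively at zero
burst_ratio = updates[-1] / baseline

print(f"pre-stall baseline: {baseline:.0f}/sec")
print(f"seconds effectively at zero: {len(stalled)}")
print(f"post-stall burst: {burst_ratio:.2f}x baseline")
```

The burst looks like queued operations draining all at once, which is consistent with updates being blocked rather than individually slow.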
        4. Analysis and Questions for MongoDB Team

      Based on the evidence above, we observed:

       

      *Key Observations:*
      1. The storage layer is extremely fast: timeReadingMicros = 76μs (0.076ms)
      2. MongoDB's upper layer takes far longer: the UPDATE stage alone accounts for 2485ms
      3. Updates do not merely slow down; they stop completely for 5-6 seconds and then burst
      4. The slow-query log's storage field is empty (no cache-wait indicators)
      5. mongostat shows dirty memory in the normal range (no evidence of storage pressure)
      6. The slow-query event coincides with flushes=1 (at 12:19:57), but *not every flush causes slow queries*; updates complete in milliseconds during most other flush events
      7. *No multi-document transactions were used* in this workload

      *This suggests the problem is NOT in the WiredTiger storage layer but in MongoDB's upper-layer execution logic.*
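One hedged way to test the checkpoint correlation from observation 6 is to sample serverStatus() once per second around a stall and check whether a checkpoint was in progress. The sketch below uses synthetic samples shaped like serverStatus output; the statistic name "transaction checkpoint currently running" is assumed to match the serverStatus().wiredTiger.transaction section on 4.4 and should be verified locally before relying on it:

```python
# Correlate per-second serverStatus()-style samples with a stall window.
# The WiredTiger statistic name below is an assumption; verify on your build.
def checkpoint_seconds(samples, stall_start, stall_end):
    """Timestamps inside [stall_start, stall_end] where a checkpoint was running."""
    return [
        s["ts"] for s in samples
        if stall_start <= s["ts"] <= stall_end
        and s["wiredTiger"]["transaction"]["transaction checkpoint currently running"] == 1
    ]

# Synthetic one-per-second samples mirroring the mongostat timeline above
# (flushes=1 at second 57; stall observed during seconds 58-63).
samples = [
    {"ts": t,
     "wiredTiger": {"transaction":
         {"transaction checkpoint currently running": 1 if 57 <= t <= 63 else 0}}}
    for t in range(52, 65)
]

overlap = checkpoint_seconds(samples, stall_start=58, stall_end=63)
print(f"checkpoint active during {len(overlap)} of the stall seconds")
```

If real samples show the same full overlap, the stall is checkpoint-correlated even though dirty cache looks healthy; if not, checkpoints can be ruled out.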

        5. Questions for Support

      1. *Is it expected behavior for an UPDATE operation to spend 2485ms in its execution stage when storage accounts for only 76μs?* What could cause such a large gap?

      2. *What could cause UPDATE operations to completely stop for 5-6 seconds?* The mongostat data shows updates drop to zero, then burst. This pattern suggests operations are waiting/blocked, not just slow.

      3. *How should we diagnose this?* What specific metrics or logs should we collect to determine the root cause?

      4. *What profiling or monitoring do you recommend* to help identify where the 2485ms is spent?

      Thank you for your investigation. We need guidance on how to diagnose and resolve this performance issue.

            Assignee: Unassigned
            Reporter: stephen yang