MongoDB Slow Update Queries: Intermittent 6-30 Second Latency Despite Optimal Query Plans

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor - P4
    • Affects Version/s: 4.4.13
    • Component/s: None
    • Operating System: ALL
    • Steps to Reproduce: i can't reproduce it

        1. Environment
      • MongoDB Version: 4.4.13
      • Storage Engine: WiredTiger
      • Topology: Sharded cluster

        2. Problem Statement

      We are experiencing intermittent slow UPDATE queries in production. Single-document UPDATE operations that use the proper index (IXSCAN { assetId: 1 }, keysExamined=1, docsExamined=1) occasionally take 6-30 seconds during specific time windows, while most updates complete in milliseconds.

        3. Key Evidence from Production Logs
          3.1 Slow Query Log (6.5-second UPDATE at 12:20:03)

      ```json
      {"t":{"$date":"[sanitized_timestamp]"},"s":"I","c":"WRITE","id":51803,"ctx":"[sanitized_connection_id]","msg":"Slow query",
       "attr":{"type":"update","ns":"[sanitized_database].[collection_x]_asset",
        "command":{"q":{"assetId":[sanitized_value]},
         "u":{"$set":{"ips":[
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value],"addr":"[sanitized_ip]"},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"lastSeen":[sanitized_value],"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value]},
          {"firstSeen":[sanitized_value],"lastSeen":[sanitized_value],"addr":"[sanitized_ip]","mac":"[sanitized_mac]"},
          {"firstSeen":[sanitized_value],"lastSeen":[sanitized_value],"addr":"[sanitized_ip]","mac":"[sanitized_mac]"},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]},
          {"addr":"[sanitized_ip]","mac":"[sanitized_mac]","firstSeen":[sanitized_value],"lastSeen":[sanitized_value]}],
         "belongHost":[sanitized_value],"branchId":[sanitized_value],"businessName":[],"assetId":[sanitized_value],
         "date":{"$date":"[sanitized_timestamp]"},
         "device":{"classify1Id":[sanitized_value],"classifyId":[sanitized_value],"magnitude":[sanitized_value],"name":"[sanitized_hostname]","hostname":"[sanitized_hostname]"},
         "branchName":"[sanitized_value]","businessId":[]}},
         "multi":false,"upsert":true},
        "planSummary":"IXSCAN { assetId: 1 }",
        "keysExamined":1,"docsExamined":1,"nMatched":1,"nModified":1,"keysInserted":1,"keysDeleted":1,"numYields":1,
        "queryHash":"66E90436","planCacheKey":"2DCAD8F4",
        "locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":4}},"ReplicationStateTransition":{"acquireCount":{"w":4}},"Global":{"acquireCount":{"r":1,"w":3}},"Database":{"acquireCount":{"r":1,"w":3}},"Collection":{"acquireCount":{"w":3}},"Mutex":{"acquireCount":{"r":5}}},
        "flowControl":{"acquireCount":2,"timeAcquiringMicros":1},"storage":{},"durationMillis":6558}}
      ```

      *Critical*: the "storage" section of this entry is completely empty; no storage-layer reads or waits were recorded for this 6.5-second update.
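This can be checked mechanically. A minimal sketch that reproduces only the two relevant fields from the log entry above (the rest of the entry is elided):

```python
import json

# The slow-query entry reports durationMillis=6558 but an empty storage
# section, i.e. no storage-layer activity was recorded for the update.
entry = json.loads('{"attr": {"storage": {}, "durationMillis": 6558}}')
attr = entry["attr"]

storage_is_empty = attr["storage"] == {}
print(f"durationMillis={attr['durationMillis']}, storage empty={storage_is_empty}")
```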

          3.2 Profile Data (UPDATE stage: 2485ms, storage timeReadingMicros: 76μs)

      ```javascript
      { "op" : "update", "ns" : "[sanitized_database].[collection_x]_asset",
        "command" : { "q" : { "assetId" : [sanitized_value] },
          "u" : { "$set" : {
            "device" : { "classify1Id" : [sanitized_value], "classifyId" : [sanitized_value], "magnitude" : [sanitized_value], "name" : "[sanitized_value]", "hostname" : "[sanitized_hostname]" },
            "branchName" : "[sanitized_value]", "businessId" : [ ], "belongHost" : [sanitized_value],
            "assetId" : [sanitized_value], "branchId" : [sanitized_value], "businessName" : [ ],
            "ips" : [ { "addr" : "[sanitized_ip]", "mac" : "[sanitized_mac]", "firstSeen" : [sanitized_value], "lastSeen" : [sanitized_value] },
                      { "lastSeen" : [sanitized_value], "addr" : "[sanitized_ip]", "mac" : "[sanitized_mac]", "firstSeen" : [sanitized_value] } ],
            "date" : ISODate("[sanitized_timestamp]") } },
          "multi" : false, "upsert" : true },
        "keysExamined" : 1, "docsExamined" : 1, "nMatched" : 1, "nModified" : 1,
        "keysInserted" : 1, "keysDeleted" : 1, "numYield" : 1,
        "queryHash" : "66E90436", "planCacheKey" : "2DCAD8F4",
        "locks" : { "ParallelBatchWriterMode" : { "acquireCount" : { "r" : NumberLong(4) } },
          "ReplicationStateTransition" : { "acquireCount" : { "w" : NumberLong(5) } },
          "Global" : { "acquireCount" : { "r" : NumberLong(2), "w" : NumberLong(3) } },
          "Database" : { "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(3) } },
          "Collection" : { "acquireCount" : { "w" : NumberLong(3) } },
          "Mutex" : { "acquireCount" : { "r" : NumberLong(5) } } },
        "flowControl" : { "acquireCount" : NumberLong(2) },
        "storage" : { "data" : { "bytesRead" : NumberLong(53186), "timeReadingMicros" : NumberLong(76) } },
        "millis" : 2484, "planSummary" : "IXSCAN { assetId: 1 }",
        "execStats" : { "stage" : "UPDATE", "nReturned" : 0, "executionTimeMillisEstimate" : 2485,
          "works" : 2, "advanced" : 0, "needTime" : 1, "needYield" : 0, "saveState" : 1, "restoreState" : 1,
          "isEOF" : 1, "nMatched" : 1, "nWouldModify" : 1, "wouldInsert" : false,
          "inputStage" : { "stage" : "FETCH", "nReturned" : 1, "executionTimeMillisEstimate" : 0,
            "works" : 1, "advanced" : 1, "needTime" : 0, "needYield" : 0, "saveState" : 2, "restoreState" : 2,
            "isEOF" : 0, "docsExamined" : 1, "alreadyHasObj" : 0,
            "inputStage" : { "stage" : "IXSCAN", "nReturned" : 1, "executionTimeMillisEstimate" : 0,
              "works" : 1, "advanced" : 1, "needTime" : 0, "needYield" : 0, "saveState" : 2, "restoreState" : 2,
              "isEOF" : 0, "keyPattern" : { "assetId" : 1 }, "indexName" : "assetId_1",
              "isMultiKey" : false, "multiKeyPaths" : { "assetId" : [ ] }, "isUnique" : true,
              "isSparse" : false, "isPartial" : false, "indexVersion" : 2, "direction" : "forward",
              "indexBounds" : { "assetId" : [ "[[sanitized_value].0, [sanitized_value].0]" ] },
              "keysExamined" : 1, "seeks" : 1, "dupsTested" : 0, "dupsDropped" : 0 } } },
        "ts" : ISODate("[sanitized_timestamp]"), "client" : "[sanitized_ip]",
        "allUsers" : [ { "user" : "_system", "db" : "local" } ], "user" : "_system@local" }
      ```

       

      *Critical Finding*: UPDATE stage takes 2485ms but storage timeReadingMicros is only 76 microseconds (0.076ms).
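The gap can be quantified with a short script. The figures below are copied verbatim from the profiler document above; everything else is elided:

```python
# Quantify how much of the profiled UPDATE's latency storage can explain.
profile = {
    "millis": 2484,  # total operation latency reported by the profiler
    "execStats": {"stage": "UPDATE", "executionTimeMillisEstimate": 2485},
    "storage": {"data": {"bytesRead": 53186, "timeReadingMicros": 76}},
}

storage_ms = profile["storage"]["data"]["timeReadingMicros"] / 1000.0  # 0.076 ms
total_ms = profile["millis"]
unaccounted_ms = total_ms - storage_ms
unaccounted_frac = unaccounted_ms / total_ms

print(f"storage read time: {storage_ms:.3f} ms")
print(f"time not explained by storage: {unaccounted_ms:.3f} ms ({unaccounted_frac:.2%})")
```

In other words, well over 99.9% of the operation's latency is spent somewhere above the storage engine.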

          3.3 mongostat Data (updates stop completely for 5-6 seconds, then burst)

      ```
      insert query update delete getmore command dirty  used flushes vsize   res  qrw  arw net_in net_out conn     set repl                time
          *0   227   4589     *0    4307  2609|0  3.2% 79.8%       0 42.2G 12.1G  0|2  1|1  17.9m   30.4m  117 shard-1  PRI Jan 20 12:19:52.405
          *0   238   4719     *0    4515  2640|0  3.0% 79.8%       0 42.2G 12.1G  0|0  2|2  17.4m   23.4m  117 shard-1  PRI Jan 20 12:19:53.403
          *0   251   3705     *0    4675  2455|0  3.1% 79.9%       0 42.2G 12.1G  0|0  1|0  12.6m   26.2m  115 shard-1  PRI Jan 20 12:19:54.400
          *0   200   3599     *0    4763  2783|0  3.2% 80.0%       0 42.2G 12.1G  0|2  1|0  12.8m   16.8m  114 shard-1  PRI Jan 20 12:19:55.402
          *0   227   3740     *0    4832  2860|0  3.3% 80.0%       0 42.2G 12.1G  0|0  1|0  12.2m   18.9m  111 shard-1  PRI Jan 20 12:19:56.403
          *0   227   2557     *0    2887  1659|0  1.7% 80.0%       1 42.2G 12.2G  0|0 1|15  8.77m   10.3m  113 shard-1  PRI Jan 20 12:19:57.400
          *0   279      2     *0       0    12|0  0.4% 80.0%       0 42.2G 12.2G  0|0 1|18   382k    993k  115 shard-1  PRI Jan 20 12:19:58.402
          *0   259     *0     *0      10    13|0  0.4% 80.0%       0 42.2G 12.2G  0|0 1|19   362k   2.87m  116 shard-1  PRI Jan 20 12:19:59.404
          *0   255     *0     *0       7     4|0  0.4% 80.0%       0 42.2G 12.2G  0|0 2|19   350k   2.89m  116 shard-1  PRI Jan 20 12:20:00.404
          *0   224      1     *0      13    10|0  0.4% 80.0%       0 42.2G 12.2G  0|0 1|20   317k   8.52m  117 shard-1  PRI Jan 20 12:20:01.403
          *0   260     *0     *0      16     6|0  0.4% 80.0%       0 42.2G 12.2G  0|0 1|20   363k   7.27m  117 shard-1  PRI Jan 20 12:20:02.401
          *0   223     *0     *0       4    26|0  0.4% 79.9%       0 42.2G 12.2G  0|0 1|20   308k   2.60m  117 shard-1  PRI Jan 20 12:20:03.404
          *0   211   5357     *0    2550  1207|0  1.0% 80.1%       0 42.2G 12.2G  0|0  2|8  14.6m   21.5m  118 shard-1  PRI Jan 20 12:20:04.403
      ```

      *Critical Findings*:

      • Update throughput dropped from ~3,700/sec to zero
      • Updates stopped completely for ~5 seconds (12:19:59-12:20:03)
      • Throughput then burst to 5,357/sec, roughly 30-45% above the pre-stall rate
      • Dirty cache stayed in the normal range (0.4-3.3%), so there is no sign of cache pressure
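The stall-then-burst pattern can be made precise with a small script over the `update` column transcribed from the mongostat excerpt above (the 1% stall threshold is our own choice, not a MongoDB metric):

```python
# Update ops/sec per second, 12:19:52 through 12:20:04
# (mongostat prints *0 for an exact zero).
updates = [4589, 4719, 3705, 3599, 3740, 2557, 2, 0, 0, 1, 0, 0, 5357]

baseline = sum(updates[:5]) / 5                     # pre-stall average
stalled = [u for u in updates if u < baseline * 0.01]  # seconds effectively at zero
burst_ratio = updates[-1] / baseline

print(f"pre-stall baseline: {baseline:.0f}/sec")
print(f"seconds effectively at zero: {len(stalled)}")
print(f"post-stall burst: {burst_ratio:.2f}x baseline")
```

The burst looks like queued operations draining all at once, which is consistent with updates being blocked rather than individually slow.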
        4. Analysis and Questions for MongoDB Team

      Based on the evidence above, we observed:

       

      *Key Observations:*
      1. The storage layer is extremely fast: timeReadingMicros = 76μs (0.076ms)
      2. MongoDB's upper layer takes far longer: the UPDATE stage alone accounts for 2485ms
      3. Updates do not merely slow down; they stop completely for 5-6 seconds and then burst
      4. The slow-query log's storage field is empty (no cache-wait indicators)
      5. mongostat shows dirty memory in the normal range (no evidence of storage pressure)
      6. The slow-query event coincides with flushes=1 (at 12:19:57), but *not every flush causes slow queries*; updates complete in milliseconds during most other flush events
      7. *No multi-document transactions were used* in this workload

      *This suggests the problem is NOT in the WiredTiger storage layer but in MongoDB's upper-layer execution logic.*
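One hedged way to test the checkpoint correlation from observation 6 is to sample serverStatus() once per second around a stall and check whether a checkpoint was in progress. The sketch below uses synthetic samples shaped like serverStatus output; the statistic name "transaction checkpoint currently running" is assumed to match the serverStatus().wiredTiger.transaction section on 4.4 and should be verified locally before relying on it:

```python
# Correlate per-second serverStatus()-style samples with a stall window.
# The WiredTiger statistic name below is an assumption; verify on your build.
def checkpoint_seconds(samples, stall_start, stall_end):
    """Timestamps inside [stall_start, stall_end] where a checkpoint was running."""
    return [
        s["ts"] for s in samples
        if stall_start <= s["ts"] <= stall_end
        and s["wiredTiger"]["transaction"]["transaction checkpoint currently running"] == 1
    ]

# Synthetic one-per-second samples mirroring the mongostat timeline above
# (flushes=1 at second 57; stall observed during seconds 58-63).
samples = [
    {"ts": t,
     "wiredTiger": {"transaction":
         {"transaction checkpoint currently running": 1 if 57 <= t <= 63 else 0}}}
    for t in range(52, 65)
]

overlap = checkpoint_seconds(samples, stall_start=58, stall_end=63)
print(f"checkpoint active during {len(overlap)} of the stall seconds")
```

If real samples show the same full overlap, the stall is checkpoint-correlated even though dirty cache looks healthy; if not, checkpoints can be ruled out.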

        5. Questions for Support

      1. *Is it expected behavior for an UPDATE operation to spend 2485ms in its execution stage when storage accounts for only 76μs?* What could cause such a large gap?

      2. *What could cause UPDATE operations to completely stop for 5-6 seconds?* The mongostat data shows updates drop to zero, then burst. This pattern suggests operations are waiting/blocked, not just slow.

      3. *How should we diagnose this?* What specific metrics or logs should we collect to determine the root cause?

      4. *What profiling or monitoring do you recommend* to help identify where the 2485ms is spent?

      Thank you for your investigation. We need guidance on how to diagnose and resolve this performance issue.

            Assignee: Unassigned
            Reporter: stephen yang