CBR stalls query_read_commands workload


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker - P1
    • None
    • Affects Version/s: None
    • Component/s: None
    • Query Optimization
    • ALL

      1. Check out both the mongo and dsi repos.
      2. Modify dsi defaults as described here: https://docs.google.com/document/d/17-TeI6UqZ63GeenrNvoXwr5LwsTV_xVa_0w2IOl_OSM/edit?tab=t.ty35r9g3xbpk#heading=h.vz9x5po3txxv
      3. Submit a patch including dsi's changes (evergreen patch -p sys-perf --include-modules) and provide dsi's path.
      4. When configuring, select the Mongo-Perf Standalone inMemory ARM AWS 2023-11 variant and the query_read_commands task.
      5. Run the patch and wait 4h for it to time out.


      The query_read_commands sys-perf task fails when CBR is on. The symptom is that the workload exhausts admission control's tickets. The failure also correlates with a high number of open connections. It looks as if something is not being properly freed.

      https://spruce.mongodb.com/task/sys_perf_perf_mongo_perf_standalone.arm.aws.2023_11_query_read_commands_patch_0a706c8745fee49aa025aad020fd881c98b88a58_68da48c1975d240007107325_25_09_29_08_54_20/logs?execution=0

      Logs are flooded with messages like the following:

      {"t":{"$date":"2025-09-29T12:43:20.210+00:00"},"s":"W", "c":"STORAGE", "id":8373000, "ctx":"ThroughputProbingTicketHolderMonitor","msg":"Unable to acquire a ticket within deadline, which indicates the system is stalled","attr":{"ticketPool":"reads","total":8,"target":7,"throughput":0,"queued":5,"stalledMicros":59999922}}
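
To quantify how often the stall warning appears, a downloaded mongod log could be scanned for entries with id 8373000. The following is a minimal sketch, assuming the log is in the one-JSON-object-per-line structured format shown above. The function name `summarize_stalls` is hypothetical and not part of any MongoDB tooling.

```python
import json
from collections import Counter

# Log id of the "Unable to acquire a ticket within deadline" warning seen in this ticket.
STALL_LOG_ID = 8373000


def summarize_stalls(lines):
    """Count stall warnings per ticket pool and record the longest stall per pool.

    `lines` is any iterable of log lines (for example, an open file object).
    Lines that are not valid JSON objects are skipped.
    """
    counts = Counter()
    longest_stall_micros = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(entry, dict) or entry.get("id") != STALL_LOG_ID:
            continue
        attr = entry.get("attr", {})
        pool = attr.get("ticketPool", "unknown")
        counts[pool] += 1
        longest_stall_micros[pool] = max(
            longest_stall_micros.get(pool, 0), attr.get("stalledMicros", 0)
        )
    return counts, longest_stall_micros
```

Passing an open log file to `summarize_stalls` returns the number of warnings per ticket pool ("reads" in the message above) and the largest `stalledMicros` value seen for each pool.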

      FTDC metrics show an increased and sustained level of readers, connections, and WT cursors.

      Legend:

      • KO-CBR = Task timed out, CBR was enforced (planRankingMode = kAutomaticCE)
      • KO-MIX = Task timed out, CBR was only used as a fallback in cases where multiplanning was not able to either reach EOF or fill a batch. See SPM-4246 for reference
      • OK-MP = Task succeeded. Only multiplanner was used

        1. Screenshot 2025-09-29 at 15.09.51.png
          491 kB
          Carlos Alonso Pérez

            Assignee:
            Kartal Kaan Bozdogan
            Reporter:
            Carlos Alonso Pérez
            Votes:
            0
            Watchers:
            4

              Created:
              Updated: