[SERVER-82549] MongoDB 7.0.2 SBE query slow when direct index exists Created: 29/Oct/23  Updated: 17/Nov/23  Resolved: 07/Nov/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Johnny Shields Assignee: David Storch
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File SERVER-82549_explain_fast_query     HTML File SERVER-82549_explain_slow_query     HTML File SERVER-82549_explain_slow_query_hinted    
Issue Links:
Duplicate
duplicates SERVER-62150 SBE Multiplanning can be slow when su... Open
Related
related to SERVER-82764 Make "planningTimeMicros" incorporate... Closed
is related to SERVER-82677 Deduplicate index scan + fetch plans ... In Code Review
Assigned Teams:
Query Execution
Operating System: ALL
Steps To Reproduce:

Here is an example query which executed in 1.173 seconds (we are seeing up to 10 seconds). Since there is a direct index on customer_user_id: 1, we would expect this to return immediately.

Note that if you remove deleted_at: null and sort: _id: 1 from the below the result is essentially the same.

{
  "type": "command",
  "ns": "tablesolution_production.customers",
  "appName": "_________-production-756fdf6d9-kvmmt",
  "command": {
    "find": "customers",
    "$db": "___________production",
    "filter": {
      "deleted_at": null,
      "franchise_id": {
        "$oid": "5b3069a900869bb32e00143f"
      },
      "customer_user_id": {
        "$oid": "6267a29a50102f009b84e9d1"
      }
    },
    "limit": 1,
    "sort": {
      "_id": 1
    },
    "$readPreference": {
      "mode": "secondaryPreferred",
      "maxStalenessSeconds": 90
    },
    "$clusterTime": {
      "clusterTime": {
        "$timestamp": {
          "t": 1698575535,
          "i": 168
        }
      },
      "signature": {
        "hash": {
          "$binary": {
            "base64": "n99TLiSJtTiHLaABvmlpW3MgUh0=",
            "subType": "00"
          }
        },
        "keyId": 7250897925230821000
      }
    },
    "lsid": {
      "id": {
        "$binary": {
          "base64": "ypdlguaMRtaKds622AtuZg==",
          "subType": "04"
        }
      }
    }
  },
  "planSummary": "IXSCAN { customer_user_id: 1 }",
  "planningTimeMicros": 752,
  "keysExamined": 1,
  "docsExamined": 1,
  "hasSortStage": true,
  "fromMultiPlanner": true,
  "nBatches": 1,
  "cursorExhausted": true,
  "numYields": 118,
  "nreturned": 0,
  "queryHash": "4EF91298",
  "planCacheKey": "40FDA966",
  "queryFramework": "sbe",
  "reslen": 248,
  "locks": {
    "FeatureCompatibilityVersion": {
      "acquireCount": {
        "r": 119
      }
    },
    "Global": {
      "acquireCount": {
        "r": 119
      }
    }
  },
  "readConcern": {
    "level": "local",
    "provenance": "implicitDefault"
  },
  "storage": {
    "data": {
      "bytesRead": 474208264,
      "timeReadingMicros": 334080
    }
  },
  "cpuNanos": 1036262981,
  "remote": "10.20.117.178:11491",
  "protocol": "op_msg",
  "durationMillis": 1173,
  "v": "7.0.2"
}

 

Sprint: QO 2023-11-13
Participants:
Case:

 Description   

(This ticket is one of several issues we at TableCheck had when upgrading from MongoDB 6.0.11 to 7.0.2; refer to https://support.mongodb.com/case/01207868)

We have a collection Customers with a scalar ID field "customer_user_id" (foreign key). When we query with complex conditions on this field, we see queries taking up to 10 seconds, for example:

 

"filter": {
  "franchise_id": "5b3069a900869bb32e00143f",
  "customer_user_id": "6267a29a50102f009b84e9d1"
}
(matches 1 document, executionTimeMillis = 6961)

If we remove the franchise_id condition, the query returns immediately:

 

"filter": {
  "customer_user_id": "6267a29a50102f009b84e9d1"
}
 
(matches 2 documents, executionTimeMillis = 0)

Note that both queries are ultimately selecting the same index (customer_user_id_1).

Anecdotally we did not see the same issue on the "Classic" query engine. Also, if we hint the affected query with the index it ultimately selects, it executes immediately (executionTimeMillis = 1)



 Comments   
Comment by Jussi Kauppila [ 17/Nov/23 ]

We have the exactly same issue. Query planner picks up the index for sort instead of the optimal index for the key value search.

    1. The  {score: 1} index is selected instead of the optimal index {internal_id: 1} . 

      db.coll.find({"internal_id": {"$in": ["XY1234567"]}}, {internal_id: 1}).sort({score: -1})

    1. If I remove the sort the optimal {internal_id: 1} index is used.

      db.coll.find({"internal_id": {"$in": ["XY1234567"]}}, {internal_id: 1}

      These queries were totally fine before we deployed to Mongo 7.

 

Comment by Johnny Shields [ 08/Nov/23 ]

Hi David, thank you for this detailed explanation, it is useful. I've left a comment on SERVER-62150 to give an idea of the scope of impact for us, hopefully it is useful in helping to get it prioritized.

Comment by David Storch [ 07/Nov/23 ]

Hi shields@tablecheck.com, thanks for providing this detailed report! The explain output was quite useful and allowed us to confirm that this is a manifestation of known issue SERVER-62150. Therefore, I am closing this issue as a duplicate of SERVER-62150 but you can read on for additional technical details.

MongoDB's query optimizer has always chosen a data access path (i.e. which index to use) by partially executing a set of candidate plans, and then examining the resulting runtime statistics in order to choose a winning plan. This process is called multi-planning, and it exists for both for the Classic Execution Engine as well as Slot-Based Execution (SBE). However, due to some implementation details that I won't describe here, SBE has its own implementation of multi-planning which differs from the Classic Engine implementation of multi-planning in some key respects. SERVER-62150 is a flaw in the SBE multi-planning process which can cause the partial execution of candidate plans (known as the "trial period") to go on for too long. In the explain output you provided for the slow query, I can see in the allPlansExecution section that that there are 15 candidate plans generated. The winning plan does just a handful of reads from storage, reading two index keys and two documents. Amongst the remaining 14 rejected plans, 13 of them did exactly 10,000 reads from storage. 10,000 is the preconfigured limit for the maximum number of reads allowed during the SBE trial period. Each of these trial periods is reported in the explain to have taken anywhere from ~100ms to more than 1 second, which clearly explains the poor query performance you are experiencing. The problem goes away when you provide a hint and when you rewrite the query to only include the predicate on customer_user_id, since in either case the system ends up generating just one candidate plan and therefore does not need to go through the SBE multi-planner.

Note that we have implemented a few improvements already around SBE multi-planning: see SERVER-63641 and SERVER-62981. We also implemented SERVER-63642 which should help us understand the prevalence of this issue internally. Of course, this ticket clearly demonstrates a situation in which SBE multi-planning can still perform poorly despite these improvements, so we'll be looking into other options for mitigating or resolving SERVER-62150. In the meantime, you can work around the issue using hint() or by disabling the SBE engine.

Hope that all makes sense, but let me know if you have any other follow-up questions!

Comment by Johnny Shields [ 29/Oct/23 ]

I have attached 3 explain() outputs:

  • For the slow query - customers filter: customer_user_id: X, franchise_id: Y
  • For the fast query - customers filter: customer_user_id: X
  • For the slow query, when hinted
Generated at Thu Feb 08 06:49:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.