[SERVER-82548] MongoDB 7.0.2 SBE selects a different index when doing $in with a large array Created: 29/Oct/23  Updated: 30/Nov/23  Resolved: 30/Nov/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Johnny Shields Assignee: Ivan Fefer
Resolution: Duplicate Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File SERVER-82548_100_member_fast_case.txt     Text File SERVER-82548_101_member_slow_case.txt     PNG File Screenshot 2023-11-06 at 9.50.07 am.png     File atlas-lpiap3-shard-00-02.wg2kc.mongodb.net_2023-11-03T20_00_00_2023-11-04T04_00_00_mongodb.log.gz     PNG File image-2023-10-29-19-39-59-851.png    
Issue Links:
Duplicate
duplicates SERVER-20616 Plan ranker sampling from the beginni... Backlog
Related
related to SERVER-83712 Consider removing exact number of ele... Open
Assigned Teams:
Query Execution
Operating System: ALL
Steps To Reproduce:

You can see this by running explain with the following query on our production cluster:

find: "reservations"
filter:
  deleted_at: null
  _id: 
     $in: Array[1000]
sort:
  st: 1
  shop_id: 1 

The full explain response is attached; note that the query doesn't complete even after running for minutes.

Sprint: QO 2023-11-13, QE 2023-11-27, QE 2023-12-11
Participants:
Case:

 Description   

(This ticket is one of several issues we at TableCheck had when upgrading from MongoDB 6.0.11 to 7.0.2; refer to https://support.mongodb.com/case/01207868)

The following query seems to perform worse on MongoDB 7.0 vs. 6.0:

 

find: "reservations"
filter:
  deleted_at: null
  _id: 
     $in: Array[1000]
sort:
  st: 1
  shop_id: 1

We expect this to ignore the sort and select the 1000 _id members using the built-in _id index.

We observe similar effects on other indexed fields (aside from _id), where the sort condition seems to take priority over a direct index selection of the $in members.

Note that this sort of query is very important to avoid N+1 queries on the client side. For example, suppose we have 1000 payments: we select the payments, map Payment.reservation_id, and then query Reservation with _id: { $in: <reservation_ids> }.
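As a rough illustration of the pattern, a minimal mongosh sketch (collection names are assumed from the model names above, and the payments filter is a placeholder):

// Hypothetical eager-loading pattern: one batched $in query instead of 1000 single lookups.
const payments = db.payments.find({ /* some filter */ }).toArray();
const reservationIds = payments.map(p => p.reservation_id);

// We expect the _id index to drive the $in, with the sort applied afterwards.
db.reservations
    .find({ deleted_at: null, _id: { $in: reservationIds } })
    .sort({ st: 1, shop_id: 1 })
    .toArray();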

 



 Comments   
Comment by Johnny Shields [ 30/Nov/23 ]

ivan.fefer@mongodb.com Fine to close ticket as dupe.

I am glad the MongoDB team is now appreciating the magnitude of the issues with the SBE multiplanner.

Comment by Ivan Fefer [ 30/Nov/23 ]

Yes, I understand that this is unfortunate.

Especially if a wrong plan is cached, it might take some time for MongoDB to realize that the plan is actually bad and that it needs to re-plan. This might happen on the next query, but the latency for that query will be much higher, as it will do some work until it realizes that the plan is less efficient than expected, then go through multi-planning again, and only then execute the query itself.

This problem is described in SERVER-20616. We should be able to address it with the new optimizer.

In this case it is also amplified by the fact that we have separate cache entries for each $in array size, which means we risk this side effect ~1000 times more often. This should be addressed by SERVER-83712.

In general, we are actively working on a project to fix the multi-planning issues with SBE ASAP.
As a quick fix on your side, disabling SBE or adding a hint to the query should help.
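For example (a minimal mongosh sketch; reservationIds stands in for the application's $in list, and the setParameter command is the one mentioned elsewhere in this ticket):

// Option 1: hint the query at the built-in _id index.
db.reservations
    .find({ deleted_at: null, _id: { $in: reservationIds } })
    .sort({ st: 1, shop_id: 1 })
    .hint({ _id: 1 });

// Option 2: force the classic engine (disables SBE); on Atlas this parameter
// change needs to go through support.
db.adminCommand({ setParameter: 1, internalQueryFrameworkControl: "forceClassicEngine" });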

If there are no more questions, I would like to close this ticket as a duplicate of SERVER-20616.

Thank you.

Comment by Johnny Shields [ 29/Nov/23 ]

Ivan, I did another test and it confirms your theory: if I take 101 IDs randomly from the middle of the database, the issue appears to go away. In the case captured in the attached files I was taking the first 101 IDs from the Reservations collection, which is perhaps why the wrong plan reached EOF first.

BUT, anecdotally, I did see this issue in production. Imagine I am getting several such queries each second; even if I probabilistically hit the "wrong plan" only 0.1% of the time, it can cause my database CPU to max out.

Comment by Ivan Fefer [ 29/Nov/23 ]

> a) Query Reservations filter: { restaurant_id: <my_restaurant._id> }. This will limit to a batch size of I think 101 according to this blog post: https://www.mongodb.com/community/forums/t/mis-understanding-batchsize/169713/2
> b) Gather all Reservation.customer_ids from the Reservations in (a) into one list. Note: Although there are 101 reservations in the batch from (a), since Reservation.customer_ids is an array field, the gathered list of customer ids in this step can have more than 101 members.
> c) Query Customers filter: { _id: { $in: <customer ids from (b)> } }

This means the Customers query filter can basically have an arbitrary number of ids in $in.
I filed SERVER-83712 to address this issue in the future.

Comment by Ivan Fefer [ 29/Nov/23 ]

I have looked at the attached explains.
We are interested in two plans. Summary of these plans:

  1. Index scan of _id over multiple single-element ranges + sort. Let's call it the correct plan.
  2. Index scan of shop_id over [MinKey, MaxKey] that provides the sort + fetch with the _id filter. Let's call it the wrong plan.

In SERVER-82548_100_member_fast_case.txt:

The wrong plan performs 10k reads (the trial limit), finds 100 documents, and doesn't reach EOF.

The correct plan performs 300 reads (more on this later), finds 100 documents, and reaches EOF. Because it reaches EOF, it is considered the best plan and the query returned quickly.

In SERVER-82548_101_member_slow_case.txt:

The wrong plan performs 202 reads, finds 101 documents (the trial limit), and doesn't reach EOF.
However, because it found 101 documents, we updated the trial limit for reads to 202 reads (instead of the initial 10k).

The fact that the wrong plan managed to find 101 documents with totalDocsExamined: 101 suggests that the _ids in $in are the first 101 ids in the index. This indicates an extreme case of SERVER-20616.
shields@tablecheck.com Is it possible to repeat the experiment, but with a different set of _ids in $in? My findings suggest that if we take another set of _ids for $in, the problem would not be reproduced.
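For example, something like this could repeat the experiment with _ids sampled from the middle of the collection instead of the first 101 (a minimal mongosh sketch; only the filter, sort, and collection come from this ticket):

// Sample 101 random _ids rather than the first 101 in index order.
const randomIds = db.reservations
    .aggregate([{ $sample: { size: 101 } }, { $project: { _id: 1 } }])
    .toArray()
    .map(d => d._id);

// Re-run the problematic query and compare the candidate plans.
printjson(db.reservations
    .find({ deleted_at: null, _id: { $in: randomIds } })
    .sort({ st: 1, shop_id: 1 })
    .explain("allPlansExecution"));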

But there is more.

The correct plan should at least match the efficiency of the incorrect plan (it will still lose because of the noSort bonus).
However, the correct plan performs 202 reads (the new limit) but only finds 68 documents in that time.
This is because an ixscan over multiple single-document ranges has read-count amplification: for each key it counts 3 reads:

  1. Seek cursor to the start of the range.
  2. Fetch the document.
  3. Read next index key and see that it is outside of the current range, so we can move to the next range.

This may be a bug to be fixed, but we decided to move away from the SBE multiplanner completely, so I won't be creating a ticket for it unless we change our current plan.

cc david.storch@mongodb.com 

Comment by Johnny Shields [ 29/Nov/23 ]

ivan.fefer@mongodb.com to answer:

1. The number of elements in the query will vary from query to query, but can often be several hundred. Our app uses the Mongoid ODM and this occurs when doing eager loading to avoid N+1 queries. For example, suppose I have a list of Reservations for a Restaurant and I want to get all Customer names; in Mongoid I will write:

my_restaurant.reservation.includes(:customers).map(&:name)

Assuming I've already loaded the Restaurant, this will execute in the following steps:

a) Query Reservations filter: { restaurant_id: <my_restaurant._id> }. This will limit to a batch size of I think 101 according to this blog post: https://www.mongodb.com/community/forums/t/mis-understanding-batchsize/169713/2
b) Gather all Reservation.customer_ids from the Reservations in (a) into one list. Note: Although there are 101 reservations in the batch from (a), since Reservation.customer_ids is an array field, the gathered list of customer ids in this step can have more than 101 members.
c) Query Customers filter: { _id: { $in: <customer ids from (b)> } }
d) Do GETMORE on the query in (a), and repeat steps b-d until all Reservations/Customers are loaded.
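Roughly, in mongosh terms, the whole loop looks something like this (a sketch only; the restaurant _id placeholder and the helper are assumptions, and batch handling in the real driver/Mongoid is transparent):

// Steps (a)-(d): iterate Reservations in batches of ~101 and issue one
// Customers $in query per batch.
const BATCH = 101;
let batch = [];

const loadCustomers = () => {
    if (batch.length === 0) return;
    const customerIds = batch.flatMap(r => r.customer_ids);      // step (b)
    db.customers.find({ _id: { $in: customerIds } }).toArray();  // step (c); $in size is effectively unbounded
    batch = [];
};

db.reservations.find({ restaurant_id: myRestaurantId })          // step (a)
    .forEach(r => {
        batch.push(r);
        if (batch.length === BATCH) loadCustomers();             // step (d): per batch/getMore
    });
loadCustomers();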

2) No, we do not do manual pagination limit/skip at all. We do use GETMORE as noted in answer above.

 

CC: alexbevi 

Comment by Ivan Fefer [ 29/Nov/23 ]

One thing I also see in the metrics is an almost linear growth of the planCache size. I know that this is not the main issue here, but I would like to ask a couple of questions about it:

1. Is the number of elements in $in constant (~1000), or does it vary from query to query? Can it be any number from 1 to a lot, or is it batched like 1, 10, 50, 100, 1000?

2. Does the application do some manual pagination using a lot of limit and skip queries?

I am asking because in SBE the number of arguments in $in is a part of the plan cache key, as are the exact limit and skip values.
Limit and skip values are being fixed in SERVER-83492; for $in there is no ticket yet, but if you can clarify that this is also a problem, we may create one.
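For reference, the per-collection growth can be checked with the $planCacheStats aggregation stage (a minimal mongosh sketch):

// Count plan cache entries for the collection; with SBE, each distinct $in
// length (and each distinct limit/skip value) gets its own entry.
db.reservations.aggregate([
    { $planCacheStats: {} },
    { $count: "planCacheEntries" }
]);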

Thank you.

 

Comment by David Storch [ 17/Nov/23 ]

Thanks shields@tablecheck.com! What is certain is that the slowness in the attached explain output is a manifestation of SERVER-20616. That is our diagnosis. What's not clear is what is different about these (or related) queries between versions 6.0 and 7.0. In order to dig deeper into that we'd need to see additional diagnostic output as requested above.

> I was actually never able to get allPlansExecution to actually finish, even after waiting ~6 hours with this query, because I have ~200M records in this collection. I can try on staging data but it might not be apples-to-apples.

One thing you could try to gather this output is adding a maxTimeMS of something like 300,000 (5 minutes). I believe that explain("allPlansExecution") will still report useful partial statistics after the query times out, and that way you don't have to wait for a massive index scan to complete.
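Presumably something like this (a minimal mongosh sketch; idList stands in for the application's $in list):

// Bound the run to 5 minutes; allPlansExecution should still report partial
// per-plan statistics if the query times out.
db.reservations
    .find({ deleted_at: null, _id: { $in: idList } })
    .sort({ st: 1, shop_id: 1 })
    .maxTimeMS(300000)
    .explain("allPlansExecution");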

> The number 100 is highly suspicious as a "magic number". Effects are not observed "at approximately 100", they are observed at "exactly 100". I suspect there are separate $in <= 100 and $in > 100 regimes in the code, but I haven't looked.

The magic number is 101 and is defined by the parameter internalQueryPlanEvaluationMaxResults. This controls the length of the plan selection trial period: whenever any of the candidate plans produces this many results, the trial period ends. When the $in list has fewer than 101 elements, we end up scanning the "unselective portion" of the index during the trial period, causing plan selection to correctly choose the SORT plan. When the $in list has 101 or more elements, the trial period terminates without ever looking at the "unselective portion" of the index. This is entirely consistent with SERVER-20616.
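For reference, the current value of the parameter can be checked with getParameter (a minimal mongosh sketch):

// Defaults to 101; this is the trial-period result limit discussed above.
db.adminCommand({ getParameter: 1, internalQueryPlanEvaluationMaxResults: 1 });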

Comment by Johnny Shields [ 16/Nov/23 ]

Hi David, thank you for the detailed response:

> This makes it clear that you are testing against production data. But is this also a production query?

  • Yes, this same query in production was observed to be critically slow when we upgraded to MongoDB 7.0.
  • This caused our DB CPU to reach 100% in production, and so I went through the entire app codebase and hinted the ~20 instances of this query to use the right plan.
  • I am not sure whether it was related to the hinting, but eventually the query planner started caching my hinted pattern as the "winning plan", so now if I run the query, it executes quickly with the correct plan.
  • The explain() output however gives a clue to what was happening before the current winning plan was cached.

> Would you be able to retrieve "allPlansExecution" verbosity explain

I was actually never able to get allPlansExecution to actually finish, even after waiting ~6 hours with this query, because I have ~200M records in this collection. I can try on staging data but it might not be apples-to-apples.

> Do let us know what you find with SBE disabled and provide us with the relevant explain output

Will target to have this next week.

 

Some further thoughts:

  • The number 100 is highly suspicious as a "magic number". Effects are not observed "at approximately 100", they are observed at "exactly 100". I suspect there are separate $in <= 100 and $in > 100 regimes in the code, but I haven't looked.
  • Thinking very high level, it seems that SORT is much more important to how the SBE creates slots. So even if SERVER-20616 affects both SBE and Classic, it's likely that the problem is exposed or exacerbated by the SBE design assumptions.
Comment by David Storch [ 16/Nov/23 ]

Thanks for the additional details shields@tablecheck.com. The script which I provided above reproduces exactly the plan selection error that you demonstrated with the attached explain output. I can also reproduce the facts that you describe – namely, that the problem reproduces with 101 $in elements but not with 100 and that the problem does not reproduce when removing the sort. The difference between what I see and what you describe is that I can reproduce the exact same behavior on 6.0 (using the classic engine), as well as on 7.0 with SBE enabled and on 7.0 with SBE disabled. So as far as I can tell, this seems to be a duplicate of the well-known issue SERVER-20616. Note that SERVER-20616 affects both the SBE and classic plan selection implementations because they implement a very similar algorithm.

I'd like to elaborate further on the nature of the problem, and why I believe SERVER-20616 is at play. For the purposes of discussion, let's just focus on two of the plans generated for the problematic query:

  1. A plan with no SORT stage which does a full scan on the {shop_id: 1, st: 1, deleted_at: 1} index. This is the slow/bad plan, and it's the one that's getting selected when the $in list has at least 101 elements.
  2. A plan with a SORT stage that looks up the matching values using the {_id: 1} index and then sorts them. This is the good/fast plan, and the problem is that it's not getting selected.

The plan selection process works by partially executing each of the two candidate plans until either one of them finishes or produces one full batch of results. It just so happens that as the bad plan starts scanning the {shop_id: 1, st: 1, deleted_at: 1} index, every single keyed document passes the $in filter. This is the case for the first 100 or so keys in the index, but virtually all other index keys encountered later in the index scan will be used to fetch a document that subsequently gets discarded. So in reality, the index scan is highly unselective. However, if the $in list has enough elements, the trial period will end before the plan starts examining this "unselective portion" of the index. As a result, the system estimates this bad index scan to be 100% selective and scores it highly. The SORT plan uses a selective index scan, but is disfavored because the plan costing function penalizes it for the cost of the SORT operation. Hopefully this is clear, but in essence the problem is that the first ~100 index keys examined are not representative of the data overall, which confounds the plan selection algorithm. Again, this is a well-known problem that affects all versions of MongoDB and affects both the Classic Engine and SBE.

Just to double-check that we are entirely on the same page here, about this:

  • This explain data was gathered in our production environment.
  • I can easily reproduce this in my production environment.

This makes it clear that you are testing against production data. But is this also a production query? Does it seem correct to you that because of the way your application works, you would expect the documents matching the $in predicate to also be those documents which come first according to the sort order of {shop_id: 1, st: 1}?

There is no doubt about why the particular explain output that you attached is choosing the wrong plan. I understand, however, that this might not be the full picture because you observed a difference in query performance between 6.0 and 7.0. To help debug a potential difference between 6.0 and 7.0 in more detail, would you be able to retrieve "allPlansExecution" verbosity explain output from an application query that is impacted? I'm not sure if it's possible for you to collect a "fast" explain from 6.0 and a "slow" explain from 7.0 to help us understand the difference. If not, then please provide the 7.0 explain output for a query that you know definitively regressed relative to 6.0, ideally with a slow query log line from the same query running against 6.0. If needed, we can provide a private/secure way for you to upload this information for us to examine.

> I will check in my snapshot env if I can reproduce, and if yes will also try without SBE enabled.

Sounds good. Do let us know what you find with SBE disabled and provide us with the relevant explain output.

Comment by Johnny Shields [ 15/Nov/23 ]

Hi David, to give more color on this:

  • This explain data was gathered in our production environment.
  • I am confident this is an SBE-related effect because it never happened on MongoDB 6 and prior, and we began seeing it immediately after upgrading to 7 (with SBE enabled).
  • I can easily reproduce this in my production environment. I am using Atlas and I don't want to revert SBE there in order to keep things stable.
  • Removing the sort avoids the issue.
  • Querying with 100 $in members or less avoids the issue. (Adding the 101st member causes the issue.)

As next steps:

  • I will check in my snapshot env if I can reproduce, and if yes will also try without SBE enabled.
  • If someone from MongoDB is available to schedule a debug session with me, I am happy to screen-share my production Atlas cluster and try some queries.
Comment by David Storch [ 13/Nov/23 ]

Hi shields@tablecheck.com! I've had a chance to examine the attached explain output. Unfortunately, I don't think there's enough detail for us to determine the root cause, so I'd like to gather some additional information so that our Query Engineering team can understand what's going wrong.

  • First, was this explain gathered by running the query against real data or was it some sort of sandbox or reproduction environment? The reason that I ask is that there appears to be a suspicious correlation in the data. There are a few (presumably quite slow) candidate plans generated which avoid having to do a SORT by scanning one of the indexes prefixed with {shop_id: 1, st: 1}. For all such plans, all of the documents with the lowest values according to the sort order appear to pass the $in predicate on _id. Said another way, your test query appears to search for all of the first 101 _id values when the data is sorted by {shop_id: 1, st: 1}. Does this accurately reflect the queries that will actually run in your environment, or is it an artifact of how the repro was generated?
  • Are you able to reproduce the slow query after disabling the SBE engine? Assuming this is a test environment and not production, you can disable SBE with db.adminCommand({setParameter: 1, internalQueryFrameworkControl: "forceClassicEngine"}). If this is an Atlas cluster, you can reach out to our support team to make this parameter change on your behalf. It would be helpful to see an explain of the slow query after disabling SBE. The context here is that this could help to rule out an issue related specifically to the plan selection process used by the SBE engine. The comparison of an explain from SBE against an explain from the Classic Engine may give us more insight into what's happening.
  • If you have "allPlansExecution" verbosity explain output from any other manifestation of the problem (e.g. you mentioned you're seeing a similar problem for queries that do not mention _id), then this would be helpful as well.

As a final note, I should clarify that the plan selection problem in the attached explain output is a well-known problem. See SERVER-20616 for more details. As a more concrete example motivated by this bug report, I wrote the following script to demonstrate the problem. In this example the _id and a fields are perfectly correlated.

const TOTAL_DOCS = 1000;
const IN_LIST_SIZE = 110;
 
const coll = db.c;
coll.drop();
 
assert.commandWorked(coll.createIndex({a: 1}));
 
for (let i = 0; i < TOTAL_DOCS; ++i) {
    assert.commandWorked(coll.insert({_id: i, a: i}));
}
 
let inList = [];
for (let i = 0; i < IN_LIST_SIZE; ++i) {
    inList.push(i);
}
 
let explain = coll.find({_id: {$in: inList}}).sort({a: 1}).explain("allPlansExecution");
printjson(explain);

This script causes the system to choose a bad plan (a full index scan of index {a: 1}) for both SBE and the Classic Engine. I tested it against 7.0, but I would expect it to reproduce on pretty much all versions of MongoDB. However, since you are experiencing this problem specifically on upgrade to 7.0, it's not clear to me that this script captures the full story about what's going on in your environment – that's why I'm requesting additional information to dig in a bit further.

Comment by Johnny Shields [ 29/Oct/23 ]

Something very interesting here: if I use up to 100 members in the $in criteria vs. 101+ members, the Query Planner selects different indexes. Therefore, it seems that 100 is a "magic number" in the code.

I have attached the explain() results for:

  • 100-member $in case (fast)
  • 101-member $in case (slow)

Unfortunately the 101+ member case is so slow that I had to kill the op before it could complete, but you can see the 100-member case selects the built-in _id_1 index, while the 101-member case incorrectly selects shop_id_1_st_1_deleted_at_1. It seems that it's choosing this index due to the sort.
