[SERVER-82548] MongoDB 7.0.2 SBE selects a different index when doing $in with a large array Created: 29/Oct/23 Updated: 30/Nov/23 Resolved: 30/Nov/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Johnny Shields | Assignee: | Ivan Fefer |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Query Execution |
| Operating System: | ALL |
| Steps To Reproduce: | You can see this by running explain with the following query on our production cluster. The full response is below; note that it does not complete even after running for minutes. |
| Sprint: | QO 2023-11-13, QE 2023-11-27, QE 2023-12-11 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
(This ticket is one of several issues we at TableCheck had when upgrading from MongoDB 6.0.11 to 7.0.2; refer to https://support.mongodb.com/case/01207868) The following query seems to perform worse on MongoDB 7.0 vs. 6.0:
We expect this to ignore the sort and select the 1000 _id members using the built-in _id index. We observe similar effects on other indexed fields (aside from _id), where the sort condition seems to take priority over a direct index selection of the $in members. Note that this sort of query is very important for avoiding N+1 queries on the client side. For example, suppose we have 1000 payments: we will select the payments, map Payment.reservation_id, and then query Reservation with _id: $in => {reservation_ids}
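A rough shell sketch of the query shape being described (field and collection names are assumptions drawn from the rest of this ticket; the exact production query is not reproduced here):

```javascript
// Illustrative only: ids gathered from a prior query, then a $in lookup with a sort.
// We expect the {_id: {$in: ...}} predicate to drive index selection.
const payments = db.payments.find({ /* some filter */ }).toArray();
const reservationIds = payments.map(p => p.reservation_id);   // ~1000 ids

db.reservations.find({ _id: { $in: reservationIds } })
               .sort({ shop_id: 1, st: 1 });                  // sort that tempts the planner away from _id_1
```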
|
| Comments |
| Comment by Johnny Shields [ 30/Nov/23 ] | |||||||||||||||||||
|
ivan.fefer@mongodb.com Fine to close the ticket as a dupe. I am glad the MongoDB team is now appreciating the magnitude of the issues with the SBE multiplanner. | |||||||||||||||||||
| Comment by Ivan Fefer [ 30/Nov/23 ] | |||||||||||||||||||
|
Yes, I understand that this is unfortunate. Especially if the wrong plan is cached, it might take some time for MongoDB to realize that it is actually bad and needs to be re-planned. This might happen on the next query, but the latency for that query will be much higher, as it will do some work until it realizes that the plan is less efficient than expected, then go through multi-planning again, and only then execute the query itself. This problem is described in SERVER-20616. We should be able to address it with the new optimizer. In this case it is also amplified by the fact that we have separate cache entries for each $in array size, which means we risk this side effect ~1000 times more often. This should be addressed by SERVER-83712. In general, we are actively working on a project to fix the multi-planning issues with SBE ASAP. If there are no more questions, I would like to close this ticket as a duplicate of SERVER-20616. Thank you. | |||||||||||||||||||
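A small shell sketch of the caching behaviour described above (the collection name is an assumption; getPlanCache() is the standard shell helper):

```javascript
// List the cached plan entries for the collection; under SBE each distinct
// $in length gets its own entry, so this list can grow quickly.
db.reservations.getPlanCache().list().forEach(entry => {
    printjson({ queryHash: entry.queryHash, planCacheKey: entry.planCacheKey, isActive: entry.isActive });
});

// If a bad plan has been cached, clearing the cache forces re-planning on the next query.
db.reservations.getPlanCache().clear();
```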
| Comment by Johnny Shields [ 29/Nov/23 ] | |||||||||||||||||||
|
Ivan, I did another test and your theory that if I take 101 IDs randomly from the middle of the database, the issue appears to go away. The case I triggered it in the attached files I was taking the first 101 IDs from the Reservations collection. This perhaps why the wrong plan reached EOF first. BUT--anecdotally, I did see this issue in production. Imagine I am getting several such queries like this each second, probabilistically I may hit "wrong plan" even 0.1% of the time and it can cause my database CPU to max out. | |||||||||||||||||||
| Comment by Ivan Fefer [ 29/Nov/23 ] | |||||||||||||||||||
|
> a) Query Reservations filter: { restaurant_id: <my_restaurant._id> }. This will limit to a batch size of I think 101 according to this blog post: https://www.mongodb.com/community/forums/t/mis-understanding-batchsize/169713/2

It looks like the customer query filter can basically have an arbitrary number of ids in $in. | |||||||||||||||||||
| Comment by Ivan Fefer [ 29/Nov/23 ] | |||||||||||||||||||
|
I have looked at the attached explains.

In SERVER-82548_100_member_fast_case.txt, the wrong plan performs 10k reads (the trial limit), finds 100 documents, and doesn't reach EOF. The correct plan performs 300 reads (more on this later), finds 100 documents, and reaches EOF. Because it reaches EOF, it is considered the best plan and the query returns quickly. In SERVER-82548_101_member_slow_case.txt, the wrong plan performs 202 reads, finds 101 documents (the trial limit), and doesn't reach EOF. The fact that the wrong plan managed to find 101 documents with totalDocsExamined: 101 suggests that the _id's in the $in are the first 101 ids from the index. This indicates an extreme case of SERVER-20616. But there is more. The correct plan should at least match the efficiency of the incorrect plan (it will still lose because of the noSort bonus).

This may be a bug to be fixed, but we decided to move away from the SBE multiplanner completely, so I won't be creating a ticket for it unless we change our current plan. | |||||||||||||||||||
| Comment by Johnny Shields [ 29/Nov/23 ] | |||||||||||||||||||
|
ivan.fefer@mongodb.com to answer:

1. The number of elements in the query will vary from query to query, but can often be several hundred. Our app uses the Mongoid ODM and this occurs when doing eager loading to avoid N+1 queries. For example, suppose I have a list of Reservations for a Restaurant and I want to get all Customer names; in Mongoid I will write: my_restaurant.reservation.includes(:customers).map(&:name) Assuming I've already loaded the Restaurant, this will execute in the following steps (see the shell sketch after this list):
a) Query Reservations filter: { restaurant_id: <my_restaurant._id> }. This will limit to a batch size of I think 101 according to this blog post: https://www.mongodb.com/community/forums/t/mis-understanding-batchsize/169713/2

2. No, we do not do manual pagination limit/skip at all. We do use GETMORE as noted in the answer above.
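A rough shell sketch of the two queries this eager load boils down to (collection and field names are assumptions based on the discussion, not captured from the application):

```javascript
const myRestaurantId = ObjectId();   // placeholder for the already-loaded Restaurant's _id

// Step a): load the reservations for the restaurant (returned in batches of ~101).
const reservations = db.reservations.find({ restaurant_id: myRestaurantId }).toArray();

// Eager load: one $in query per batch, so the $in size varies with the batch size.
const customerIds = reservations.map(r => r.customer_id);
db.customers.find({ _id: { $in: customerIds } });
```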
CC: alexbevi | |||||||||||||||||||
| Comment by Ivan Fefer [ 29/Nov/23 ] | |||||||||||||||||||
|
One thing I see in the metrics is an almost linear growth of the planCache size. I know that this is not the main issue here, but I would like to ask a couple of questions about it: 1. Is the number of elements in $in in the query constant (~1000), or does it vary from query to query? Can it be any number from 1 to a lot, or is it batched like 1, 10, 50, 100, 1000? 2. Does the application do some manual pagination using a lot of limit and skip queries? I am asking because in SBE the number of arguments in $in is part of the plan cache key, as well as the exact limit and skip values. Thank you.
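A quick way to see the cache-key behaviour described here (a sketch, not from the ticket; collection and field names are assumptions):

```javascript
// Under SBE, two queries that differ only in the number of $in elements get
// different plan cache keys, and therefore separate plan cache entries.
const few  = db.reservations.find({ _id: { $in: [1, 2, 3] } })
                            .sort({ shop_id: 1, st: 1 }).explain();
const many = db.reservations.find({ _id: { $in: Array.from({ length: 500 }, (_, i) => i) } })
                            .sort({ shop_id: 1, st: 1 }).explain();

// Expected to differ when SBE is used for this query shape.
print(few.queryPlanner.planCacheKey, many.queryPlanner.planCacheKey);
```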
| |||||||||||||||||||
| Comment by David Storch [ 17/Nov/23 ] | |||||||||||||||||||
|
Thanks shields@tablecheck.com! What is certain is that the slowness in the attached explain output is a manifestation of SERVER-20616. That is our diagnosis. What's not clear is what is different about this query (or related queries) between versions 6.0 and 7.0. In order to dig deeper into that we'd need to see additional diagnostic output as requested above.
One thing you could try in order to gather this output is adding a maxTimeMS of something like 300,000 (5 minutes). I believe that explain("allPlansExecution") will still report useful partial statistics after the query times out, and that way you don't have to wait for a massive index scan to complete.
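A sketch of that suggestion in the shell (the filter and sort below are placeholders for the real production query):

```javascript
// Cap the run at 5 minutes so partial allPlansExecution statistics come back
// instead of the command running for hours. reservationIds is a placeholder
// for the real list of _ids from the production query.
db.reservations.find({ _id: { $in: reservationIds } })
               .sort({ shop_id: 1, st: 1 })
               .maxTimeMS(300000)
               .explain("allPlansExecution");
```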
The magic number is 101 and is defined by the parameter internalQueryPlanEvaluationMaxResults. This controls the length of the plan selection trial period; namely, whenever any of the candidate plans produces this many results, the trial period ends. When the $in list has fewer than 101 elements, we end up scanning the "unselective portion" of the index during the trial period, causing plan selection to correctly choose the SORT plan. When the $in list has 101 or more elements, the trial period terminates without ever looking at the "unselective portion" of the index. This is entirely consistent with SERVER-20616. | |||||||||||||||||||
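For reference, the parameter can be read with getParameter (a sketch; the default is 101):

```javascript
// Returns the current trial-period result limit discussed above.
db.adminCommand({ getParameter: 1, internalQueryPlanEvaluationMaxResults: 1 });
```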
| Comment by Johnny Shields [ 16/Nov/23 ] | |||||||||||||||||||
|
Hi David, thank you for the detailed response:

> This makes it clear that you are testing against production data. But is this also a production query?

> Would you be able to retrieve "allPlansExecution" verbosity explain

I was never able to get allPlansExecution to finish, even after waiting ~6 hours with this query, because I have ~200M records in this collection. I can try on staging data but it might not be apples-to-apples.

> Do let us know what you find with SBE disabled and provide us with the relevant explain output

Will target to have this next week.
Some further thoughts:
| |||||||||||||||||||
| Comment by David Storch [ 16/Nov/23 ] | |||||||||||||||||||
|
Thanks for the additional details shields@tablecheck.com. The script which I provided above reproduces exactly the plan selection error that you demonstrated with the attached explain output. I can also reproduce the facts that you describe – namely, that the problem reproduces with 101 $in elements but not with 100 and that the problem does not reproduce when removing the sort. The difference between what I see and what you describe is that I can reproduce the exact same behavior on 6.0 (using the classic engine), as well as on 7.0 with SBE enabled and on 7.0 with SBE disabled. So as far as I can tell, this seems to be a duplicate of the well-known issue SERVER-20616. Note that SERVER-20616 affects both the SBE and classic plan selection implementations because they implement a very similar algorithm. I'd like to elaborate further on the nature of the problem, and why I believe SERVER-20616 is at play. For the purposes of discussion, let's just focus on two of the plans generated for the problematic query:
The plan selection process works by partially executing each of the two candidate plans until either one of them finishes or produces one full batch of results. It just so happens that as the bad plan starts scanning the {shop_id: 1, st: 1, deleted_at: 1} index, every single keyed document passes the $in filter. This is the case for the first 100 or so keys in the index, but virtually all other index keys encountered later in the index scan will be used to fetch a document that subsequently gets discarded. So in reality, the index scan is highly unselective. However, if the $in list has enough elements, the trial period will end before the plan starts examining this "unselective portion" of the index. As a result, the system estimates this bad index scan to be 100% selective and scores it highly. The SORT plan uses a selective index scan, but is disfavored because the plan costing function penalizes it for the cost of the SORT operation. Hopefully this is clear, but in essence the problem is that the first ~100 index keys examined are not representative of the data overall, and therefore confound the plan selection algorithm. Again, this is a well-known problem that affects all versions of MongoDB and affects both the Classic Engine and SBE. Just to double-check that we are entirely on the same page here, about this:
This makes it clear that you are testing against production data. But is this also a production query? Does it seem correct to you that because of the way your application works, you would expect the documents matching the $in predicate to also be those documents which come first according to the sort order of {shop_id: 1, st: 1}? There is no doubt about why the particular explain output that you attached is choosing the wrong plan. I understand, however, that this might not be the full picture because you observed a difference in query performance between 6.0 and 7.0. To help debug a potential difference between 6.0 and 7.0 in more detail, would you be able to retrieve "allPlansExecution" verbosity explain output from an application query that is impacted? I'm not sure if it's possible for you to collect a "fast" explain from 6.0 and a "slow" explain from 7.0 to help us understand the difference. If not, then please collect the 7.0 explain output for a query that you know definitively regressed relative to 6.0, ideally with a slow query log line associated with the query running against 6.0. If needed, we can provide a private/secure way for you to upload this information for us to examine.
Sounds good. Do let us know what you find with SBE disabled and provide us with the relevant explain output. | |||||||||||||||||||
| Comment by Johnny Shields [ 15/Nov/23 ] | |||||||||||||||||||
|
Hi David, to give more color on this:
As next steps:
| |||||||||||||||||||
| Comment by David Storch [ 13/Nov/23 ] | |||||||||||||||||||
|
Hi shields@tablecheck.com! I've had a chance to examine the attached explain output. Unfortunately, I don't think there's enough detail for us to determine the root cause, so I'd like to gather some additional information so that our Query Engineering team can understand what's going wrong.
As a final note, I should clarify that the plan selection problem in the attached explain output is a well-known problem. See SERVER-20616 for more details. As a more concrete example motivated by this bug report, I wrote the following script to demonstrate the problem. In this example the _id and a fields are perfectly correlated.
This script causes the system to choose a bad plan (a full index scan of index {a: 1}) for both SBE and the Classic Engine. I tested it against 7.0, but I would expect it to reproduce on pretty much all versions of MongoDB. However, since you are experiencing this problem specifically on upgrade to 7.0, it's not clear to me that this script captures the full story about what's going on in your environment – that's why I'm requesting additional information to dig in a bit further. | |||||||||||||||||||
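The script itself is not reproduced above; here is a minimal sketch of a repro along the lines described (collection name, document count, and $in size are assumptions; _id and a are kept perfectly correlated, as in the comment):

```javascript
// Build a collection where _id and a are perfectly correlated.
db.repro.drop();
const docs = [];
for (let i = 0; i < 100000; i++) {
    docs.push({ _id: i, a: i });
}
db.repro.insertMany(docs);
db.repro.createIndex({ a: 1 });

// $in of the first 101 _ids plus a sort on {a: 1}. During the multi-planning
// trial every key scanned from {a: 1} happens to match the $in, so the full
// {a: 1} scan looks perfectly selective and beats the _id IXSCAN + SORT plan.
const ids = Array.from({ length: 101 }, (_, i) => i);
db.repro.find({ _id: { $in: ids } })
        .sort({ a: 1 })
        .explain("allPlansExecution");
```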
| Comment by Johnny Shields [ 29/Oct/23 ] | |||||||||||||||||||
|
Something very interesting here: if I set up to 100 members in the $in criteria vs. 101+ members, the Query Planner selects different indexes. Therefore, it seems that 100 is a "magic number" in the code. I have attached the explain() results for:
Unfortunately the 101+ member case is so slow that I had to kill the op before it could complete, but you can see the 100 member case selects the built-in _id_1 index, while the 101 case incorrectly selects shop_id_1_st_1_deleted_at_1. It seems that it's choosing this index due to the sort. |
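A shell sketch of the comparison described above (index and field names follow this ticket; the ids array is a stand-in for the real _ids used in the attached explains):

```javascript
// Take the first 101 _ids, mirroring the attached test.
const ids = db.reservations.find({}, { _id: 1 }).sort({ _id: 1 }).limit(101)
                           .toArray().map(d => d._id);

// 100-element $in: the attached explain shows the _id_1 index being chosen.
db.reservations.find({ _id: { $in: ids.slice(0, 100) } })
               .sort({ shop_id: 1, st: 1 })
               .explain("executionStats");

// 101-element $in: the attached explain shows shop_id_1_st_1_deleted_at_1 being chosen.
db.reservations.find({ _id: { $in: ids } })
               .sort({ shop_id: 1, st: 1 })
               .explain("executionStats");
```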