[SERVER-62150] SBE Multiplanning can be slow when suboptimal plan runs first Created: 17/Dec/21 Updated: 27/Jan/24 |
|
| Status: | Open |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 5.1.1, 5.2.0-rc1, 6.0.12, 7.0.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Mihai Andrei | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 3 |
| Labels: | RDY |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Assigned Teams: |
Query Execution
|
| Operating System: | ALL |
| Sprint: | QE 2022-01-24, QO 2023-11-13, QO 2023-11-27 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Currently, the strategy used in SBE multiplanning is as follows:
The problem with this approach is that if the first plan we run is not the optimal one, we are stuck running it and can potentially use up the entire reads budget. As an example, consider two plans, A and B. Plan A needs to perform 10k storage engine reads to get 101 documents, while plan B needs to perform 101 reads to get 101 documents. If plan B runs first, we have no problems: we will set the reads limit for plan A to 101, and it will stop running after 101 reads. If plan A runs first, however, we will be stuck running plan A for all 10k reads. Although we will eventually run plan B and it will be chosen, the wasted reads on plan A hurt the performance of every query that needs to go through the multiplanner. |
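The order dependence in the example above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the server's implementation: it assumes candidate plans are trialed one after another, that a plan which finishes its trial tightens the reads limit for the plans after it, and that the initial limit is 10,000 reads. The names `Plan`, `multiplan`, and `initial_budget` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    reads_to_finish: int  # storage engine reads needed to return the 101 documents


def multiplan(plans, initial_budget):
    """Trial the plans in order; return (winner name, total reads spent).

    Assumption: a plan that finishes within the current limit lowers the
    limit for every plan trialed after it.
    """
    budget = initial_budget
    total_reads = 0
    best_name, best_reads = None, None
    for plan in plans:
        # A plan stops when it finishes or when it hits the current limit.
        used = min(plan.reads_to_finish, budget)
        total_reads += used
        finished = plan.reads_to_finish <= budget
        if finished and (best_reads is None or used < best_reads):
            best_name, best_reads = plan.name, used
            budget = used  # later plans may not read more than this
    return best_name, total_reads


plan_a = Plan("A", reads_to_finish=10_000)
plan_b = Plan("B", reads_to_finish=101)

# Hypothetical initial limit, chosen so that plan A can run to completion.
print(multiplan([plan_b, plan_a], initial_budget=10_000))  # ('B', 202)
print(multiplan([plan_a, plan_b], initial_budget=10_000))  # ('B', 10101)
```

Under these assumptions plan B wins in both orderings, but the trial phase costs 202 reads when B runs first and 10,101 reads when A runs first, which is the slowdown this ticket describes.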
| Comments |
| Comment by Johnny Shields [ 01/Dec/23 ] |
|
David, Ivan, thank you both. It means a lot to me as a customer that MongoDB is tackling these issues with high priority. |
| Comment by David Storch [ 01/Dec/23 ] |
|
To add onto what Ivan said, I'm going to move this ticket back to the "Open" state – we are no longer "Investigating" this issue, but rather are executing on an engineering project to fix the problem. The solution will be delivered against a sequence of related Jira tickets rather than developing directly against this ticket. However, we can provide high-level progress updates here. |
| Comment by Ivan Fefer [ 30/Nov/23 ] |
|
Yes, I am aware. We are working on a solution. However, it requires a redesign of the whole multiplanning process with SBE and will take some time to develop, test, and release. To improve the customer experience in the meantime, we are planning a change in default configuration via As well, we are going to improve our testing process so that we do not miss this again with the next SBE release. |
| Comment by Johnny Shields [ 30/Nov/23 ] |
|
ivan.fefer@mongodb.com are you also aware of this issue? This is another one we are seeing related to SBE. |
| Comment by Johnny Shields [ 08/Nov/23 ] |
|
Following from my report in I anticipate many CRUD apps on MongoDB will be affected by this. When we initially upgraded to MongoDB 7 ~3 weeks ago we saw it breaking our app in a number of critical places. Please give this issue the attention it deserves. |
| Comment by Ana Meza [ 12/Apr/22 ] |
|
Waiting on other tickets first |
| Comment by David Storch [ 07/Apr/22 ] |
|
Returning this to the triage queue. At the moment our efforts related to this problem fall under |
| Comment by David Storch [ 14/Feb/22 ] |
|
Another quick update. We have filed two additional offshoot tickets:
Folks interested in this ticket may wish to watch these two new related ones. This ticket will continue to serve as the umbrella. There is no specific engineering work planned against the umbrella ticket at this time, but |
| Comment by David Storch [ 28/Jan/22 ] |
|
Related ticket |
| Comment by David Storch [ 25/Jan/22 ] |
|
The Query Team has been internally brainstorming several potential solutions to this problem. We have generated a handful of ideas of various implementation complexity which I will describe below, mostly for the benefit of query engineering. However, we think there is one simple change that we should implement immediately, which we expect should go a long way towards mitigating the problem described here: Once
|