[SERVER-79548] [SBE] optimize pulling multiple fields out of a large document Created: 31/Jul/23 Updated: 08/Dec/23 |
|
| Status: | Open |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Mickey Winters | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Assigned Teams: |
Query Execution
|
||||||||||||||||||||||||||||
| Sprint: | QE 2023-08-21, QE 2023-09-04, QE 2023-09-18, QE 2023-10-02, QE 2023-10-16, QE 2023-10-30 | ||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||
| Description |
|
in this group query we are grouping several fields with a common subpath
each field on the sub-object "foo" has their own traverse. Even worse if there was a common subpath after foo (eg "foo.bar.a", "foo.bar.b", "foo.bar.c") we traverse for the bar field multiple times, when it could just be done once. These n^2 lookups become more painful when documents are large and there are many paths to lookup. One solution is to try to generate plans that look more like this projection query.
this still however has a similar problem where each call to getField is done independently and can result in the full bson or subobject in the bson being scanned multiple times. there is logic in the classic engine here to avoid scanning a document multiple times if possible while doing an inclusion projection https://github.com/mongodb/mongo/blob/2e0259b3050e4c27d47e353222395d21bb80b9e4/src/mongo/db/exec/fastpath_projection_node.h#L88 |
| Comments |
| Comment by Mickey Winters [ 27/Nov/23 ] | ||||||||||||||||||||||||
|
alberto.massari@mongodb.com I think that is close to my original proposal and might have the same issue with multimap having to own its own contents and forcing copies. Also the status quo is O(PU) where P is how many values to pick from the document and U is how many keys there are in the document total. Ignoring the copies if we do what you proposed it would avoid having to deal with multiple return values but in exchange we only get O(U + P log(P)) instead of just O(U). | ||||||||||||||||||||||||
| Comment by Alberto Massari [ 27/Nov/23 ] | ||||||||||||||||||||||||
|
Chatting with projjal.chanda@mongodb.com we were thinking of having the getFields return directly a multiMap SBE value, and enabling getField to pick the requested label from the map (as it already supports reading form both bsonObject and Object) | ||||||||||||||||||||||||
| Comment by Mickey Winters [ 31/Oct/23 ] | ||||||||||||||||||||||||
|
sending to needs triage to figure out scheduling. This may need to be broken up into several tickets | ||||||||||||||||||||||||
| Comment by Mickey Winters [ 15/Aug/23 ] | ||||||||||||||||||||||||
|
Notes from some conversations with ian and martin so I don't forget. One idea was to do something roughly like this if you had a project for {"colors.green.airQuality": 1, "colors.blue.airQuality": 1, "colors.red.airQuality": 1}
Basically have a getFields expression to look up multiple fields on an object or common sub object. It wouldn't look up full paths, just one level of a path. if their are multiple fields to look up from the results of calling getFields then you would need another traverse and getFields call. In this example the idea was to return an array to deal with the fact that an expression in SBE can only return one value. Martin also pointed out that an issue with this is that arrays must own their contents so this would result in copies and that's not ideal. Martin proposed changing it so that expressions can return multiple values by pushing each one to the stack instead of just one result to the stack. We also need change project to assign the results to multiple slots. the traverse functions would also need a little change too. The plan then would look more like this.
One problem with this is the getFields function might want to construct a set to see if a field its currently iterating is one it needs to pull. constructing this set each time its evaluated could be expensive so this set should be constructed and then added to the runtime slots to avoid constructing it over and over again. The last problem which hasn't been very thoroughly thought out is convincing the stage builders to generate these plans. In bonsai there could be some rewrite that detects the common subpaths and collapses several projects into 1. We might need to add something like this as an end phase to the stage builders or maybe update the stage builders for project and group to specifically generate traverse's in this manner. one alternative mentioned was using multiple lamdas or having lambdas with multiple inputs
|