[SERVER-78580] Improve $project parsing to avoid quadratic behavior Created: 30/Jun/23  Updated: 29/Oct/23  Resolved: 14/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Matt Boros Assignee: Matt Boros
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File figure_with_fix.png     PNG File screenshot-1.png    
Issue Links:
Problem/Incident
Related
related to SERVER-62509 Write tests to stress ABT and Bonsai Closed
Assigned Teams:
Query Optimization
Backwards Compatibility: Fully Compatible
Sprint: QE 2023-07-24, QE 2023-08-07, QE 2023-08-21
Participants:
Linked BF Score: 35

 Description   

ProjectionPathASTNode getChild performs a linear search through a vector of field names to find the correct child. Because of this search, we see n^2 behavior on the number of fields in the $project. As we add children to this node in addNodeAtPathHelper, we also call getChild to search through everything we've added so far.

ProjectionPathASTNode could use a map instead of a vector to avoid this behavior.

This ticket should also involve performance testing with a map. If we see a regression, we could consider using a map when there are >50 (or some N) elements.



 Comments   
Comment by Githook User [ 07/Aug/23 ]

Author:

{'name': 'Matt Boros', 'email': 'matt.boros@mongodb.com', 'username': 'mattBoros'}

Message: SERVER-78580 Improve $project parsing to avoid quadratic behavior
Branch: minh.luu-no_compile_sys-perf
https://github.com/mongodb/mongo/commit/196f02229d587c1e1212a3a21c6c670f29d75839

Comment by Githook User [ 03/Aug/23 ]

Author:

{'name': 'Matt Boros', 'email': 'matt.boros@mongodb.com', 'username': 'mattBoros'}

Message: SERVER-78580 Improve $project parsing to avoid quadratic behavior
Branch: master
https://github.com/mongodb/mongo/commit/196f02229d587c1e1212a3a21c6c670f29d75839

Comment by Githook User [ 01/Aug/23 ]

Author:

{'name': 'Matthew Boros', 'email': 'mattBoros@users.noreply.github.com', 'username': 'mattBoros'}

Message: SERVER-78580 projection parsing workload (#956)
Branch: master
https://github.com/mongodb/genny/commit/3ca962077cb08b0f42b5e8eeba13616c5b66666a

Comment by Matt Boros [ 27/Jul/23 ]

https://github.com/mongodb/genny/pull/956 for genny workload

Comment by Matt Boros [ 19/Jul/23 ]

Decided that SERVER-78631 could add this generic structure, because it doesn't fit cleanly into this ticket's scope. I'll post the code I had for this vector/map structure in a comment under SERVER-78631.

Comment by Matt Boros [ 19/Jul/23 ]

My plan for this was to create a generic map class templated on the key/value types and the type of map to use when the number of elements is greater than some threshold. Until that threshold is hit, a vector would be used. This would solve the root problem of this ticket which is ProjectionPathASTNode. It could also be used for SERVER-78631.

However ProjectionPathASTNode needs the inserted data to be kept in order, which is possible with a hash map where the keys are part of a linked list. This is not a data structure provided by the standard library. Our preferred way to get the linked hash map behavior is to use a boost::multi_index_container which does not have the same interface as a regular map, and therefore wouldn't work with the generic container I was thinking of creating. Similarly, the LRU Cache implementation we have maintains its own data structures rather than using a linked hash map.

I could
1. Give up on creating this generic data structure and fix the problem non-generically.
2. Write my own LinkedHashMap on top of an existing c++ map structure, then refactor lru_cache.h and the generic map structure I described to use this.
3. Find a library that provides a linked hash map. This would be excessive.

I think I will go with #2 but will confirm with someone first.

Comment by Matt Boros [ 11/Jul/23 ]

My plan is to use a vector for small projections and when enough nodes are added we switch over to a map. I did some investigation and found n=100 to be a good crossover point. Unfortunately there aren't any jstests in agg/ or core/ that test for this size projection so I'll add one. I'm considering adding a Genny workload as well.

Ran a perf patch with workloads we currently have, and as expected it's too noisy to make conclusions. I'll run another patch and test locally to see what's going on.

Here is what the curve looks like with the fix.

Comment by Matt Boros [ 07/Jul/23 ]

Thought about this and I don't think this ticket belongs in the Bonsai perf project. Will move it out later.

Comment by Matt Boros [ 30/Jun/23 ]

For the largest $projects you can create, of the form {$project: {a: 1, b: 1, c: 1, ...}}, this currently takes 30+ minutes to parse. With a proof of concept I have, it takes less than a minute.

Proof of concept: https://github.com/10gen/mongo/compare/master...project-parse-fast

Generated at Thu Feb 08 06:38:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.