[SERVER-83671] Implement lookup tries for IDL parse in C++ Created: 28/Nov/23 Updated: 05/Feb/24 |
|
| Status: | In Progress |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Billy Donahue | Assignee: | Billy Donahue |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | perf-tiger | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Assigned Teams: |
Service Arch
|
||||||||||||||||||||
| Sprint: | Service Arch 2024-01-22, Service Arch 2024-02-05, Service Arch 2024-02-19 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Description |
|
Currently IDL uses tries, rendered into C++ as custom find functions generated for each hardcoded trie instance table, encoding the table's contents into the code as literals. We should be able to do something much more constexpr(?) C++ container-like. maybe a third_party solution. Sample API for a minimal wrapper to be called from Python code:
From that point on, _fieldNames is accessed by Python to accept StringData name and return the FieldId of the field having that StringData as a name. We can let the definition of exactly what FieldNameMap<T> is be an implementation detail in the idl_parser.h header. It's probably a thin wrapper around a FieldNameMap<size_t>, which is a thin wrapper around some kind of (TBD) PerfectlyHashedMap<StringData, size_t> or Trie<size_t>. It's pretty much a minimal associative unordered container's API, specifically made for this kind of search pattern (and const initialization) as an internal implementation choice. We already have FieldIds defined inside the parse functions (see attached trie.txt file) but they're not in an enum yet. We could just do that change, assigning each struct field an enumeration in the closed enum set. |
| Comments |
| Comment by Billy Donahue [ 04/Dec/23 ] | ||||||||
|
(Recording for future reference) I just wanted some class to use in the python code, but hadn't yet written the python side. But now the Python side has been done and the results are interesting.
Restructuring the code so that the hardoded generator is a function call and changing little else, already costs 503 - 444 = 59ns. A rather simple dynamic implementation is not too much worse than hardcoding, adding only another 527 - 503 = 24ns. Future work might be:
| ||||||||
| Comment by Amirsaman Memaripour [ 28/Nov/23 ] | ||||||||
|
Linking the hash-based PoC here, per triage discussion: | ||||||||
| Comment by Jason Chan [ 28/Nov/23 ] | ||||||||
|
billy.donahue@mongodb.com and george.wangensteen@mongodb.com also mentioned that continuing to rely on Python generation will impact compilation times. | ||||||||
| Comment by Billy Donahue [ 28/Nov/23 ] | ||||||||
|
The generation in Python tends to make the generated code hard to work with. I see some possible inefficiencies with the generated parser but it's hard to experiment as-is to improve that situation. |