- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: BSON, Performance
Use Case
As a... BSON user
I want... to deserialize large Latin BSON strings quickly, avoiding the overhead of per-byte UTF-8 translation
So that... I can speed up my application
An idea from anna.henningsen@mongodb.com: Try viewing the whole BSON document as a string, then find the beginning and end of each string within that view, to speed up fetching long Latin strings from documents.
User Experience
- What is the desired/expected outcome for the user once this ticket is implemented?
- Long strings are parsed faster
Dependencies
- upstream and/or downstream requirements and timelines to bear in mind
- None
Risks/Unknowns
- What could go wrong while implementing this change? (e.g., performance, inadvertent behavioral changes in adjacent functionality, existing tech debt, etc)
- Care must be taken not to misinterpret multibyte UTF-8 sequences
- Structurally, the current deserializer may be difficult to work within because of its recursive implementation. Refactoring may be necessary in order to share the string view with the whole decoding process.
- Is there an opportunity for better cross-driver alignment or testing in this area?
- Possibly. If performance improves, we should share the approach with other drivers where it is possible in their language; this is not necessarily something for the specs.
- Is there an opportunity to improve existing documentation on this subject?
- No
Acceptance Criteria
Implementation Requirements
- Attempt to:
- view the BSON document bytes as a JS string
- determine the start and end offsets of each string, and take slices from that view as you parse the BSON
- validate that the string does not contain multibyte sequences
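The three steps above can be sketched roughly as follows. This is a minimal illustration, not the driver's implementation: `latin1View`, `readInt32LE`, and `sliceAsciiString` are hypothetical names, and the sketch assumes the caller already knows the offset of a string's int32 length prefix. It returns `null` when a non-ASCII byte is found, so the caller can fall back to the regular UTF-8 decoder. In Node, building the view with `Buffer`'s `toString('latin1')` would likely be faster than the portable loop used here.

```typescript
// Hypothetical helpers for illustration only; not part of the js-bson API.

/** Map every byte to the code unit with the same value (a latin1 "view"). */
function latin1View(bytes: Uint8Array): string {
  const CHUNK = 0x2000; // keep apply() argument lists small
  let out = '';
  for (let i = 0; i < bytes.length; i += CHUNK) {
    const part = bytes.subarray(i, Math.min(i + CHUNK, bytes.length));
    out += String.fromCharCode.apply(null, part as unknown as number[]);
  }
  return out;
}

/** Read a little-endian int32, as used for BSON length prefixes. */
function readInt32LE(bytes: Uint8Array, offset: number): number {
  return (
    bytes[offset] |
    (bytes[offset + 1] << 8) |
    (bytes[offset + 2] << 16) |
    (bytes[offset + 3] << 24)
  );
}

/**
 * Slice a BSON string value out of the shared view.
 * `lengthOffset` points at the string's int32 length prefix.
 * Returns null if the span holds any byte above 0x7F, meaning the
 * caller must decode it as real UTF-8 instead.
 */
function sliceAsciiString(
  view: string,
  bytes: Uint8Array,
  lengthOffset: number
): string | null {
  const byteLength = readInt32LE(bytes, lengthOffset); // includes trailing NUL
  const start = lengthOffset + 4;
  const end = start + byteLength - 1; // index of the trailing NUL
  if (byteLength < 1 || end >= bytes.length || bytes[end] !== 0) {
    throw new Error('malformed BSON string');
  }
  for (let i = start; i < end; i++) {
    if (bytes[i] > 0x7f) return null; // part of a multibyte sequence
  }
  return view.slice(start, end);
}

// Usage: the document { a: "hello" }, hand-encoded.
const asciiDoc = new Uint8Array([
  18, 0, 0, 0, // total document size
  0x02, 0x61, 0x00, // string element named "a"
  6, 0, 0, 0, // string byte length, including the trailing NUL
  0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x00, // "hello" plus NUL
  0x00 // document terminator
]);
const asciiView = latin1View(asciiDoc);
const asciiValue = sliceAsciiString(asciiView, asciiDoc, 7);
console.log(asciiValue);
```

The view is built once per document and shared, which is why the recursive deserializer would need to pass it along to nested documents. Whether the validation scan costs less than it saves is exactly what the performance test should show.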
Testing Requirements
- Check for correctness
- If a performance test does not exist for long strings, add one to main first
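A long-string performance test could start from something like the sketch below. All names here (`buildStringDocument`, `baselineDecode`, `measureMsPerOp`) are hypothetical, and `baselineDecode` is only an ASCII-only stand-in for the function under test; in the real benchmark the measured call would be the library's own deserializer.

```typescript
// Hypothetical benchmark scaffolding; not taken from the js-bson repository.

/** Hand-encode the document { a: "xxx...x" } with `length` ASCII characters. */
function buildStringDocument(length: number): Uint8Array {
  // 4 (doc size) + 1 (type) + 2 ("a" + NUL) + 4 (string size)
  // + length + 1 (string NUL) + 1 (document terminator)
  const total = length + 13;
  const doc = new Uint8Array(total);
  const dv = new DataView(doc.buffer);
  dv.setInt32(0, total, true);
  doc[4] = 0x02; // BSON string element
  doc[5] = 0x61; // key "a"; doc[6] stays 0 as the key terminator
  dv.setInt32(7, length + 1, true);
  for (let i = 0; i < length; i++) doc[11 + i] = 0x78; // 'x'
  return doc; // trailing NULs are already zero
}

/** Stand-in decoder: per-byte concatenation, valid for ASCII payloads only. */
function baselineDecode(doc: Uint8Array): string {
  const dv = new DataView(doc.buffer, doc.byteOffset, doc.byteLength);
  const byteLength = dv.getInt32(7, true); // includes trailing NUL
  let out = '';
  for (let i = 11; i < 11 + byteLength - 1; i++) {
    out += String.fromCharCode(doc[i]);
  }
  return out;
}

/** Average wall-clock milliseconds per call, after one warm-up call. */
function measureMsPerOp(fn: () => void, iterations: number): number {
  fn();
  const startedAt = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  return (Date.now() - startedAt) / iterations;
}

const longDoc = buildStringDocument(100000);
const msPerOp = measureMsPerOp(() => {
  baselineDecode(longDoc);
}, 20);
console.log('baseline: ' + msPerOp.toFixed(3) + ' ms/op');
```

Running the same harness on main and on the branch, with several string lengths, gives a before/after comparison; a correctness check can reuse `buildStringDocument` and assert that both decoders return identical strings.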
Documentation Requirements
- None
Follow Up Requirements
- additional tickets to file, required releases, etc
- if Node behavior differs or will differ from other drivers, confirm with dbx devs what standard to aim for and what plan, if any, exists to reconcile the diverging behavior moving forward
- Are there additional optimizations if the string contains only a small number of multibyte characters?