[SERVER-53024] Evaluate simdjson for reading JSON files in MQL queries Created: 23/Nov/20  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Pawel Terlecki Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Execution
Participants:

 Description   

Atlas DataLake currently uses external data access agents (written in Go) that parse data in various formats, convert it to BSON, and pass it to the query processing process (written in C++) over STDIN. For performance reasons, we are considering implementing parsing directly in C++ for the most common formats.

JSON is one of the most popular formats used by our customers. At the moment, we use an external parser based on xdg-go/jibby. The point of this investigation is to measure the performance of parsing such files with simdjson.

We will model scanning files directly in the query processor with a new input MQL stage:

{$collection: {path: <local path>, format: <format>}}

We only consider format: 'json'.



 Comments   
Comment by Pawel Terlecki [ 23/Nov/20 ]

The first implementation, with simdjson reading from memory-mapped files, works well. Properly vendoring the library in our build system needs more time.
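
For reference, a minimal sketch of the memory-mapping side only, assuming a POSIX system. The MappedFile class and its methods are hypothetical names for illustration and are not taken from the prototype. The resulting view is what would be handed to a parser; simdjson documents a padding requirement for its input buffers, which this sketch does not address.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <string_view>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper (illustration only): maps a regular file
// read-only and exposes its bytes as a string_view.
class MappedFile {
public:
    explicit MappedFile(const char* path) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;  // ok() stays false
        struct stat st{};
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            const auto len = static_cast<std::size_t>(st.st_size);
            void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const char*>(p);
                size_ = len;
            }
        }
        ::close(fd);  // the mapping remains valid after the descriptor is closed
    }

    ~MappedFile() {
        if (data_ != nullptr) ::munmap(const_cast<char*>(data_), size_);
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    // True if the file was opened and mapped successfully.
    bool ok() const { return data_ != nullptr; }

    // Bytes of the mapped file; empty if the mapping failed.
    std::string_view view() const {
        return data_ != nullptr ? std::string_view(data_, size_) : std::string_view();
    }

private:
    const char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

Note that this only covers regular files: a named pipe or a download in progress has no fixed size to map, so those inputs need a chunked, streaming read instead.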

The input file may be a named pipe, or it may be downloaded from cloud storage, which is why it is critical to have a streaming interface to simdjson. After talking to the authors, Daniel Lemire and John Keiser, it turns out that the current byte position can be obtained from a parser, which allows for identifying the last incomplete document.

https://github.com/simdjson/simdjson/pull/1301

// Gives the current index in the input document in bytes.
document_stream stream = parser.parse_many(json, window);
for (auto i = stream.begin(); i != stream.end(); ++i) {
    auto doc = *i;
    size_t index = i.current_index();
}
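
To illustrate how a byte position like this could support streaming input, here is a rough sketch of carrying the trailing incomplete document over to the next chunk. It does not use simdjson: parse_complete_documents is a hypothetical stand-in that assumes newline-delimited documents and returns the offset just past the last complete one, which is the role a value derived from current_index() would play. ChunkedDocumentReader is likewise an illustrative name, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the parser (assumes newline-delimited documents).
// Appends every complete document in `buf` to `out` and returns the byte
// offset just past the last complete document.
std::size_t parse_complete_documents(std::string_view buf,
                                     std::vector<std::string>& out) {
    std::size_t consumed = 0;
    while (true) {
        const std::size_t nl = buf.find('\n', consumed);
        if (nl == std::string_view::npos) break;  // the rest is incomplete
        if (nl > consumed) out.emplace_back(buf.substr(consumed, nl - consumed));
        consumed = nl + 1;
    }
    return consumed;
}

// Accumulates chunks from a stream (e.g. a named pipe or a download) and
// keeps the incomplete tail around until the next chunk arrives.
class ChunkedDocumentReader {
public:
    // Feeds one chunk and returns the documents that became complete.
    std::vector<std::string> feed(std::string_view chunk) {
        pending_.append(chunk);
        std::vector<std::string> docs;
        const std::size_t consumed = parse_complete_documents(pending_, docs);
        assert(consumed <= pending_.size());
        pending_.erase(0, consumed);  // drop parsed bytes, keep the tail
        return docs;
    }

    // Number of bytes belonging to the trailing incomplete document.
    std::size_t pending_bytes() const { return pending_.size(); }

private:
    std::string pending_;
};
```

Feeding the chunks `{"a":1}\n{"b"` and then `:2}\n` yields one document per call; after the first call the 4 bytes of `{"b"` stay pending until the second chunk completes them.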