- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: pymongoarrow
Context
From Sebastian (@sibbiii) on this GitHub issue:
We use MongoDB to store large amounts of data and, as a consequence, also often query large amounts of data. The MongoDB server handles the load pretty well, but when it comes to fetching the result with mongo-arrow in Python there is one big bottleneck, and this seems to be BSON decoding, so this line here.
When I comment it out and just print the len() of the batch, the speed is as I expect. Obviously, the BSON has to be decoded, and this takes CPU. Unfortunately, it seems that only one CPU core is used to decode BSON, which is quite frustrating as modern libraries such as polars run their calculations multi-core. Sure, we worked our way around this limitation, but the implementations are not clean.
Is there any chance to make the BSON decoding itself multi-core, or at least the batch processing, which looks like the perfect candidate for multi-core decoding?
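As a rough illustration of the idea, here is a minimal sketch of decoding raw BSON batches in parallel at the PyMongo level. aggregate_raw_batches and bson.decode_all are real PyMongo APIs, but the worker-pool layout, the parallel_find helper, and its parameters are assumptions for illustration, not pymongoarrow's implementation (pymongoarrow decodes straight into Arrow buffers rather than Python dicts). Processes are used rather than threads because building Python objects during decoding holds the GIL.

```python
from concurrent.futures import ProcessPoolExecutor

import bson
from pymongo import MongoClient


def decode_batch(raw: bytes) -> list[dict]:
    # bson.decode_all builds Python dicts and therefore holds the GIL,
    # so processes (not threads) are needed for real parallelism here.
    return bson.decode_all(raw)


def parallel_find(uri: str, db: str, coll: str, pipeline: list, workers: int = 4):
    # Hypothetical helper for illustration only.
    client = MongoClient(uri)
    collection = client[db][coll]
    # Each item yielded is one wire-protocol batch of concatenated BSON docs.
    # For simplicity this sketch materializes all batches before decoding.
    batches = [bytes(b) for b in collection.aggregate_raw_batches(pipeline)]
    docs = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for decoded in pool.map(decode_batch, batches):
            docs.extend(decoded)
    return docs
```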
Definition of done
Our work is to investigate how viable it is to run several batches of process_bson_stream in parallel within the aggregate_arrow_all call, and to share the performance numbers. The performance findings will decide whether we spawn another ticket to address the issue.
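A rough harness for gathering those numbers might look like the following. This is a sketch under the same assumptions as the snippet above: `batches` is the list of raw batches from aggregate_raw_batches, and decode_batch is the hypothetical helper, not pymongoarrow's process_bson_stream.

```python
import time
from concurrent.futures import ProcessPoolExecutor

import bson


def decode_batch(raw: bytes) -> int:
    # Return a count rather than the documents so the parent process
    # does not also pay for pickling every decoded dict back.
    return len(bson.decode_all(raw))


def bench(batches: list[bytes], workers: int) -> float:
    start = time.perf_counter()
    if workers == 1:
        for raw in batches:
            decode_batch(raw)
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            # list() forces the map to complete before the clock stops.
            list(pool.map(decode_batch, batches))
    return time.perf_counter() - start


# Example (with `batches` obtained as in the previous sketch):
#   for n in (1, 2, 4, 8):
#       print(f"{n} worker(s): {bench(batches, n):.2f}s")
```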
Pitfalls
Deciding how many threads to spawn may be the governing issue.
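One possible starting heuristic for that decision, purely as an assumption and not a project choice: cap the worker count at both the CPU core count and the number of batches, so small result sets do not pay pool start-up costs for idle workers.

```python
import os


def pick_workers(num_batches: int) -> int:
    # Hypothetical heuristic: no more workers than cores, and no more
    # workers than there are batches to decode.
    return max(1, min(os.cpu_count() or 1, num_batches))
```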
