- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: pymongoarrow
Context
From Sebastian (@sibbiii) on this GitHub issue:
We use MongoDB to store large amounts of data and, as a consequence, also often query large amounts of data. The MongoDB server handles the load pretty well, but when it comes to fetching the result with mongo-arrow in Python there is one big bottleneck, and this seems to be BSON decoding, so this line here.
When I comment it out and just print the len() of the batch, the speed is as I expect. Obviously, the BSON has to be decoded, and this takes CPU. Unfortunately, it seems that only one CPU core is used to decode BSON, which is quite frustrating as modern libraries such as polars run their calculations multi-core. Sure, we worked our way around this limitation, but the implementations are not clean.
Is there any chance to make the BSON decoding itself multi-core, or at least the batch processing, which looks like the perfect candidate for multi-core decoding?
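As a rough illustration of the idea, here is a minimal sketch of decoding raw BSON batches in parallel at the PyMongo level. aggregate_raw_batches and bson.decode_all are real PyMongo APIs, but the worker-pool layout, the parallel_find helper, and its parameters are assumptions for illustration, not pymongoarrow's implementation (pymongoarrow decodes straight into Arrow buffers rather than Python dicts). Processes are used rather than threads because building Python objects during decoding holds the GIL.

```python
from concurrent.futures import ProcessPoolExecutor

import bson
from pymongo import MongoClient


def decode_batch(raw: bytes) -> list[dict]:
    # bson.decode_all builds Python dicts and therefore holds the GIL,
    # so processes (not threads) are needed for real parallelism here.
    return bson.decode_all(raw)


def parallel_find(uri: str, db: str, coll: str, pipeline: list, workers: int = 4):
    # Hypothetical helper for illustration only.
    client = MongoClient(uri)
    collection = client[db][coll]
    # Each item yielded is one wire-protocol batch of concatenated BSON docs.
    # For simplicity this sketch materializes all batches before decoding.
    batches = [bytes(b) for b in collection.aggregate_raw_batches(pipeline)]
    docs = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for decoded in pool.map(decode_batch, batches):
            docs.extend(decoded)
    return docs
```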
Definition of done
Our work is to investigate how viable it is to run several batches of process_bson_stream in parallel within the aggregate_arrow_all call, and to share the performance numbers. The performance findings will decide whether we spawn another ticket to address the issue.
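A rough harness for gathering those numbers might look like the following. This is a sketch under the same assumptions as the snippet above: `batches` is the list of raw batches from aggregate_raw_batches, and decode_batch is the hypothetical helper, not pymongoarrow's process_bson_stream.

```python
import time
from concurrent.futures import ProcessPoolExecutor

import bson


def decode_batch(raw: bytes) -> int:
    # Return a count rather than the documents so the parent process
    # does not also pay for pickling every decoded dict back.
    return len(bson.decode_all(raw))


def bench(batches: list[bytes], workers: int) -> float:
    start = time.perf_counter()
    if workers == 1:
        for raw in batches:
            decode_batch(raw)
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            # list() forces the map to complete before the clock stops.
            list(pool.map(decode_batch, batches))
    return time.perf_counter() - start


# Example (with `batches` obtained as in the previous sketch):
#   for n in (1, 2, 4, 8):
#       print(f"{n} worker(s): {bench(batches, n):.2f}s")
```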
Pitfalls
Deciding how many threads to spawn may be the governing issue.
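One possible starting heuristic for that decision, purely as an assumption and not a project choice: cap the worker count at both the CPU core count and the number of batches, so small result sets do not pay pool start-up costs for idle workers.

```python
import os


def pick_workers(num_batches: int) -> int:
    # Hypothetical heuristic: no more workers than cores, and no more
    # workers than there are batches to decode.
    return max(1, min(os.cpu_count() or 1, num_batches))
```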
