- Type: Bug
- Resolution: Fixed
- Priority: Unknown
- Affects Version/s: None
- Component/s: None
- Python Drivers
- Not Needed
Bug Description
Problem: When using parallel batch processing (parallelism="threads" or
parallelism="processes"), PyMongoArrow fails when different batches have different schemas due
to type inference.
Specific scenario:
• First batch contains small integers (inferred as int32)
• Later batch contains large integers requiring int64
• The parallel code paths process each batch independently, creating separate Arrow tables
• When concatenating these tables with pa.concat_tables(), the schemas don't match (int32 vs
int64), causing an error
Root causes:
1. In `api.py`: tables were concatenated with promote_options="default", which only unifies schemas by filling in missing fields with nulls and does not promote numeric types (e.g. int32 to int64)
2. In `lib.pyx`: When promoting int32→int64 during schema inference, the old builder was
discarded, losing all previously appended int32 values
The fix:
1. Change promote_options to "permissive" to allow type promotion when concatenating tables
2. Preserve existing int32 values by casting them to int64 and re-appending to the new int64
builder
This ensures parallel and non-parallel code paths produce consistent results when schema
inference encounters mixed integer sizes.