[PyMongoArrow] fails when different batches have different schemas due to type inference.


         Bug Description

         Problem: When using parallel batch processing (parallelism="threads" or
         parallelism="processes"), PyMongoArrow fails when different batches have different schemas due
         to type inference.

         Specific scenario:
           • First batch contains small integers (inferred as int32)
           • Later batch contains large integers requiring int64
           • The parallel code paths process each batch independently, creating separate Arrow tables
           • When concatenating these tables with pa.concat_tables(), the schemas don't match (int32 vs
             int64), causing an error
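
         The mismatch can be reproduced directly with pyarrow (a minimal sketch using standalone
         tables, not PyMongoArrow's actual internals): two batches whose inferred integer widths
         differ cannot be concatenated under the strict promotion mode.

         ```python
         import pyarrow as pa

         # Two "batches" whose inferred integer widths differ:
         # the first fits in int32, the second requires int64.
         small = pa.table({"x": pa.array([1, 2], type=pa.int32())})
         large = pa.table({"x": pa.array([2**40], type=pa.int64())})

         # promote_options="default" does not promote int32 -> int64,
         # so concatenation fails with an Arrow error.
         try:
             pa.concat_tables([small, large], promote_options="default")
         except (pa.ArrowInvalid, pa.ArrowTypeError) as exc:
             print("concat failed:", exc)

         # promote_options="permissive" widens int32 -> int64 and succeeds.
         combined = pa.concat_tables([small, large], promote_options="permissive")
         print(combined.schema.field("x").type)  # int64
         ```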

         Root causes:
  1. In `api.py`: Used promote_options="default", which unifies nullability but does not
     promote between integer widths, so the int32 and int64 batch schemas cannot be merged
           2. In `lib.pyx`: When promoting int32→int64 during schema inference, the old builder was
              discarded, losing all previously appended int32 values

         The fix:
           1. Change promote_options to "permissive" so that type promotion is allowed when
              concatenating tables
           2. Preserve existing int32 values by casting them to int64 and re-appending them to the
              new int64 builder
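
         The second part of the fix can be sketched in plain Python (assumed names for
         illustration; the real change is in Cython builder code in `lib.pyx`): instead of
         discarding the values accumulated before the promotion, cast them to the wider type and
         continue appending.

         ```python
         import pyarrow as pa

         # Values appended while the column was still inferred as int32.
         values_int32 = pa.array([1, 2, 3], type=pa.int32())

         # An incoming value that forces promotion to int64.
         incoming = 2**40

         # Cast the already-appended values instead of discarding them,
         # then append the wide value to the promoted column.
         promoted = values_int32.cast(pa.int64())
         column = pa.concat_arrays([promoted, pa.array([incoming], type=pa.int64())])
         print(column.type)  # int64
         ```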

         This ensures parallel and non-parallel code paths produce consistent results when schema
         inference encounters mixed integer sizes.

            Assignee:
            Casey Clements
            Reporter:
            Casey Clements
