Python Integrations / INTPYTHON-230

Improper handling of documents with empty embedded arrays

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Unknown
    • pymongoarrow-1.6
    • Affects Version/s: None
    • Component/s: None
    • None
    • Python Drivers

      1. What would you like to communicate to the user about this feature?
      2. Would you like the user to see examples of the syntax and/or executable code and its output?
      3. Which versions of the driver/connector does this apply to?


      As reported in https://github.com/mongodb-labs/mongo-arrow/issues/208.

      We should be able to handle the following:

      from pymongo import MongoClient
      
      import pymongoarrow.api as pmaapi
      
      import pyarrow.parquet as papq
      import pyarrow.json as pajson
      
      import io
      import bson.json_util  # import the submodule explicitly so bson.json_util.dumps resolves
      
      
      client = MongoClient()
      collection = client.testdb.data
      collection.drop()
      
      collection.insert_many([
          { '_id': 1, 'foo':  { 'bar': ['1','2'] } },
          { '_id': 2, 'foo':  { 'bar': [] } }
      ])
      
      # Fetch the document from MongoDB, serialize it to JSON, read it with pyarrow, and write it to Parquet
      doc1 = client.testdb.data.find_one({'_id': 1})
      string1 = bson.json_util.dumps(doc1, indent=2)
      file1 = io.BytesIO(bytes(string1, encoding='utf-8'))
      papatable1 = pajson.read_json(file1)
      print(str(papatable1))
      papq.write_table(papatable1, 'pyarrow' + str(1) + '.parquet')
      
      # read document with pymongoarrow and write it to parquet
      pmapatable1 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 1}})
      print(str(pmapatable1))
      papq.write_table(pmapatable1, 'pymongoarrow' + str(1) + '.parquet')
      
      
      
      # Repeat both paths for the document whose embedded array is empty
      doc2 = client.testdb.data.find_one({'_id': 2})
      string2 = bson.json_util.dumps(doc2, indent=2)
      file2 = io.BytesIO(bytes(string2, encoding='utf-8'))
      papatable2 = pajson.read_json(file2)
      print(str(papatable2))
      papq.write_table(papatable2, 'pyarrow' + str(2) + '.parquet')
      
      # Read the same document with pymongoarrow; this is the call that raises the TypeError
      pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
      papq.write_table(pmapatable2, 'pymongoarrow' + str(2) + '.parquet')
      

      Instead it produces:

      $ python repro.py
      pyarrow.Table
      _id: int64
      foo: struct<bar: list<item: string>>
        child 0, bar: list<item: string>
            child 0, item: string
      ----
      _id: [[1]]
      foo: [
        -- is_valid: all not null
        -- child 0 type: list<item: string>
      [["1","2"]]]
      pyarrow.Table
      _id: int32
      foo: struct<bar: list<item: string>>
        child 0, bar: list<item: string>
            child 0, item: string
      ----
      _id: [[1]]
      foo: [
        -- is_valid: all not null
        -- child 0 type: list<item: string>
      [["1","2"]]]
      pyarrow.Table
      _id: int64
      foo: struct<bar: list<item: null>>
        child 0, bar: list<item: null>
            child 0, item: null
      ----
      _id: [[2]]
      foo: [
        -- is_valid: all not null
        -- child 0 type: list<item: null>
      [0 nulls]]
      Traceback (most recent call last):
        File "/workspaces/vscode-python/pymongoarrow/repro.py", line 45, in <module>
          pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/vscode/Envs/pma1/lib/python3.11/site-packages/pymongoarrow/api.py", line 112, in find_arrow_all
          process_bson_stream(batch, context)
        File "pymongoarrow/lib.pyx", line 159, in pymongoarrow.lib.process_bson_stream
        File "pymongoarrow/lib.pyx", line 246, in pymongoarrow.lib.process_raw_bson_stream
        File "pymongoarrow/lib.pyx", line 133, in pymongoarrow.lib.extract_document_dtype
        File "pymongoarrow/lib.pyx", line 108, in pymongoarrow.lib.extract_field_dtype
        File "pyarrow/types.pxi", line 4452, in pyarrow.lib.list_
      TypeError: List requires DataType or Field
      

            Assignee: Steve Silvester (steve.silvester@mongodb.com)
            Reporter: Steve Silvester (steve.silvester@mongodb.com)
            Votes: 0
            Watchers: 3
