Python Integrations / INTPYTHON-250

Data Loss in PyMongoArrow when working with large volumes of data

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical - P2
    • Fix Version/s: pymongoarrow-1.6
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Assigned Teams: Python Drivers

      A PyMongoArrow user reported this issue in the forum:

      Hi, I am using pymongoarrow for data-fetch automation in one of my projects and have encountered strange behaviour in how it fetches results. For this project I need 2-3 months of historical data. If I query each month individually, there are no issues with the data, but if the query covers a period of two months or more, a specific data-loss pattern appears at random in some of the results: null values start showing up in one of the collection's nested fields, Item.Id, while all of the other fields (Item.Price, Item.Tax, …) for the same _id are returned without issues. Out of 8 million documents in total, about 100,000 have this problem.

      If I query one of the problematic _ids on its own, the results are returned as normal (no null in Item.Id). Dumping the data in BSON format directly from MongoDB with mongodump does not reproduce the issue either, so I suspect there may be a problem in how the PyMongoArrow library processes BSON. The issue occurs with both the find_arrow_all and aggregate_arrow_all functions. Please let me know if anyone has had a similar issue or knows what might cause this behaviour.
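      To make the failure mode concrete, below is a minimal diagnostic sketch (not part of the original report). It fetches a two-month window with find_arrow_all, counts nulls in the nested Item.Id field, and then re-checks a few affected _ids both one at a time through PyMongoArrow and through plain PyMongo as a ground truth. The connection string, database and collection names, and the "date" query field are assumptions; only Item.Id, the sibling Item.* fields, find_arrow_all, and aggregate_arrow_all come from the report.

        # Diagnostic sketch: hypothetical server/database/collection names and a
        # hypothetical "date" field; the nested Item.Id field is from the report.
        from datetime import datetime

        import pyarrow.compute as pc
        from bson import ObjectId
        from pymongo import MongoClient
        from pymongoarrow.api import find_arrow_all

        client = MongoClient("mongodb://localhost:27017")
        coll = client["mydb"]["mycoll"]  # placeholder names

        # A two-month window, the range where the user reports nulls appearing.
        window = {"date": {"$gte": datetime(2024, 1, 1), "$lt": datetime(2024, 3, 1)}}
        table = find_arrow_all(coll, window)  # schema inferred from the data

        # Count nulls inside the nested Item.Id field of the struct column.
        item_ids = pc.struct_field(table.column("Item"), "Id")
        print(f"{table.num_rows} rows, {item_ids.null_count} null Item.Id values")

        # Re-check a few affected documents individually. Per the report, a
        # single-_id query through PyMongoArrow comes back intact, and plain
        # PyMongo agrees, which points at batch-level BSON processing.
        bad = table.filter(pc.is_null(item_ids)).column("_id").to_pylist()
        for raw in bad[:5]:
            # The _id column may come back as 12 raw bytes rather than ObjectId.
            oid = raw if isinstance(raw, ObjectId) else ObjectId(raw)
            single = find_arrow_all(coll, {"_id": oid})
            arrow_ids = pc.struct_field(single.column("Item"), "Id")
            raw_doc = coll.find_one({"_id": oid}, {"Item.Id": 1})
            print(oid, "arrow null:", arrow_ids.null_count > 0,
                  "| raw Item.Id:", raw_doc["Item"]["Id"] if raw_doc else None)

      As the user already observed, a mongodump of the same window can serve as a further ground-truth baseline for the Arrow output.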


            Assignee: Steve Silvester (steve.silvester@mongodb.com)
            Reporter: Shubham Ranjan (shubham.ranjan@mongodb.com)
            Votes: 0
            Watchers: 4
