pymongoarrow / ARROW-10

Write support for tabular datatypes

    • Type: New Feature
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      Tabular datatypes (e.g. pandas.DataFrame, pyarrow.Table) are slow to iterate over and transform into documents that can be inserted into MongoDB. These datatypes are optimized for columnar operations and are fundamentally ill-suited for high-performance conversion to a sequence of documents that can be inserted/upserted into a MongoDB cluster.

      Furthermore, with a large dataset, write performance suffers not only from the inefficiency of transposing the columnar data, but also from documents not being sent to the server in optimally sized batches (e.g. inserting one document at a time is very slow).

      We can significantly reduce the pain of persisting tabular datasets from Python to a MongoDB cluster by writing a C-extension that:

      • iterates efficiently over the C arrays underlying a table's columns and encodes them directly to BSON
      • batches bulk writes automatically and optimally based on some heuristic to minimize the number of network round-trips needed to store the dataset
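
      The two steps above can be sketched in pure Python (this is an illustration only, not the proposed C-extension; the function names and the fixed batch-size heuristic are hypothetical):

      ```python
      # Sketch of the column-to-document transpose and the batching step
      # that the proposed C-extension would perform natively in C.

      def columns_to_documents(columns):
          """Transpose a dict of equal-length columns into row documents."""
          names = list(columns)
          return [dict(zip(names, row)) for row in zip(*columns.values())]

      def batch_documents(docs, max_batch=1000):
          """Yield batches of at most `max_batch` documents, so each bulk
          write fits in a bounded number of network round-trips."""
          for i in range(0, len(docs), max_batch):
              yield docs[i:i + max_batch]

      # Stand-in for a pyarrow.Table / pandas.DataFrame: two columns, 3 rows.
      table = {"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]}
      docs = columns_to_documents(table)
      # docs == [{"x": 1, "y": 4.0}, {"x": 2, "y": 5.0}, {"x": 3, "y": 6.0}]
      batches = list(batch_documents(docs, max_batch=2))
      # Each batch would be handed to a single bulk write (e.g. insert_many),
      # turning n per-document round-trips into ceil(n / max_batch).
      ```

      A real heuristic would also account for server-side limits on bulk-write size rather than using a fixed document count.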

            Assignee: Unassigned
            Reporter: Prashant Mital (Inactive)
            Votes: 0
            Watchers: 1
