pymongoarrow / ARROW-4

Support for writing tabular data to MongoDB

    • Type: Epic
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 0.4.0
    • Affects Version/s: None
    • Component/s: None
    • Labels:

      Engineer: Julius Park

      2022-04-05: Setting end date to 2022-04-15

      • Julius has continued to make strong progress. He has added basic support for writing to MongoDB from PyArrow and has been refining the solution while running performance tests. With the most recent developments, the performance hit appears to be negligible.

      Engineer: Julius Park

      2022-03-22: Setting end date to 2022-04-01

      • The team has signed off on the design and Julius has begun implementation.
      • Julius has round-trip testing working and he's also implemented row-specific error messages.

      Engineer: Julius Park

      2022-03-08: Start date pending design completion

      • Julius quickly got his scope doc approved and has just as quickly stubbed out a design draft with suggestions for ticket breakdown. The team is deep into reviewing the doc.

      Tabular datatypes (e.g. pandas.DataFrame, pyarrow.Table) are slow to iterate over and transform into documents that can be inserted into MongoDB. These datatypes are optimized for columnar operations and are fundamentally ill-suited for high-performance conversion to a sequence of documents that can be inserted/upserted into a MongoDB cluster.

      Furthermore, with a large dataset, write performance suffers not only from the inefficiency of transposing columnar data. It can also degrade severely if documents are not sent to the server in optimally sized batches (e.g. inserting one document at a time is very slow).

      We can significantly reduce the pain of persisting tabular datasets from Python to a MongoDB cluster by writing a C extension that:

      • iterates efficiently over the C arrays underlying the columns that make up a table and encodes them directly to BSON
      • batches bulk writes automatically and optimally, based on some heuristic, to minimize the number of network round-trips needed to store the dataset
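
      To make the two steps concrete, here is a minimal pure-Python sketch of the idea, not the C extension this ticket proposes. It transposes column-oriented data into row documents and groups them into fixed-size batches. The helper names (iter_documents, iter_batches) and the fixed batch_size are hypothetical, chosen for illustration; the ticket itself only says batching would follow "some heuristic".

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Sequence


def iter_documents(columns: Dict[str, Sequence[Any]]) -> Iterator[Dict[str, Any]]:
    """Transpose column-oriented data into one dict (document) per row.

    Hypothetical helper: `columns` maps a field name to a sequence of values,
    with every sequence assumed to have the same length.
    """
    names = list(columns)
    # zip walks all columns in lockstep and yields one row tuple at a time.
    for row in zip(*(columns[name] for name in names)):
        yield dict(zip(names, row))


def iter_batches(
    documents: Iterable[Dict[str, Any]], batch_size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Group documents into lists of at most `batch_size` items.

    A fixed batch size stands in for the unspecified batching heuristic.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Small in-memory example; a real table would come from pandas or PyArrow.
columns = {"_id": [1, 2, 3, 4, 5], "score": [0.5, 0.7, 0.1, 0.9, 0.3]}
batches = list(iter_batches(iter_documents(columns), batch_size=2))
```

      Each batch could then be sent in a single round-trip, for example with PyMongo's Collection.insert_many, instead of one insert per document. The per-row Python loop in iter_documents is exactly the overhead that iterating over the underlying C arrays is meant to avoid.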

      Note: previously, this ticket tracked the implementation of write support in BSON-NumPy.

            Assignee:
            julius.park@mongodb.com Julius Park (Inactive)
            Reporter:
            rathi.gnanasekaran Rathi Gnanasekaran
            Votes:
            0
            Watchers:
            3

              Created:
              Updated:
              Resolved: