pymongoarrow / ARROW-10

Write support for tabular datatypes

    • Type: New Feature
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      Tabular datatypes (e.g. pandas.DataFrame, pyarrow.Table) are slow to iterate over and transform into documents that can be inserted into MongoDB. These datatypes are optimized for columnar operations and are fundamentally ill-suited for high-performance conversion to a sequence of documents that can be inserted/upserted into a MongoDB cluster.

      Furthermore, with a large dataset, write performance suffers not only from the inefficiency of transposing the columnar data, but also from documents not being sent to the server in optimally sized batches (e.g. inserting one document at a time is very slow).

      We can significantly reduce the pain of persisting tabular datasets from Python to a MongoDB cluster by writing a C-extension that:

      • iterates efficiently over the C arrays underlying a table's columns and encodes them directly to BSON
      • batches bulk writes automatically and optimally based on some heuristic to minimize the number of network round-trips needed to store the dataset
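
      The two steps above can be sketched in pure Python (this is an illustration only, not the proposed C-extension; the function names and the fixed batch-size heuristic are hypothetical):

      ```python
      # Sketch of the column-to-document transpose and the batching step
      # that the proposed C-extension would perform natively in C.

      def columns_to_documents(columns):
          """Transpose a dict of equal-length columns into row documents."""
          names = list(columns)
          return [dict(zip(names, row)) for row in zip(*columns.values())]

      def batch_documents(docs, max_batch=1000):
          """Yield batches of at most `max_batch` documents, so each bulk
          write fits in a bounded number of network round-trips."""
          for i in range(0, len(docs), max_batch):
              yield docs[i:i + max_batch]

      # Stand-in for a pyarrow.Table / pandas.DataFrame: two columns, 3 rows.
      table = {"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]}
      docs = columns_to_documents(table)
      # docs == [{"x": 1, "y": 4.0}, {"x": 2, "y": 5.0}, {"x": 3, "y": 6.0}]
      batches = list(batch_documents(docs, max_batch=2))
      # Each batch would be handed to a single bulk write (e.g. insert_many),
      # turning n per-document round-trips into ceil(n / max_batch).
      ```

      A real heuristic would also account for server-side limits on bulk-write size rather than using a fixed document count.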

            Assignee: Unassigned
            Reporter: Prashant Mital (Inactive)
            Votes: 0
            Watchers: 1
