-
Type: Bug
-
Resolution: Fixed
-
Priority: Unknown
-
Affects Version/s: None
-
Component/s: None
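After the changes in PYTHON-1860, pymongoarrow is no faster, and in places slower, than the naive pd.DataFrame(list(coll.find())) approach. On small documents, pymongoarrow-to-pandas now takes 0.39 versus 0.36 for conventional-to-pandas, where it previously took 0.11 versus 0.37.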
Before PYTHON-1860:
$ python benchmark.py # With pymongo 3.11 (pre PYTHON-1860)
100000 small docs, 40 bytes each with 3 keys
1000 large docs, 153k each with 2600 keys
BENCH:                    SMALL  LARGE
conventional-to-ndarray:   0.29   1.34
pymongoarrow-to-numpy:     0.11   1.31
conventional-to-pandas:    0.37   2.01
pymongoarrow-to-pandas:    0.11   1.28
pymongoarrow-to-arrow:     0.11   1.29
$ pip list
Package         Version    Editable project location
--------------- ---------- --------------------------------------------
Cython          0.29.22
numpy           1.20.1
pandas          1.2.3
pip             22.0.4
pyarrow         7.0.0
pymongo         3.11.4
pymongoarrow    0.4.0.dev0 /Users/shane/git/mongo-arrow/bindings/python
python-dateutil 2.8.1
pytz            2021.1
pyupgrade       2.13.0
setuptools      53.0.0
six             1.15.0
tokenize-rt     4.1.0
wheel           0.37.0
After PYTHON-1860:
$ pip install --upgrade 'pymongo<4'
...
$ python benchmark.py # With pymongo 3.12 (post PYTHON-1860)
100000 small docs, 40 bytes each with 3 keys
1000 large docs, 153k each with 2600 keys
BENCH:                    SMALL  LARGE
conventional-to-ndarray:   0.29   1.29
pymongoarrow-to-numpy:     0.30   1.76
conventional-to-pandas:    0.36   2.29
pymongoarrow-to-pandas:    0.39   2.11
pymongoarrow-to-arrow:     0.31   1.76
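To make the size of the regression easier to read, the two runs above can be turned into per-benchmark ratios. This is only a sketch: the numbers are copied from the output in this ticket, they are assumed to be timings where lower is better, and the `slowdown` helper is a hypothetical name, not part of benchmark.py.

```python
# Timings copied from the two benchmark runs in this ticket: (SMALL, LARGE).
# Assumption: values are elapsed times, so lower is better.
before = {
    "conventional-to-ndarray": (0.29, 1.34),
    "pymongoarrow-to-numpy": (0.11, 1.31),
    "conventional-to-pandas": (0.37, 2.01),
    "pymongoarrow-to-pandas": (0.11, 1.28),
    "pymongoarrow-to-arrow": (0.11, 1.29),
}
after = {
    "conventional-to-ndarray": (0.29, 1.29),
    "pymongoarrow-to-numpy": (0.30, 1.76),
    "conventional-to-pandas": (0.36, 2.29),
    "pymongoarrow-to-pandas": (0.39, 2.11),
    "pymongoarrow-to-arrow": (0.31, 1.76),
}


def slowdown(before, after):
    """Return {bench: (small_ratio, large_ratio)} with ratio = after / before."""
    return {
        name: tuple(round(a / b, 2) for a, b in zip(after[name], before[name]))
        for name in before
    }


ratios = slowdown(before, after)
for name, (small, large) in ratios.items():
    # A ratio above 1.0 means the benchmark got slower after the upgrade.
    print(f"{name:<26}{small:>6.2f}x{large:>6.2f}x")
```

On these numbers the conventional paths stay within roughly 15% of their earlier timings, while the three pymongoarrow paths are about 2.7x to 3.5x slower on small docs and about 1.3x to 1.6x slower on large docs.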
One way to fix this would be for the server to finally implement OP_MSG Payload Type 1 stream responses.
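For context, a Payload Type 1 (document sequence) section in OP_MSG is laid out as a kind byte of 1, an int32 size, a C-string identifier, and then zero or more BSON documents back to back. The sketch below shows why that layout would presumably suit raw-batch consumers: the documents can be sliced out by length prefix without decoding an enclosing reply document. The helper names are hypothetical, and the "documents" identifier is only an illustrative assumption, since the server does not currently send such replies.

```python
import struct
from typing import List, Tuple


def build_type1_section(identifier: str, raw_docs: List[bytes]) -> bytes:
    """Encode a kind-1 (document sequence) section, for illustration only."""
    body = identifier.encode("utf-8") + b"\x00" + b"".join(raw_docs)
    size = 4 + len(body)  # the size field counts itself but not the kind byte
    return b"\x01" + struct.pack("<i", size) + body


def split_type1_section(section: bytes) -> Tuple[str, List[bytes]]:
    """Split a kind-1 section into its identifier and raw BSON documents."""
    if section[0] != 1:
        raise ValueError("not a kind-1 section")
    (size,) = struct.unpack_from("<i", section, 1)
    end = 1 + size
    nul = section.index(b"\x00", 5)  # identifier starts after kind + size
    identifier = section[5:nul].decode("utf-8")
    docs, pos = [], nul + 1
    while pos < end:
        # Every BSON document starts with its own int32 total length.
        (doc_len,) = struct.unpack_from("<i", section, pos)
        if doc_len < 5:
            raise ValueError("invalid BSON document length")
        docs.append(section[pos:pos + doc_len])
        pos += doc_len
    return identifier, docs


# Smallest valid BSON document: int32 length 5 plus the NUL terminator.
EMPTY_DOC = b"\x05\x00\x00\x00\x00"
# {"a": 1}: length 12, type 0x10 (int32), key "a", value 1, terminator.
DOC_A = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"

section = build_type1_section("documents", [EMPTY_DOC, DOC_A])
identifier, docs = split_type1_section(section)
print(identifier, len(docs))
```

The slices returned by `split_type1_section` are still undecoded BSON, which is the form a raw-batch API could hand on to a consumer such as pymongoarrow.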
- depends on: PYTHON-2722 Improve performance of find/aggregate_raw_batches (Closed)
- is caused by: PYTHON-1860 Use OP_MSG not OP_GET_MORE in find_raw_batches and aggregate_raw_batches (Closed)
- is depended on by: INTPYTHON-101 Use pymongoarrow in dask-mongo (Backlog)