- Type: Bug
- Resolution: Fixed
- Priority: Unknown
- Affects Version/s: None
- Component/s: None
After the changes in PYTHON-1860, pymongoarrow is slower than the naive pd.DataFrame(list(coll.find())) approach in most of the benchmarks below, and it has lost the clear advantage it had before.
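For readers without the benchmark script, the two code paths being compared can be sketched roughly as follows. This is not the benchmark.py used for the numbers in this ticket. The helper names, the database and collection names (test.bench), the field names x and y, and the connection URI are all assumptions made for illustration; only pd.DataFrame(list(coll.find())) and pymongoarrow's find_pandas_all and Schema come from the libraries themselves.

```python
import time


def best_of(fn, repeat=3):
    """Return the fastest wall-clock time, in seconds, over `repeat` calls to fn()."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def conventional_to_pandas(coll):
    # The "naive" path: decode every document to a dict, then build the frame.
    import pandas as pd

    return pd.DataFrame(list(coll.find({})))


def pymongoarrow_to_pandas(coll, schema):
    # The pymongoarrow path: the result set is fetched as raw BSON batches
    # and converted to Arrow columns before becoming a DataFrame.
    from pymongoarrow.api import find_pandas_all

    return find_pandas_all(coll, {}, schema=schema)


def run(uri="mongodb://localhost:27017"):
    # Assumed for illustration: database "test", collection "bench",
    # documents with an integer field "x" and a float field "y".
    from pymongo import MongoClient
    from pymongoarrow.api import Schema

    coll = MongoClient(uri).test.bench
    schema = Schema({"x": int, "y": float})
    conventional = best_of(lambda: conventional_to_pandas(coll))
    arrow = best_of(lambda: pymongoarrow_to_pandas(coll, schema))
    print("conventional-to-pandas: %.2f" % conventional)
    print("pymongoarrow-to-pandas: %.2f" % arrow)
```

Calling run() requires a reachable MongoDB server with a populated collection; the third-party imports are kept inside the functions so the timing helper can be used on its own.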
Before PYTHON-1860:
$ python benchmark.py # With pymongo 3.11 (pre PYTHON-1860)
100000 small docs, 40 bytes each with 3 keys
1000 large docs, 153k each with 2600 keys
BENCH:                    SMALL  LARGE
conventional-to-ndarray:  0.29   1.34
pymongoarrow-to-numpy:    0.11   1.31
conventional-to-pandas:   0.37   2.01
pymongoarrow-to-pandas:   0.11   1.28
pymongoarrow-to-arrow:    0.11   1.29
$ pip list
Package         Version    Editable project location
--------------- ---------- --------------------------------------------
Cython          0.29.22
numpy           1.20.1
pandas          1.2.3
pip             22.0.4
pyarrow         7.0.0
pymongo         3.11.4
pymongoarrow    0.4.0.dev0 /Users/shane/git/mongo-arrow/bindings/python
python-dateutil 2.8.1
pytz            2021.1
pyupgrade       2.13.0
setuptools      53.0.0
six             1.15.0
tokenize-rt     4.1.0
wheel           0.37.0
After PYTHON-1860:
$ pip install --upgrade 'pymongo<4'
...
$ python benchmark.py # With pymongo 3.12 (post PYTHON-1860)
100000 small docs, 40 bytes each with 3 keys
1000 large docs, 153k each with 2600 keys
BENCH:                    SMALL  LARGE
conventional-to-ndarray:  0.29   1.29
pymongoarrow-to-numpy:    0.30   1.76
conventional-to-pandas:   0.36   2.29
pymongoarrow-to-pandas:   0.39   2.11
pymongoarrow-to-arrow:    0.31   1.76
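To make the size of the regression easier to read, the two runs above can be turned into after/before ratios. The numbers are copied from the benchmark output in this ticket; the variable and function names are only illustrative.

```python
# (SMALL, LARGE) timings copied from the two benchmark runs in this ticket.
before = {
    "conventional-to-ndarray": (0.29, 1.34),
    "pymongoarrow-to-numpy": (0.11, 1.31),
    "conventional-to-pandas": (0.37, 2.01),
    "pymongoarrow-to-pandas": (0.11, 1.28),
    "pymongoarrow-to-arrow": (0.11, 1.29),
}
after = {
    "conventional-to-ndarray": (0.29, 1.29),
    "pymongoarrow-to-numpy": (0.30, 1.76),
    "conventional-to-pandas": (0.36, 2.29),
    "pymongoarrow-to-pandas": (0.39, 2.11),
    "pymongoarrow-to-arrow": (0.31, 1.76),
}


def slowdown(before, after):
    """Return after/before ratios per benchmark, rounded to 2 decimals."""
    return {
        name: tuple(round(a / b, 2) for b, a in zip(before[name], after[name]))
        for name in before
    }


ratios = slowdown(before, after)
for name, (small, large) in ratios.items():
    print(f"{name:<25} small x{small:.2f}  large x{large:.2f}")
```

The conventional paths stay within roughly 15% of their earlier timings, while the pymongoarrow paths take about 2.7x to 3.5x as long on small documents and about 1.3x to 1.6x as long on large ones.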
One way to fix this would be for the server to finally implement OP_MSG Payload Type 1 stream responses.
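As background on why Payload Type 1 would help: in the OP_MSG wire format, a type 1 section is a document sequence, that is, a size, a C-string identifier, and then BSON documents laid out back to back. A client could slice such a section into raw documents using only each document's length prefix, without walking a BSON array nested inside the reply body. The sketch below is a hand-rolled illustration of that layout, not pymongo code; the function names and the identifier "nextBatch" are hypothetical, and the server does not currently send replies in this form.

```python
import struct

# Two hand-encoded BSON documents: {} and {"a": 1} (int32 value).
EMPTY_DOC = b"\x05\x00\x00\x00\x00"
DOC_A1 = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"


def encode_type1_section(identifier, raw_docs):
    """Build a type 1 section: kind byte, int32 size, cstring identifier, docs."""
    payload = identifier.encode("utf-8") + b"\x00" + b"".join(raw_docs)
    size = 4 + len(payload)  # the size field counts itself but not the kind byte
    return b"\x01" + struct.pack("<i", size) + payload


def split_type1_section(section):
    """Return (identifier, [raw BSON bytes, ...]) without decoding any document."""
    if section[0] != 1:
        raise ValueError("not a payload type 1 section")
    (size,) = struct.unpack_from("<i", section, 1)
    end = 1 + size
    nul = section.index(b"\x00", 5)  # terminator of the identifier cstring
    identifier = section[5:nul].decode("utf-8")
    docs = []
    pos = nul + 1
    while pos < end:
        # Every BSON document starts with its own total length as a
        # little-endian int32, so the batch can be sliced directly.
        (doc_len,) = struct.unpack_from("<i", section, pos)
        docs.append(section[pos:pos + doc_len])
        pos += doc_len
    return identifier, docs


section = encode_type1_section("nextBatch", [EMPTY_DOC, DOC_A1, EMPTY_DOC])
identifier, docs = split_type1_section(section)
```

The raw slices could then be handed to a BSON-to-Arrow converter unchanged, which is the kind of input find_raw_batches and aggregate_raw_batches are meant to provide.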
- depends on: PYTHON-2722 Improve performance of find/aggregate_raw_batches (Closed)
- is caused by: PYTHON-1860 Use OP_MSG not OP_GET_MORE in find_raw_batches and aggregate_raw_batches (Closed)
- is depended on by: ARROW-101 Use pymongoarrow in dask-mongo (Backlog)