Python Driver / PYTHON-1513

PyMongo inefficiently reads large messages off the network


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.7
    • Component/s: Internal
    • Environment:
      Windows 7 x64
      Cygwin 32
      Python 3.6.4
      pymongo 3.6.1
    • Backwards Compatibility:
      Fully Compatible

      Description

      PyMongo inefficiently reads and assembles large messages off the network in network._receive_data_on_socket:

      network.py

      166  def _receive_data_on_socket(sock, length):
      167      msg = b""
      168      while length:
      169          try:
      170              chunk = sock.recv(length)
      171          except (IOError, OSError) as exc:
      172              if _errno_from_exception(exc) == errno.EINTR:
      173                  continue
      174              raise
      175          if chunk == b"":
      176              raise AutoReconnect("connection closed")
      177
      178          length -= len(chunk)
      179          msg += chunk
      180
      181      return msg
      

      The biggest problem is on line 179, where each recv'd chunk is appended to the full message using bytes +=. This is relatively efficient only on CPython 2, where str has an optimized in-place +=. That optimization is not present on PyPy 2 or Python 3, so performance on Python 3 (and PyPy 2) suffers when assembling large messages, for example when reading large batches of documents:

      Comparison of Python 2.7 and 3.6 when reading large messages with PyMongo 3.6.1

      $ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 428 msec per loop
      $ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 1.32 sec per loop
      
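      The quadratic copying behind bytes += can be reproduced in isolation. The following standalone sketch (not PyMongo code; the chunk size and count are arbitrary) assembles the same message both ways:

      ```python
      import timeit

      CHUNK = b"x" * 16384   # pretend each recv() returns a 16 KiB chunk
      N_CHUNKS = 512         # an 8 MiB message in total

      def assemble_bytes():
          # bytes is immutable: each += allocates a new object and copies
          # everything accumulated so far, giving O(n^2) total work unless
          # the interpreter special-cases it (as CPython 2 did for str).
          msg = b""
          for _ in range(N_CHUNKS):
              msg += CHUNK
          return msg

      def assemble_bytearray():
          # bytearray is mutable and over-allocates, so appending amortizes
          # to O(n) total work across the whole message.
          msg = bytearray()
          for _ in range(N_CHUNKS):
              msg += CHUNK
          return bytes(msg)

      print("bytes +=     :", timeit.timeit(assemble_bytes, number=5))
      print("bytearray += :", timeit.timeit(assemble_bytearray, number=5))
      ```

      Both produce identical output; only the amount of intermediate copying differs.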

      The slowdown becomes worse when recv'ing many small chunks of data to assemble a large message. One can simulate this by configuring a small TCP recv buffer:

      $ sudo sysctl net.inet.tcp.recvspace=16384
      net.inet.tcp.recvspace: 131072 -> 16384
      $ sudo sysctl net.inet.tcp.autorcvbufmax=262144
      net.inet.tcp.autorcvbufmax: 1048576 -> 262144
      $ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 416 msec per loop
      $ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 3.48 sec per loop
      

      The fix is to preallocate a bytearray and copy each chunk into it using slice assignment. On Python 3 we can do even better by passing a memoryview of the bytearray to socket.recv_into. Here is the same benchmark with these improvements:

      $ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 384 msec per loop
      $ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 194 msec per loop
      

      Much better!
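      A minimal sketch of the preallocate-plus-recv_into approach described above (illustrative only: the helper name mirrors the snippet, but the error handling and exception types are simplified relative to PyMongo's actual code):

      ```python
      import errno
      import socket

      def receive_data_on_socket(sock, length):
          # Preallocate the whole buffer once; recv_into then writes each
          # chunk directly into it, avoiding intermediate bytes objects.
          buf = bytearray(length)
          view = memoryview(buf)   # zero-copy, sliceable window into buf
          bytes_read = 0
          while bytes_read < length:
              try:
                  chunk_length = sock.recv_into(view[bytes_read:])
              except OSError as exc:
                  if exc.errno == errno.EINTR:
                      continue  # interrupted system call; retry the recv
                  raise
              if chunk_length == 0:
                  raise OSError("connection closed")
              bytes_read += chunk_length
          return bytes(buf)
      ```

      Each iteration writes into the next unread slice of the buffer, so the total copying is O(length) regardless of how many small chunks the kernel delivers.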

      ** Original description **

      Cursor hangs after reading 101 documents from MongoDB.

      Cursor gets stuck for a few seconds after reading 101 documents.
      This happens again after reading 5776 and 11451 documents.

      My collection has 30,000 documents, each with 90 fields (just attribute:value pairs, only strings).

      This problem only occurs in Python, using pymongo.
      It does not occur when using the mongo shell.

      #!/usr/bin/python3

      from pymongo import MongoClient

      entriestoprocess = 15000
      rowCount = 1

      mongo_client = MongoClient()
      db = mongo_client.test2

      cursor = db.wr.find()

      for i in cursor:
          if rowCount > entriestoprocess:
              print("Finished after {} entries".format(rowCount))
              break
          print('----------------------------------------')
          print("Processing entry: {}".format(rowCount))
          rowCount += 1
      
      

        Attachments

        1. benchmark-cursor-batch.py
          2 kB
        2. benchmark-decoding.py
          3 kB
