Python Driver / PYTHON-1513

PyMongo inefficiently reads large messages off the network


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.7
    • Component/s: Internal
    • Environment:
      Windows 7 x64
      Cygwin 32
      Python 3.6.4
      pymongo 3.6.1
    • Backwards Compatibility:
      Fully Compatible

      Description

      PyMongo inefficiently reads and assembles large messages off the network in network._receive_data_on_socket:

      network.py

      166  def _receive_data_on_socket(sock, length):
      167      msg = b""
      168      while length:
      169          try:
      170              chunk = sock.recv(length)
      171          except (IOError, OSError) as exc:
      172              if _errno_from_exception(exc) == errno.EINTR:
      173                  continue
      174              raise
      175          if chunk == b"":
      176              raise AutoReconnect("connection closed")
      177
      178          length -= len(chunk)
      179          msg += chunk
      180
      181      return msg
      

      The biggest problem is on line 179, where each recv'd chunk is appended to the full message using bytes +=. This is relatively efficient only on CPython 2, where str has an optimized in-place +=. That optimization is not present on PyPy 2 or Python 3, so performance on Python 3 (and PyPy 2) suffers when assembling large messages, for example when reading large batches of documents:

      Comparison of Python 2.7 and 3.6 when reading large messages with PyMongo 3.6.1

      $ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 428 msec per loop
      $ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 1.32 sec per loop
      
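      The quadratic copying behind bytes += can be reproduced in isolation. The following standalone sketch (not PyMongo code; the chunk size and count are arbitrary) assembles the same message both ways:

      ```python
      import timeit

      CHUNK = b"x" * 16384   # pretend each recv() returns a 16 KiB chunk
      N_CHUNKS = 512         # an 8 MiB message in total

      def assemble_bytes():
          # bytes is immutable: each += allocates a new object and copies
          # everything accumulated so far, giving O(n^2) total work unless
          # the interpreter special-cases it (as CPython 2 did for str).
          msg = b""
          for _ in range(N_CHUNKS):
              msg += CHUNK
          return msg

      def assemble_bytearray():
          # bytearray is mutable and over-allocates, so appending amortizes
          # to O(n) total work across the whole message.
          msg = bytearray()
          for _ in range(N_CHUNKS):
              msg += CHUNK
          return bytes(msg)

      print("bytes +=     :", timeit.timeit(assemble_bytes, number=5))
      print("bytearray += :", timeit.timeit(assemble_bytearray, number=5))
      ```

      Both produce identical output; only the amount of intermediate copying differs.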

      The slowdown becomes worse when recv'ing many small chunks of data to assemble a large message. One can simulate this by configuring a small TCP recv buffer:

      $ sudo sysctl net.inet.tcp.recvspace=16384
      net.inet.tcp.recvspace: 131072 -> 16384
      $ sudo sysctl net.inet.tcp.autorcvbufmax=262144
      net.inet.tcp.autorcvbufmax: 1048576 -> 262144
      $ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 416 msec per loop
      $ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 3.48 sec per loop
      

      The fix is to preallocate a bytearray and copy each chunk into it using slice assignment. On Python 3 we can do even better by passing a memoryview of the bytearray to socket.recv_into. Here is the same benchmark with these improvements:

      $ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 384 msec per loop
      $ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 194 msec per loop
      

      Much better!
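      A minimal sketch of the preallocate-plus-recv_into approach described above (illustrative only: the helper name mirrors the snippet, but the error handling and exception types are simplified relative to PyMongo's actual code):

      ```python
      import errno
      import socket

      def receive_data_on_socket(sock, length):
          # Preallocate the whole buffer once; recv_into then writes each
          # chunk directly into it, avoiding intermediate bytes objects.
          buf = bytearray(length)
          view = memoryview(buf)   # zero-copy, sliceable window into buf
          bytes_read = 0
          while bytes_read < length:
              try:
                  chunk_length = sock.recv_into(view[bytes_read:])
              except OSError as exc:
                  if exc.errno == errno.EINTR:
                      continue  # interrupted system call; retry the recv
                  raise
              if chunk_length == 0:
                  raise OSError("connection closed")
              bytes_read += chunk_length
          return bytes(buf)
      ```

      Each iteration writes into the next unread slice of the buffer, so the total copying is O(length) regardless of how many small chunks the kernel delivers.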

      ** Original description **

      Cursor hangs after reading 101 documents from MongoDB.

      Cursor gets stuck for a few seconds after reading 101 documents.
      This happens again after reading 5776 and 11451 documents.

      My collection has 30,000 documents, each with 90 fields (just attribute:value pairs, only strings).

      This problem only occurs in Python, using pymongo.
      It does not occur when using the mongo shell.

      #!/usr/bin/python3

      from pymongo import MongoClient

      entriestoprocess = 15000
      rowCount = 1

      mongo_client = MongoClient()
      db = mongo_client.test2

      cursor = db.wr.find()

      for i in cursor:
          if rowCount > entriestoprocess:
              print("Finished after {} entries".format(rowCount))
              break
          print('----------------------------------------')
          print("Processing entry: {}".format(rowCount))
          rowCount += 1
      
      

        Attachments

        1. benchmark-cursor-batch.py
          2 kB
        2. benchmark-decoding.py
          3 kB
