Python Driver / PYTHON-1513

PyMongo inefficiently reads large messages off the network

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 3.7
    • Affects Version/s: None
    • Component/s: Internal
    • Environment:
      Windows 7 x64
      Cygwin 32
      Python 3.6.4
      pymongo 3.6.1
    • Backwards Compatibility: Fully Compatible

      PyMongo inefficiently reads and assembles large messages off the network in network._receive_data_on_socket:

      network.py
      def _receive_data_on_socket(sock, length):
          msg = b""
          while length:
              try:
                  chunk = sock.recv(length)
              except (IOError, OSError) as exc:
                  if _errno_from_exception(exc) == errno.EINTR:
                      continue
                  raise
              if chunk == b"":
                  raise AutoReconnect("connection closed")
      
              length -= len(chunk)
              msg += chunk
      
          return msg
      

      The biggest problem is the last statement of the loop, msg += chunk (line 179 of network.py), where each chunk of recv'd data is appended to the full message using bytes +=. This is relatively efficient only on CPython 2, because str has an optimized version of +=. The optimization is not present on PyPy 2 or Python 3, where each += copies the whole message accumulated so far, so performance suffers when assembling large messages. For example, when reading large batches of documents:

      Comparison of Python 2.7 and 3.6 when reading large messages with PyMongo 3.6.1
      $ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 428 msec per loop
      $ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 1.32 sec per loop
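
The cost of repeated += can be reproduced without a server. The following is a minimal sketch, not PyMongo code: assemble_concat and assemble_preallocated are hypothetical helpers that build one message from in-memory chunks, first with bytes += and then by slice assignment into a preallocated bytearray. The chunk size and count are arbitrary assumptions, and timings depend on the interpreter and platform, so no numbers are claimed here.

```python
import timeit


def assemble_concat(chunks):
    """Build the message with bytes +=, as the code quoted above does."""
    msg = b""
    for chunk in chunks:
        msg += chunk  # on Python 3 this copies everything accumulated so far
    return msg


def assemble_preallocated(chunks, length):
    """Build the message by copying each chunk into a preallocated buffer."""
    buf = bytearray(length)
    pos = 0
    for chunk in chunks:
        # Same-size slice assignment: the buffer is never resized.
        buf[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    return bytes(buf)


if __name__ == "__main__":
    chunk_size = 16384                   # assumed size, mimics a small recv buffer
    chunks = [b"x" * chunk_size] * 256   # 4 MiB in total
    total = chunk_size * len(chunks)
    candidates = (
        ("bytes +=", lambda: assemble_concat(chunks)),
        ("preallocated bytearray", lambda: assemble_preallocated(chunks, total)),
    )
    for name, func in candidates:
        best = min(timeit.repeat(func, number=1, repeat=3))
        print("{}: {:.3f} sec".format(name, best))
```

Running the script under different interpreters (for example python2.7 and python3.6) gives a server-independent way to compare the two strategies.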
      

      The slowdown becomes worse when recv'ing many small chunks of data to assemble a large message. One can simulate this by configuring a small TCP recv buffer:

      $ sudo sysctl net.inet.tcp.recvspace=16384
      net.inet.tcp.recvspace: 131072 -> 16384
      $ sudo sysctl net.inet.tcp.autorcvbufmax=262144
      net.inet.tcp.autorcvbufmax: 1048576 -> 262144
      $ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 416 msec per loop
      $ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 3.48 sec per loop
      

      The fix is to preallocate a bytearray and copy each chunk into it using slice assignment. On Python 3 we can do even better by passing a memoryview of the bytearray to socket.recv_into, which writes the received data directly into the buffer. Here is the same benchmark with these improvements:

      $ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 384 msec per loop
      $ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
      10 loops, best of 3: 194 msec per loop
      

      Much better!
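
The recv_into approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual PyMongo patch: receive_data_on_socket is a hypothetical stand-alone function, ConnectionError stands in for the driver's AutoReconnect, and the sketch targets Python 3.5+, where an interrupted system call (EINTR) is retried automatically, so the explicit errno check is omitted.

```python
import socket
import threading


def receive_data_on_socket(sock, length):
    """Read exactly `length` bytes from `sock` into one preallocated buffer."""
    buf = bytearray(length)
    view = memoryview(buf)  # slicing a memoryview does not copy the data
    bytes_read = 0
    while bytes_read < length:
        # recv_into writes straight into the buffer and returns the byte count.
        chunk_length = sock.recv_into(view[bytes_read:])
        if chunk_length == 0:
            # Stand-in for the driver's AutoReconnect("connection closed").
            raise ConnectionError("connection closed")
        bytes_read += chunk_length
    return buf  # a bytearray; call bytes(buf) if an immutable copy is needed


if __name__ == "__main__":
    # Demonstrate with a local socket pair instead of a MongoDB connection.
    left, right = socket.socketpair()
    payload = b"x" * (4 * 1024 * 1024)
    sender = threading.Thread(target=left.sendall, args=(payload,))
    sender.start()
    data = receive_data_on_socket(right, len(payload))
    sender.join()
    left.close()
    right.close()
    print(len(data) == len(payload))
```

Each chunk is written once, directly at its final position, so the work stays proportional to the message size regardless of how many chunks arrive. On Python 2 the same buffer could instead be filled with sock.recv plus slice assignment, as described above.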

      ** Original description **

      Cursor hangs after reading 101 documents from MongoDB.

      Cursor gets stuck for a few seconds after reading 101 documents.
      This happens again after reading 5776 and 11451 documents.

      My collection has 30,000 documents, each with 90 fields (just attribute:value pairs, strings only).

      This problem only occurs in Python, using pymongo.
      It does not occur when using the mongo shell.

      #!/usr/bin/python3
      
      from pymongo import MongoClient
      
      entriestoprocess = 15000
      rowCount=1
      
      mongo_client=MongoClient()
      db=mongo_client.test2
      
      cursor = db.wr.find()
      
      for i in cursor:
        if (rowCount > entriestoprocess):
          print("Finished after {} entries".format(rowCount))
          break
        print('----------------------------------------')
        print("Processing entry: {}".format(rowCount))
        rowCount += 1
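
The stalls reported here are consistent with batch boundaries: iteration is fast while documents come from the batch already in memory and slow while the next batch is read off the network (by default, the first batch of a find holds 101 documents). A minimal sketch for locating such stalls follows. The helper find_stalls and its threshold are hypothetical and not part of pymongo; it works on any iterable.

```python
import time


def find_stalls(iterable, threshold=0.5):
    """Return (position, seconds) pairs for items that took longer than
    `threshold` seconds to arrive. Positions are 1-based."""
    stalls = []
    last = time.perf_counter()
    for position, _ in enumerate(iterable, start=1):
        now = time.perf_counter()
        if now - last > threshold:
            stalls.append((position, now - last))
        last = now
    return stalls
```

With the script above, find_stalls(db.wr.find()) would be expected to report the positions just after each batch boundary, for example the 102nd document, though the exact positions depend on document size and server settings.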
      
      

        1. benchmark-cursor-batch.py
          2 kB
        2. benchmark-decoding.py
          3 kB

            Assignee:
            shane.harvey@mongodb.com Shane Harvey
            Reporter:
            igel1 Andreas S.
            Votes:
            0
            Watchers:
            3
