Python Driver / PYTHON-3508

GridOut.readline and GridOut.read are slow with large chunk sizes

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Unknown
    • Fix Version/s: 4.3.3
    • Affects Version/s: None
    • Component/s: GridFS
    • Labels: None

      Discovered in PYTHON-3502. GridOut.readline and GridOut.read are slow with large chunk sizes. I believe the problem is that both of these methods copy the remainder of the chunk on every call:

              # Return 'size' bytes and store the rest.
              data.seek(size)
              self.__buffer = data.read()
              data.seek(0)
              return data.read(size)
      

      So if only 100 bytes are going to be returned from a 255KB chunk, these methods load the 255KB, copy 100 bytes as the return value, and also copy the remaining 255KB - 100 bytes into self.__buffer. Shuffling the data around like this is extremely wasteful. We can calculate the wasted data copies for each chunk like this:
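      One possible fix is to keep the whole chunk cached and track a read position into it, so each call copies only the bytes it returns. This is a minimal sketch of that idea, not the actual change that landed for this ticket; ChunkBuffer and its method names are hypothetical:

      ```python
      class ChunkBuffer:
          """Cache one chunk and serve reads from an offset, avoiding
          the repeated copy of the chunk's remainder."""

          def __init__(self) -> None:
              self._chunk = b""
              self._pos = 0

          def fill(self, chunk: bytes) -> None:
              # Store the chunk once; no slicing or re-copying here.
              self._chunk = chunk
              self._pos = 0

          def read(self, size: int) -> bytes:
              # Copy only the `size` bytes being returned; the rest of
              # the chunk stays in place until the next call.
              data = self._chunk[self._pos:self._pos + size]
              self._pos += len(data)
              return data

      buf = ChunkBuffer()
      buf.fill(b"x" * (255 * 1024))
      first = buf.read(100)   # copies 100 bytes, not 255KB - 100 bytes
      ```

      With this scheme, reading a 255KB chunk 100 bytes at a time copies 255KB in total rather than hundreds of megabytes.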

      def shuffled_bytes(chunk_size, read_size):
          t = 0
          size = chunk_size
          while size > 0:
              size = max(size - read_size, 0)
              t += size
          return t
      
      >>> shuffled_bytes(255*1024, 100) / (1024*1024)
      325.00049591064453 
      

      The default chunk size with a read size of 100 bytes will result in a total of 325MB of shuffled memory when reading a single 255KB file!
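      Because the leftover sizes shrink by read_size each iteration, they form an arithmetic series, so the total is roughly chunk_size**2 / (2 * read_size). A quick sanity check against the loop above (shuffled_bytes repeated here so the snippet is self-contained):

      ```python
      def shuffled_bytes(chunk_size, read_size):
          t = 0
          size = chunk_size
          while size > 0:
              size = max(size - read_size, 0)
              t += size
          return t

      chunk_size, read_size = 255 * 1024, 100
      exact = shuffled_bytes(chunk_size, read_size)
      # Arithmetic-series approximation of the total shuffled bytes.
      approx = chunk_size ** 2 / (2 * read_size)
      # The two agree to well under 0.1% for these inputs.
      ```

      The quadratic dependence on chunk_size is why the slowdown only became obvious with large chunks.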

      I believe read and readline have always had this problem.

            Assignee:
            shane.harvey@mongodb.com Shane Harvey
            Reporter:
            shane.harvey@mongodb.com Shane Harvey
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: