Discovered in PYTHON-3502. GridOut.readline and GridOut.read are slow with large chunk sizes. I believe the problem is that both of these methods copy the remainder of the chunk on every call:
# Return 'size' bytes and store the rest.
data.seek(size)
self.__buffer = data.read()
data.seek(0)
return data.read(size)
So if only 100 bytes are requested from a 255KB chunk, these methods load the full 255KB, copy 100 bytes as the return value, and then copy the remaining 255KB - 100 bytes into self.__buffer. The next call repeats the process on the slightly smaller remainder. Shuffling the data around like this is extremely wasteful. We can calculate the wasted data copies for each chunk like this:
def shuffled_bytes(chunk_size, read_size):
    t = 0
    size = chunk_size
    while size > 0:
        size = max(size - read_size, 0)
        t += size
    return t

>>> shuffled_bytes(255*1024, 100) / (1024*1024)
325.00049591064453
With the default chunk size, a read size of 100 bytes results in a total of 325MB of shuffled memory when reading a single 255KB file!
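To make the growth rate explicit, here is a rough closed-form version of the same count. The function name shuffled_bytes_closed_form is mine and purely illustrative. It assumes every read asks for exactly read_size bytes (read_size >= 1) and sums the remainder left over after each read that does not exhaust the chunk.

```python
def shuffled_bytes_closed_form(chunk_size, read_size):
    """Bytes copied into the leftover buffer while draining one chunk.

    Hypothetical helper for illustration only; assumes read_size >= 1.
    """
    # Number of reads that leave a non-empty remainder behind.
    k = (chunk_size - 1) // read_size
    # Sum of (chunk_size - i * read_size) for i = 1..k.
    return k * chunk_size - read_size * k * (k + 1) // 2


if __name__ == "__main__":
    total = shuffled_bytes_closed_form(255 * 1024, 100)
    print(total / (1024 * 1024))
```

The total is roughly chunk_size**2 / (2 * read_size), so the copied volume grows quadratically with the chunk size and inversely with the read size.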
I believe read and readline have always had this problem.
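One possible direction, as a minimal sketch only: keep the chunk in place and track a read offset, so each call copies just the bytes it returns. The BufferedChunk class and its _pos attribute are hypothetical stand-ins for illustration. This is not PyMongo's code and not necessarily how a real fix would look, and it assumes size is non-negative.

```python
class BufferedChunk:
    """Hypothetical stand-in for GridOut's chunk buffering (not PyMongo code)."""

    def __init__(self, chunk: bytes):
        self._buffer = chunk
        self._pos = 0  # offset of the first unread byte

    def read(self, size: int) -> bytes:
        # Assumes size >= 0. Slicing copies only the bytes being returned;
        # the unread remainder stays where it is.
        end = min(self._pos + size, len(self._buffer))
        data = self._buffer[self._pos:end]
        self._pos = end
        return data


if __name__ == "__main__":
    reader = BufferedChunk(bytes(255 * 1024))
    copied = 0
    while True:
        piece = reader.read(100)
        if not piece:
            break
        copied += len(piece)
    print(copied)  # total bytes copied equals the chunk size
```

With this approach the bytes copied per chunk equal the chunk size, regardless of how small each read is.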
- related to PYTHON-3502 GridFSBucket.download_to_stream slow (Closed)