Type: Improvement
Resolution: Fixed
Priority: Unknown
Fix Version/s: 4.3.3
Affects Version/s: None
Component/s: GridFS
Labels: None
Confidence Status: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Link: None
Goal Name(s): None

Discovered in ~~PYTHON-3502~~. GridOut.readline and GridOut.read are slow with large chunk sizes. I believe the problem is that both of these methods copy the remainder of the chunk on every call:

        # Return 'size' bytes and store the rest.
        data.seek(size)
        self.__buffer = data.read()
        data.seek(0)
        return data.read(size)

So if only 100 bytes are requested from a 255KB chunk, these methods load the 255KB chunk, copy 100 bytes as the return value, and then copy the remaining 255KB - 100 bytes into self.__buffer. The next call repeats the same copy on a slightly smaller remainder. Shuffling the data around like this is extremely wasteful. We can calculate the wasted data copies for each chunk like this:

def shuffled_bytes(chunk_size, read_size):
    t = 0
    size = chunk_size
    while size > 0:
        size = max(size - read_size, 0)
        t += size
    return t

>>> shuffled_bytes(255*1024, 100) / (1024*1024)
325.00049591064453

With the default chunk size (255KB) and a read size of 100 bytes, reading a single 255KB file shuffles a total of about 325MB of memory!
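As a cross-check on that figure, the loop can be collapsed into an arithmetic series. The sketch below is only illustrative: `shuffled_bytes_closed_form` is a hypothetical name introduced here, and the loop version is repeated so the two can be compared directly.

```python
def shuffled_bytes(chunk_size, read_size):
    # Loop version from the description: sum the remainder copied after each read.
    t = 0
    size = chunk_size
    while size > 0:
        size = max(size - read_size, 0)
        t += size
    return t


def shuffled_bytes_closed_form(chunk_size, read_size):
    # Hypothetical helper: same total, computed without looping.
    # n is the number of reads that leave a non-empty remainder behind.
    n = (chunk_size - 1) // read_size
    # Sum of (chunk_size - k * read_size) for k = 1..n.
    return n * chunk_size - read_size * n * (n + 1) // 2


total = shuffled_bytes_closed_form(255 * 1024, 100)
print(total)  # 340787720 bytes, i.e. roughly 325MB
```

The total grows roughly with chunk_size squared divided by read_size, which is why large chunks combined with small reads are hit hardest.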

I believe read and readline have always had this problem.
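One possible way to avoid the repeated copies is to keep the chunk intact and advance a read offset, so each call copies only the bytes it returns. The sketch below illustrates that idea under stated assumptions: `OffsetBuffer` is a hypothetical helper, not PyMongo code, it assumes `size` is non-negative, and it is not necessarily the change that shipped in 4.3.3.

```python
class OffsetBuffer:
    """Hypothetical helper: serve reads from one chunk without re-copying the tail."""

    def __init__(self, chunk: bytes):
        # A memoryview lets us slice the chunk without copying it.
        self._view = memoryview(chunk)
        self._pos = 0

    def remaining(self) -> int:
        return len(self._view) - self._pos

    def read(self, size: int) -> bytes:
        # Assumes size >= 0. Only the returned slice is copied.
        end = min(self._pos + size, len(self._view))
        data = self._view[self._pos:end].tobytes()
        self._pos = end
        return data


# Read a 255KB chunk 100 bytes at a time.
buf = OffsetBuffer(bytes(255 * 1024))
copied = 0
while buf.remaining():
    copied += len(buf.read(100))
print(copied)  # 261120, one chunk's worth of bytes
```

With this approach the bytes copied per chunk equal the chunk size, instead of growing with the number of reads.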

Attachment: bench-gridfs-PYTHON-3508.py (1.0 kB, Nov 04 2022 11:04:43 PM UTC, Shane Harvey)

Related to: PYTHON-3502 GridFSBucket.download_to_stream slow (Closed)

Assignee: Shane Harvey
Reporter: Shane Harvey
Votes: 0
Watchers: 1

Created: Nov 04 2022 07:20:15 PM UTC
Updated: Oct 29 2023 02:28:06 AM UTC
Resolved: Nov 07 2022 06:41:36 PM UTC
