- Type: Improvement
- Resolution: Unresolved
- Priority: Unknown
- None
- Component/s: GridFS, Performance
- None
- Needed
Summary
GridFS chunk uploads should use insertMany/bulkWrite for better performance. This helps with large files or small chunk sizes, where many chunks are inserted, especially now that each write has higher latency under the default "majority" write concern.
We made this change in PYTHON-4146, which increased GridFS upload speed in one Atlas benchmark from 46 MB/s to 200 MB/s.
Motivation
Most drivers use insertOne to upload each chunk as data is written to the GridFS stream. This is simple but performs poorly when multiple chunks could have been batched into a single insertMany call.
Who is the affected end user?
GridFS users.
How does this affect the end user?
Slow upload performance.
How likely is it that this problem or use case will occur?
Likely, especially with the default majority write concern.
Is this ticket required by a downstream team?
No.
Is this ticket only for tests?
No.
Acceptance Criteria
Update the GridFS spec to require insertMany/bulkWrite to be used when uploading chunks. Include a test to ensure a driver batches writes. We also need to add a test to ensure the bug in CSHARP-4900 is not introduced.
Note: PyMongo batches at 32 MB or 100,000 chunks, since the objective is to fill a single OP_MSG as much as possible. While the driver could theoretically batch up to 48 MB (maxMessageSizeBytes), we decided to use a smaller limit to simplify the implementation. We could reevaluate this decision when implementing this ticket.
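To make the batching rule in the note concrete, here is a minimal Python sketch of grouping chunk documents by a byte cap and a count cap. It is not PyMongo's actual implementation: the helper name `batch_chunks`, its parameters, and the reading of "32MB" as 32 MiB are assumptions for illustration, and the size accounting counts only the chunk payload, not BSON or OP_MSG overhead.

```python
from typing import Any, Dict, Iterable, Iterator, List

# Assumed limits, taken from the note above (32 MB read as 32 MiB here).
MAX_BATCH_BYTES = 32 * 1024 * 1024
MAX_BATCH_CHUNKS = 100_000


def batch_chunks(
    files_id: Any,
    chunks: Iterable[bytes],
    max_bytes: int = MAX_BATCH_BYTES,
    max_chunks: int = MAX_BATCH_CHUNKS,
) -> Iterator[List[Dict[str, Any]]]:
    """Hypothetical helper: group GridFS chunk documents into batches.

    A batch is flushed before adding a chunk that would push the payload
    total past max_bytes, or when it already holds max_chunks documents.
    """
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for n, data in enumerate(chunks):
        too_big = batch_bytes + len(data) > max_bytes
        too_many = len(batch) >= max_chunks
        if batch and (too_big or too_many):
            yield batch
            batch, batch_bytes = [], 0
        # GridFS chunk documents carry files_id, the chunk index n, and data.
        batch.append({"files_id": files_id, "n": n, "data": data})
        batch_bytes += len(data)
    if batch:
        yield batch
```

A driver could then pass each yielded batch to a single insertMany call (in PyMongo, `Collection.insert_many(batch)`) instead of calling insertOne once per chunk, so the write concern latency is paid per batch, not per chunk.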
- related to
  - CSHARP-4900 Uploading a duplicate file larger than the original one causes errors in downloading the original file (Closed)
  - PYTHON-4146 Use insert_many to upload GridFS chunks for better performance (Closed)
  - JAVA-1918 Batch writes to GridFS (Backlog)
- split to
  - PHPLIB-1376 Use insertMany to upload GridFS chunks for better performance (Blocked)