- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Replication
While looking into some performance issues with the new client bulk API in Python, I found that inserting a batch of 100,000 docs with client.bulk_write adds an extra 2.7MB of metadata (27 bytes per operation) just from the repeated "insert" and "document" fields:
>>> doc = {}
>>> client_bulk_op = {"insert": -1, "document": doc}
>>> overhead = len(encode(client_bulk_op)) - len(encode(doc))
>>> overhead
27
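To show where the 27 bytes come from without needing the driver installed, here is a minimal sketch of a BSON encoder that handles only int32 values and nested documents. The helper name `encode_min` is hypothetical and is not part of PyMongo; it only mirrors the BSON layout (4-byte length, elements, trailing NUL) for this one case.

```python
import struct


def encode_min(doc):
    """Encode a dict holding only int32 values and nested dicts as BSON.

    Hypothetical helper for illustration; not a general BSON encoder.
    """
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"  # field names are NUL-terminated
        if isinstance(value, dict):
            body += b"\x03" + name + encode_min(value)  # 0x03: embedded document
        elif isinstance(value, int):
            body += b"\x10" + name + struct.pack("<i", value)  # 0x10: int32
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")
    # 4-byte little-endian total length + elements + trailing NUL byte
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


doc = {}
client_bulk_op = {"insert": -1, "document": doc}
overhead = len(encode_min(client_bulk_op)) - len(encode_min(doc))
print(overhead)  # 27
```

The 27 bytes break down as 5 for the wrapper document (length prefix and terminator), 12 for the "insert" element, and 10 for the "document" type byte and field name.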
Another example:
>>> client.bulk_write([InsertOne({}, namespace='test.test') for _ in range(100_000)])
client bulk OP_MSG size: 4900159
ClientBulkWriteResult(...)
>>> client.test.test.bulk_write([InsertOne({}) for _ in range(100_000)])
collection bulk OP_MSG size: 2200118
BulkWriteResult(...)
That's 4.9MB for client.bulk_write vs 2.2MB for collection.bulk_write.
Is it possible to improve this? Or at least minimize the duplication? For example, using short field names saves 1.2MB here:
>>> client_bulk_op = {"i": -1, "d": doc}
>>> len(encode(client_bulk_op)) - len(encode(doc))
15
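The per-operation overhead can also be derived from the field-name lengths alone, which makes it easy to project the saving for a whole batch. The sketch below assumes the namespace index is encoded as an int32, as in the examples above; `per_op_overhead` is a hypothetical helper, and the two OP_MSG sizes are the ones reported in this ticket.

```python
def per_op_overhead(ns_field, doc_field):
    """Bytes a wrapper like {ns_field: <int32>, doc_field: <doc>} adds around <doc>."""
    wrapper = 4 + 1  # length prefix + trailing NUL of the wrapper document
    ns_element = 1 + len(ns_field.encode("utf-8")) + 1 + 4  # type, name, NUL, int32
    doc_header = 1 + len(doc_field.encode("utf-8")) + 1  # type, name, NUL
    return wrapper + ns_element + doc_header


BATCH = 100_000
long_total = per_op_overhead("insert", "document") * BATCH
short_total = per_op_overhead("i", "d") * BATCH

# Gap between the two OP_MSG sizes reported in this ticket.
measured_gap = 4_900_159 - 2_200_118

print(long_total)                # 2700000
print(short_total)               # 1500000
print(long_total - short_total)  # 1200000
print(measured_gap)              # 2700041
```

The projected 2,700,000 bytes for the long names accounts for nearly all of the measured 2,700,041-byte gap between the two messages.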
Here's an example of an approximately 18.8% decrease in throughput in client.bulk_write() vs collection.bulk_write():
$ TEST_PATH=specifications/source/benchmarking/data OUTPUT_FILE=result.txt python test/performance/perf_test.py -v TestSmallDocBulkInsert TestSmallDocClientBulkInsert
runTest (__main__.TestSmallDocBulkInsert.runTest) ... Completed TestSmallDocBulkInsert 23.822 MB/s, MEDIAN=0.105s, total time=30.098s, iterations=230 ok
runTest (__main__.TestSmallDocClientBulkInsert.runTest) ... Completed TestSmallDocClientBulkInsert 19.355 MB/s, MEDIAN=0.129s, total time=30.143s, iterations=199 ok
$ python
>>> 1-(19.355/23.822)
0.18751574175132224
When I update the collection.bulk_write benchmark to append a 2.7MB comment field, TestSmallDocBulkInsert drops from 23.822 MB/s to 21.615 MB/s, so the extra overhead does seem likely to be a big factor.
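As a rough way to quantify "a big factor", the three throughput figures quoted above can be compared directly. This is only arithmetic on the numbers reported in this ticket; it assumes the padded-comment run is a fair proxy for the extra bytes on the wire and says nothing about other costs such as encoding time.

```python
# Throughput figures (MB/s) as reported in the benchmark runs above.
collection_mb_s = 23.822         # TestSmallDocBulkInsert
client_mb_s = 19.355             # TestSmallDocClientBulkInsert
padded_collection_mb_s = 21.615  # TestSmallDocBulkInsert with a 2.7MB comment


def relative_drop(baseline, value):
    """Fractional decrease of value relative to baseline."""
    return 1 - value / baseline


total_drop = relative_drop(collection_mb_s, client_mb_s)
padding_drop = relative_drop(collection_mb_s, padded_collection_mb_s)

# Share of the collection-vs-client gap reproduced by the padding alone.
share = (collection_mb_s - padded_collection_mb_s) / (collection_mb_s - client_mb_s)

print(f"client vs collection: {total_drop:.1%}")
print(f"padded vs collection: {padding_drop:.1%}")
print(f"share of gap explained by padding: {share:.0%}")
```

On these numbers the padding alone reproduces roughly half of the throughput gap.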
- related to
  - PYTHON-3233 Improved Bulk Write API: Development Complete
  - JAVA-5545 Benchmark Collection and Client bulkWrite: Backlog
  - DRIVERS-2862 Benchmark Collection and Client bulkWrite: Implementing
  - DRIVERS-2954 Support the new Bulk Command use at the Database and Collection levels: Backlog