  Core Server / SERVER-93922

Reduce metadata overhead for bulkWrite command

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: None
    • Component/s: None
    • Assigned Teams: Replication

      Looking into some performance issues with the new client bulk write API in Python, I found that inserting a batch of 100,000 docs via client.bulk_write adds an extra 2.7MB of metadata just from the repeated "insert" and "document" fields on each operation:

      >>> from bson import encode
      >>> doc = {}
      >>> client_bulk_op = {"insert": -1, "document": doc}
      >>> overhead = len(encode(client_bulk_op)) - len(encode(doc))
      >>> overhead
      27
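
      Scaled across the 100,000-document batch, that 27-byte per-op cost accounts for the 2.7MB figure:

      >>> overhead * 100_000 / 1e6  # MB of metadata for the whole batch
      2.7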
      

      Another example:

      >>> from pymongo import MongoClient, InsertOne
      >>> client = MongoClient()
      >>> client.bulk_write([InsertOne({}, namespace='test.test') for _ in range(100_000)])
      client bulk OP_MSG size: 4900159
      ClientBulkWriteResult(...)
      >>> client.test.test.bulk_write([InsertOne({}) for _ in range(100_000)])
      collection bulk OP_MSG size: 2200118
      BulkWriteResult(...)
      

      That's 4.9MB for client.bulk_write vs 2.2MB for collection.bulk_write.
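
      (For context, the "OP_MSG size" lines above come from local debug instrumentation, not stock PyMongo output. A rough unpatched approximation is a command listener that re-encodes the command document from each CommandStartedEvent; this sketch assumes PyMongo merges the OP_MSG document payload back into event.command for monitoring, so the number approximates rather than exactly matches the wire size:)

      from bson import encode
      from pymongo import MongoClient, monitoring

      class CommandSizeLogger(monitoring.CommandListener):
          """Print the approximate wire size of each write command."""

          def started(self, event):
              # Write commands expose their document payload in
              # event.command, so its encoded length is close to the
              # OP_MSG size on the wire.
              if event.command_name in ("insert", "bulkWrite"):
                  print(f"{event.command_name} size: {len(encode(event.command))}")

          def succeeded(self, event):
              pass

          def failed(self, event):
              pass

      client = MongoClient(event_listeners=[CommandSizeLogger()])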
      Is it possible to improve this, or at least minimize the duplication? For example, using short field names saves 1.2MB here:

      >>> client_bulk_op = {"i": -1, "d": doc}
      >>> len(encode(client_bulk_op)) - len(encode(doc))
      15
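
      Scaled the same way, the 12 bytes saved per op add up to the 1.2MB:

      >>> (27 - 15) * 100_000 / 1e6  # MB saved across the batch
      1.2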
      

      Here's an example of an ~18% decrease in throughput for client.bulk_write() vs collection.bulk_write():

      $ TEST_PATH=specifications/source/benchmarking/data OUTPUT_FILE=result.txt python test/performance/perf_test.py -v TestSmallDocBulkInsert TestSmallDocClientBulkInsert
      runTest (__main__.TestSmallDocBulkInsert.runTest) ... Completed TestSmallDocBulkInsert 23.822 MB/s, MEDIAN=0.105s, total time=30.098s, iterations=230
      ok
      runTest (__main__.TestSmallDocClientBulkInsert.runTest) ... Completed TestSmallDocClientBulkInsert 19.355 MB/s, MEDIAN=0.129s, total time=30.143s, iterations=199
      ok
      $ python
      >>> 1-(19.355/23.822)
      0.18751574175132224
      

      When I update the collection.bulk benchmark to append a 2.7MB comment field (a sketch of an equivalent change is below), TestSmallDocBulkInsert drops to 21.615 MB/s. So the extra metadata overhead does seem to be a significant factor.
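
      The exact benchmark edit isn't shown; a minimal sketch of an equivalent padding, using the comment parameter that Collection.bulk_write accepts, would be:

      >>> # Pad the collection-level bulk with ~2.7MB of inert bytes to
      >>> # mimic the client-level bulkWrite metadata overhead.
      >>> client.test.test.bulk_write(
      ...     [InsertOne({}) for _ in range(100_000)],
      ...     comment="x" * 2_700_000,
      ... )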

            Assignee:
            Unassigned
            Reporter:
            Shane Harvey (shane.harvey@mongodb.com)
            Votes:
            1
            Watchers:
            13
