Analyze and improve performance

    • Type: Epic
    • Resolution: Fixed
    • Priority: Unknown
    • 5.5.0
    • Affects Version/s: None
    • Component/s: Performance
    • None
    • Performance Improvements Phase 1
    • Java Drivers
    • Needed
      1. What would you like to communicate to the user about this feature?

      We should highlight the performance improvements introduced in version 5.5.0 of the driver (percentage-based metrics).

      2. Would you like the user to see examples of the syntax and/or executable code and its output?

      For clarity, we can link to the driver and spec repository folders containing the benchmark suites that were used to evaluate performance against server version 8.

      3. Which versions of the driver/connector does this apply to?

      These improvements apply to the Java and Kotlin sync drivers. The async driver should also benefit, as the changes were made in the shared driver core, although we don’t have specific metrics for async improvements.
    • Done
    • 4
    • 4.1
    • 4
    • 100
    • 0
    • 🔵 Done

      Engineer(s): Viacheslav Babanin (Nathan Xu, Ross Lawley)

      2025-05-2

      Last two weeks?

      • Merged a follow up PR for SWAR optimization.
      • Made a final performance run to assess the overall performance impact.
        • Standard transport settings
          • Deep BSON Decoding | 19.44% | 5.4 z-score
          • Deep BSON Encoding | 102% | 22.8 z-score
          • Find many and empty the cursor | 25.08% | 13.72 z-score
          • Find one by ID | 2.7% | 3.16 z-score
          • Flat BSON Decoding | 31.2% | 9.38 z-score
          • Flat BSON Encoding | 199.5% | 12.34 z-score
          • Full BSON Decoding | 16.5% | 7.23 z-score
          • Full BSON Encoding | 147.3% | 10.39 z-score
          • LDJSON multi-file export | 6% | 1.12 z-score
          • LDJSON multi-file import | 21.8% | 8.21 z-score
          • Large doc Client BulkWrite insert | 91.3% | 24.44 z-score
          • Large doc Collection BulkWrite insert | 96.5% | 8.79 z-score
          • Large doc bulk insert | 93.3% | 8.11 z-score
          • Large doc insertOne | 82.4% | 7.28 z-score
          • Run command | 3.9% | 5.18 z-score
          • Small doc Client BulkWrite insert | 49.5% | 17.99 z-score
          • Small doc Collection BulkWrite insert | 47.8% | 6.44 z-score
          • Small doc bulk insert | 39.3% | 5.72 z-score
          • Small doc insertOne | 2.9% | 1.91 z-score
        • Netty transport settings
          • Find many and empty the cursor | 40.3% | 14.81 z-score
          • Find one by ID | 4.4% | 4.12 z-score
          • LDJSON multi-file export | 4.3% | 1.19 z-score
          • LDJSON multi-file import | 16.9% | 3.77 z-score
          • Large doc Client BulkWrite insert | 54.8% | 14.49 z-score
          • Large doc Collection BulkWrite insert | 104.9% | 38.72 z-score
          • Large doc bulk insert | 74.6% | 65.55 z-score
          • Large doc insertOne | 66.6% | 58.65 z-score
          • Run command | 9.3% | 8.53 z-score
          • Small doc Client BulkWrite insert | 36.1% | 15.41 z-score
          • Small doc Collection BulkWrite insert | 39.3% | 37.38 z-score
          • Small doc bulk insert | 35.1% | 41.51 z-score

       


      2025-04-24

      Last two weeks?

      • Got all reviewed PRs merged into main.
      • Created and merged PR for BsonWriter optimization.

      Focus over the next two weeks?

      • Create a follow up PR for SWAR optimization.
      • Make a final performance run to assess the overall performance impact.

      2025-04-11

      Last two weeks?

      • Finalized read path optimizations and created a PR for review.
      • Created PRs for small improvements discovered during performance investigation.
      • Analyzed performance across all open and already merged PRs. Approximate gains observed:
        • Find many and empty the cursor: ~18% improvement
        • Small doc bulk insert: ~29.4% improvement
        • Large doc bulk insert: ~82.4% improvement
        • Large doc insertOne: ~73.1% improvement
        • Small doc collection bulkWrite insert: ~36.1% improvement
        • Large doc collection bulkWrite insert: ~74% improvement
        • Small doc client bulkWrite insert: ~40.1% improvement
        • Large doc client bulkWrite insert: ~81.5% improvement
        • Deep BSON encoding: ~66.3% improvement
        • LDJSON multi-file import: ~26.7% improvement
        • Deep BSON decoding: ~16.6% improvement
        • Flat BSON decoding: ~23.3% improvement
        • Full BSON decoding: ~16.9% improvement
        • Flat BSON encoding: ~207.8% improvement
        • Full BSON encoding: ~98.2% improvement

      Focus over the next two weeks?

      • Get all currently reviewed PRs merged into main.
      • Make a final performance run to assess the overall performance impact.

      2025-03-28

      Last two weeks?

      • Finalized writeString optimizations and created a PR for review.

      Focus over the next two weeks?

      • Finalize read path optimizations and create a PR for review.
      • Create PRs for small improvements discovered during performance investigation.

      2025-03-14

      Last two weeks?

      • Refined BSON byte buffer numeric optimization changes and created a PR for review.
      • Merged codec performance improvements into the main branch.
      • Experimented with BSON read path optimizations, achieving a 25-30% improvement.
      • Created a PR for adding a Netty benchmark suite to measure the performance impact of recent changes.
      • Reviewed external performance PRs for BsonOutput improvements.

      Focus over the next two weeks?

      • Finalize writeString optimizations and create a PR for review.
      • Finalize read path optimizations and create a PR for review.
      • Introduce comprehensive test coverage to ensure there are no regressions after optimizations.

      Impediments encountered in the last two weeks?

      • While running performance tests, discovered a bug leading to a resource leak in Netty transport settings (Ticket: JAVA-5812).
      • Long wait times when running a comprehensive performance analysis on the Evergreen perf analyzer.

      2025-02-26

      Last two weeks?

      • Optimized BSON encoding and decoding by making codec lookups more efficient in both BsonArrayCodec and BsonDocumentCodec.
      • Added JMH into the repository, enabling local benchmarking to assess the relative performance impact of small components beyond spec benchmark tests.
      • Set up performance test environment on Evergreen spawn host to enable Linux perf CPU profiler.
      • Profiled the encoding/write path to identify hotspots for optimization.
      • Experimented with byte buffer numeric optimizations. Initial experiments led to an approximate 20-30% improvement in document insert rate.
      • Experimented with byte buffer string optimizations, with experiments showing 30-90% insert rate improvement, and Deep BSON encoding achieving up to 200% improvement.

      Focus over the next two weeks?

      • Refine BSON byte buffer numeric optimization changes and create PR for review.
      • Get codec performance improvements through review and merge into the main branch.
      • Continue performance optimizations of writeString and determine next optimization steps.
      • Start profiling BSON read path to look for potential improvements. 
      • Review external performance PRs for BsonOutput improvements.
      • Consider adding a Netty benchmark suite to measure the performance impact of recent changes.

      Impediments encountered in the last two weeks?

      • Discovered a discrepancy between the client bulk write and the older bulk write behavior while improving BsonDocumentCodec; this needs further clarification.

      There are both known and potential opportunities to improve performance across the driver. This Epic covers the investigation of such opportunities and the implementation of feasible optimizations.

      Summary of Performance Optimizations
      Following profiling, benchmarking, and code analysis, several improvements were explored, prototyped, validated, and implemented.

      BSON codec optimizations:

      • BsonArrayCodec and BsonDocumentCodec now use a BsonTypeCodecMap to avoid costly CodecRegistry#get lookups.
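
      The idea behind the bullet above can be sketched without the driver on the classpath. This is a minimal illustration, not the driver's BsonTypeCodecMap: ValueType, Encoder, Registry, and TypeEncoderMap are hypothetical stand-ins. It shows registry lookups being resolved once at construction time and then replaced by a plain array index in the hot loop.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

final class TypeCodecCacheSketch {
    /** Hypothetical stand-in for the value types a reader can report. */
    enum ValueType { INT32, STRING, BOOLEAN }

    /** Hypothetical stand-in for a codec; only encoding is modeled. */
    interface Encoder {
        String encode(Object value);
    }

    /** Hypothetical class-keyed registry; counts lookups so the saving is visible. */
    static final class Registry {
        private final Map<Class<?>, Encoder> encoders = new HashMap<>();
        int lookups;

        void register(Class<?> type, Encoder encoder) {
            encoders.put(type, encoder);
        }

        Encoder get(Class<?> type) {
            lookups++;
            Encoder encoder = encoders.get(type);
            if (encoder == null) {
                throw new IllegalArgumentException("No encoder for " + type);
            }
            return encoder;
        }
    }

    /** Resolves every encoder once, then serves lookups from an array. */
    static final class TypeEncoderMap {
        private final Encoder[] byType = new Encoder[ValueType.values().length];

        TypeEncoderMap(Registry registry, Map<ValueType, Class<?>> typeToClass) {
            for (Map.Entry<ValueType, Class<?>> entry : typeToClass.entrySet()) {
                byType[entry.getKey().ordinal()] = registry.get(entry.getValue());
            }
        }

        Encoder get(ValueType type) {
            return byType[type.ordinal()]; // no registry call in the hot path
        }
    }

    static Registry defaultRegistry() {
        Registry registry = new Registry();
        registry.register(Integer.class, v -> "i:" + v);
        registry.register(String.class, v -> "s:" + v);
        registry.register(Boolean.class, v -> "b:" + v);
        return registry;
    }

    static Map<ValueType, Class<?>> defaultTypes() {
        Map<ValueType, Class<?>> types = new EnumMap<>(ValueType.class);
        types.put(ValueType.INT32, Integer.class);
        types.put(ValueType.STRING, String.class);
        types.put(ValueType.BOOLEAN, Boolean.class);
        return types;
    }
}
```

      In this sketch the registry is consulted three times in total, regardless of how many values are later encoded through TypeEncoderMap.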

      ObjectId optimization:

      • In-memory representation of ObjectId changed from (int, int, short, int) to (int, long) for more efficient sorting and serialization.
      • The optimization above enabled BsonObjectId#compareTo to be refactored to avoid allocating extra arrays; it now directly compares fields using Integer.compareUnsigned.
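
      A rough sketch of the (int, long) layout described above, for illustration only. ObjectIdLayoutSketch is a hypothetical class, not the driver's ObjectId. The field names and the use of Long.compareUnsigned for the second field are assumptions; the ticket only mentions Integer.compareUnsigned.

```java
import java.nio.ByteBuffer;

final class ObjectIdLayoutSketch implements Comparable<ObjectIdLayoutSketch> {
    private final int timestamp; // first 4 bytes of the 12-byte id
    private final long nonce;    // remaining 8 bytes packed into one long (assumed name)

    ObjectIdLayoutSketch(int timestamp, long nonce) {
        this.timestamp = timestamp;
        this.nonce = nonce;
    }

    /** Reads the 12 bytes as one big-endian int followed by one big-endian long. */
    static ObjectIdLayoutSketch fromBytes(byte[] bytes) {
        if (bytes.length != 12) {
            throw new IllegalArgumentException("Expected 12 bytes");
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes); // big-endian by default
        return new ObjectIdLayoutSketch(buffer.getInt(), buffer.getLong());
    }

    /** Serializes with two bulk writes instead of twelve single-byte writes. */
    byte[] toByteArray() {
        return ByteBuffer.allocate(12).putInt(timestamp).putLong(nonce).array();
    }

    /** Compares as unsigned values, so no temporary byte arrays are allocated. */
    @Override
    public int compareTo(ObjectIdLayoutSketch other) {
        int result = Integer.compareUnsigned(timestamp, other.timestamp);
        return result != 0 ? result : Long.compareUnsigned(nonce, other.nonce);
    }
}
```

      Because both fields are read big-endian and compared as unsigned values, the ordering in this sketch matches a byte-by-byte unsigned comparison of the 12-byte representation.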

      Write path improvements:

      • Replaced byte-level writes with ByteBuffer.putInt/putLong/putDouble where applicable. These methods are often intrinsified into efficient machine instructions and store several bytes in a single instruction, eliminating the overhead of native method calls.
      • ByteBufferBsonOutput now caches the current buffer in a field instead of making repeated bufferList.get(index) calls, reducing bounds checks.
      • BasicOutputBuffer#toByteArray() was optimized to avoid double-copying.
      • Reduced constant factor of buffer lookups in writeInt32 from 4 to 1 by preferring a single putInt call when capacity allows.
      • Replaced java.util.Stack with ArrayDeque in BsonWriter, removing unnecessary synchronization overhead.
      • Introduced caching for array index field names ("0", "1", ...), avoiding repetitive allocations.
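
      To illustrate the first and fourth bullets above, here is a minimal sketch comparing four single-byte writes with one putInt call. Int32WriteSketch and its method names are hypothetical and do not mirror the driver's ByteBufferBsonOutput; handling of a value that spans two buffers is omitted. The buffer is set to little-endian because BSON stores integers in little-endian order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Int32WriteSketch {
    /** Allocates a heap buffer configured for little-endian bulk writes. */
    static ByteBuffer newBuffer(int capacity) {
        return ByteBuffer.allocate(capacity).order(ByteOrder.LITTLE_ENDIAN);
    }

    /** Byte-level write: four put calls, each with its own bounds check. */
    static void writeInt32ByteByByte(ByteBuffer buffer, int value) {
        buffer.put((byte) value);
        buffer.put((byte) (value >> 8));
        buffer.put((byte) (value >> 16));
        buffer.put((byte) (value >> 24));
    }

    /** Bulk write: a single putInt call that the JIT can often intrinsify. */
    static void writeInt32Bulk(ByteBuffer buffer, int value) {
        buffer.putInt(value);
    }

    public static void main(String[] args) {
        ByteBuffer slow = newBuffer(4);
        ByteBuffer fast = newBuffer(4);
        writeInt32ByteByByte(slow, 0x12345678);
        writeInt32Bulk(fast, 0x12345678);
        System.out.println(java.util.Arrays.equals(slow.array(), fast.array()));
    }
}
```

      Both methods produce the same bytes; the difference lies only in how many buffer operations are needed to get there.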

      Read path improvements:

      • Reduced allocations during String decoding by reusing buffers and avoiding unnecessary copies.
      • Scanned for null terminators in bulk using SWAR (SIMD within a register) instead of byte by byte.
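
      The SWAR scan mentioned above can be sketched as follows. This is an illustrative version using the well-known "has zero byte" bit trick on 8-byte words, not the driver's actual code; SwarNulScanSketch and indexOfNul are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SwarNulScanSketch {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;

    /** Returns the index of the first 0x00 byte at or after 'from', or -1 if none. */
    static int indexOfNul(byte[] data, int from) {
        // Little-endian reads put the earliest byte in the lowest bits of the word.
        ByteBuffer buffer = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int i = from;
        while (i + Long.BYTES <= data.length) {
            long word = buffer.getLong(i);
            // Non-zero only if the word contains a zero byte. Higher bytes may be
            // flagged spuriously, but the lowest flagged byte is always a true zero.
            long mask = (word - ONES) & ~word & HIGH_BITS;
            if (mask != 0) {
                return i + (Long.numberOfTrailingZeros(mask) >>> 3);
            }
            i += Long.BYTES;
        }
        // Tail shorter than 8 bytes: fall back to a byte-by-byte scan.
        for (; i < data.length; i++) {
            if (data[i] == 0) {
                return i;
            }
        }
        return -1;
    }
}
```

      The word-at-a-time loop inspects eight bytes per iteration, which is where the saving over a byte-by-byte loop comes from when strings are longer than a few characters.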

      The following optimizations were identified but not completed within the scope of this epic:

      • BsonDocumentCodec _id field re-ordering could be more efficient
      • Introduce valueOf methods in BsonInt32 and BsonInt64 (and use them in the corresponding Codec) that are roughly equivalent to the ones in Integer and Long (which have caches for small values).
      • BsonDocumentCodec shouldn't make a copy of its elements when decoding
      • ByteBufferBsonOutput is great for minimizing heap use, but there is a performance cost of all the buffer management that it has to do in an inner loop. We can consider using a simpler implementation of OutputBuffer that trades off memory use for speed. For example, we could just cache 48MB buffers instead of power-of-two buffers.
      • There are many calls to CodecRegistry#get in DocumentCodec#writeValue and BsonDocumentCodec#writeValue. The implementation of this method in ProvidersCodecRegistry is not built for use in inner loops. Some caching within the Codec implementation could be useful here.
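
      As a sketch of the valueOf idea in the list above: this was not implemented in this epic, so everything here is hypothetical. Int32Value is a stand-in for a boxed BSON integer, and the cached range of -128 to 127 is an assumption borrowed from the default range of Integer.valueOf.

```java
final class SmallIntCacheSketch {
    /** Hypothetical immutable wrapper around an int, with a small-value cache. */
    static final class Int32Value {
        private static final int LOW = -128;
        private static final int HIGH = 127;
        private static final Int32Value[] CACHE = new Int32Value[HIGH - LOW + 1];

        static {
            for (int i = 0; i < CACHE.length; i++) {
                CACHE[i] = new Int32Value(LOW + i);
            }
        }

        private final int value;

        private Int32Value(int value) {
            this.value = value;
        }

        /** Returns a shared instance for small values and allocates otherwise. */
        static Int32Value valueOf(int value) {
            if (value >= LOW && value <= HIGH) {
                return CACHE[value - LOW];
            }
            return new Int32Value(value);
        }

        int intValue() {
            return value;
        }
    }
}
```

      Sharing instances is only safe because the wrapper is immutable; a decoder using such a method would avoid one allocation per small integer it reads.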

            Assignee:
            Slav Babanin
            Reporter:
            Jeffrey Yemin
            Votes:
            0
            Watchers:
            5

              Created:
              Updated:
              Resolved:
              1 year, 11 weeks, 6 days