RawBsonDocument serialization regression introduced in 5.5


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 5.7.0
    • Affects Version/s: 5.5.0
    • Component/s: Codecs
    • Java Drivers
    • Not Needed

      Summary

      Mongot currently uses mongodb-driver-sync:4.11.5 and does a lot of manipulation of RawBsonDocuments when materializing its result sets. We rely on the fact that serializing a RawBsonDocument is a simple, efficient array copy. After upgrading mongodb-driver-sync to 5.5.3, several performance tests show >100% regressions in E2E latency.

       

      Version 5.5.1 without patch vs Baseline 4.11.5:

      https://performance-analyzer.server-tig.prod.corp.mongodb.com/perf-analyzer-viz/?comparison_id=698cc5e65ae6ba72cd67632b

       

      Version 5.5.1 with patch vs Baseline 4.11.5: https://performance-analyzer.server-tig.prod.corp.mongodb.com/perf-analyzer-viz/?comparison_id=698d00816295f61738b0357c&selected_tab=scatter-plots&percent_filter=0%7C%7C100&z_filter=0%7C%7C10&filter_type=Default

       

      Disclaimer: The following root cause analysis was generated by AI, but we verified that its proposed workaround resolves the observed regressions.

      It seems that https://github.com/mongodb/mongo-java-driver/pull/1632 changed BsonDocumentCodec to look up codecs by BsonType rather than by `getClass()`. This is problematic because BsonDocument and RawBsonDocument share a BsonType but require different handling. In the new implementation, RawBsonDocuments are serialized by iterating over their values, deserializing them, and reserializing them, which is why several perf tests show a >100% increase in E2E latency.
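
To illustrate the dispatch problem described above, here is a toy model that uses only the Java standard library. It is not driver code: `Kind`, `Doc`, `RawDoc`, and the encoder labels are hypothetical stand-ins for BsonType, BsonDocument, RawBsonDocument, and their codecs. It shows how a lookup keyed by a shared type tag cannot tell a subclass from its parent, while a lookup keyed by `getClass()` can.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

public class DispatchSketch {
    // Hypothetical stand-in for BsonType: both document classes report the same tag.
    enum Kind { DOCUMENT }

    // Hypothetical stand-ins for BsonDocument and RawBsonDocument.
    static class Doc {
        Kind kind() { return Kind.DOCUMENT; }
    }
    static class RawDoc extends Doc { }

    // Registry keyed by concrete class: the subclass can have its own encoder.
    static final Map<Class<?>, String> BY_CLASS = new HashMap<>();
    // Registry keyed by type tag: only one encoder per tag is possible.
    static final Map<Kind, String> BY_KIND = new EnumMap<>(Kind.class);

    static {
        BY_CLASS.put(Doc.class, "field-by-field encoder");
        BY_CLASS.put(RawDoc.class, "raw byte-copy encoder");
        BY_KIND.put(Kind.DOCUMENT, "field-by-field encoder");
    }

    static String lookupByClass(Doc value) {
        return BY_CLASS.get(value.getClass());
    }

    static String lookupByKind(Doc value) {
        return BY_KIND.get(value.kind());
    }

    public static void main(String[] args) {
        Doc raw = new RawDoc();
        System.out.println("by class: " + lookupByClass(raw));
        System.out.println("by kind:  " + lookupByKind(raw));
    }
}
```

In this sketch the class-keyed lookup selects the byte-copy encoder for `RawDoc`, while the tag-keyed lookup falls back to the field-by-field encoder, which mirrors the slow path suspected in this ticket.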

       

      How to Reproduce

      Create a BsonDocument that contains a BsonArray of 100,000 small RawBsonDocuments and serialize the object to a byte[].
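
A minimal sketch of the reproduction steps above, assuming the `org.bson` library is on the classpath. The class name `RawBsonDocumentRepro`, the field names `docs` and `i`, and the timing approach are illustrative assumptions, not the actual Mongot performance test. Comparing the elapsed time on driver 4.11.x against 5.5.x should surface the difference.

```java
import java.util.ArrayList;
import java.util.List;
import org.bson.BsonArray;
import org.bson.BsonBinaryWriter;
import org.bson.BsonDocument;
import org.bson.BsonInt32;
import org.bson.BsonValue;
import org.bson.RawBsonDocument;
import org.bson.codecs.BsonDocumentCodec;
import org.bson.codecs.EncoderContext;
import org.bson.io.BasicOutputBuffer;

public class RawBsonDocumentRepro {
    // Builds { docs: [ {i: 0}, {i: 1}, ... ] } where every element is a RawBsonDocument.
    static BsonDocument buildDocument(int count) {
        BsonDocumentCodec codec = new BsonDocumentCodec();
        List<BsonValue> values = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            values.add(new RawBsonDocument(new BsonDocument("i", new BsonInt32(i)), codec));
        }
        return new BsonDocument("docs", new BsonArray(values));
    }

    // Serializes the document to a byte[] with the default BsonDocumentCodec.
    static byte[] serialize(BsonDocument document) {
        BasicOutputBuffer buffer = new BasicOutputBuffer();
        try (BsonBinaryWriter writer = new BsonBinaryWriter(buffer)) {
            new BsonDocumentCodec().encode(writer, document, EncoderContext.builder().build());
            // Copy the bytes out before the writer is closed.
            return buffer.toByteArray();
        }
    }

    public static void main(String[] args) {
        BsonDocument document = buildDocument(100_000);
        long start = System.nanoTime();
        byte[] bytes = serialize(document);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(bytes.length + " bytes in " + elapsedMs + " ms");
    }
}
```

A single timed call like this is only a rough signal; a harness such as JMH with warm-up iterations would give more reliable numbers.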

       

      This is the fix suggested by Cursor. Please let us know if we can do something better.

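      import org.bson.BsonDocument;
      import org.bson.BsonWriter;
      import org.bson.RawBsonDocument;
      import org.bson.codecs.BsonDocumentCodec;
      import org.bson.codecs.BsonValueCodecProvider;
      import org.bson.codecs.Codec;
      import org.bson.codecs.EncoderContext;
      import org.bson.codecs.RawBsonDocumentCodec;
      import org.bson.codecs.configuration.CodecProvider;
      import org.bson.codecs.configuration.CodecRegistries;
      import org.bson.codecs.configuration.CodecRegistry;

      /**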
       * Creates a {@link BsonDocumentCodec} backed by a {@link CodecRegistry} that provides {@link
       * RawBsonDocumentAwareBsonDocumentCodec} for {@code BsonDocument.class}. Because the codec's
       * internal {@code BsonTypeCodecMap} resolves the codec for {@code BsonType.DOCUMENT} from the
       * registry (keyed by {@code BsonDocument.class}), this ensures our override is used for ALL
       * nested document values — including {@link RawBsonDocument} instances inside arrays and
       * sub-documents.
       */
      private static BsonDocumentCodec createOptimizedCodec() {
        // A provider that returns our RawBsonDocument-aware codec for BsonDocument.class, delegating
        // everything else to the standard BsonValueCodecProvider.
        CodecProvider optimizedProvider =
            new CodecProvider() {
              private final BsonValueCodecProvider delegate = new BsonValueCodecProvider();
      
              @Override
              @SuppressWarnings("unchecked")
              public <T> Codec<T> get(Class<T> clazz, CodecRegistry registry) {
                if (clazz == BsonDocument.class) {
                  return (Codec<T>) new RawBsonDocumentAwareBsonDocumentCodec(registry);
                }
                return delegate.get(clazz, registry);
              }
            };
        CodecRegistry registry = CodecRegistries.fromProviders(optimizedProvider);
        return new RawBsonDocumentAwareBsonDocumentCodec(registry);
      }
      
      /**
       * A {@link BsonDocumentCodec} subclass that restores the BSON 4.x behavior of encoding {@link
       * RawBsonDocument} values by piping their raw bytes directly, rather than decoding and
       * re-encoding each field.
       *
       * <p>In driver 5.x, {@link BsonDocumentCodec} resolves child codecs via a {@code
       * BsonTypeCodecMap} keyed by {@code BsonType}, so a {@link RawBsonDocument} (which has
       * BsonType.DOCUMENT) gets the generic {@link BsonDocumentCodec} codec. That codec calls {@code
       * RawBsonDocument.entrySet()} which lazily decodes the raw bytes, then re-encodes every field —
       * an O(fields) decode+encode instead of an O(bytes) memcpy.
       *
       * <p>This class overrides {@code encode()} to detect {@link RawBsonDocument} and use pipe-based
       * byte copying. Because it is registered in the codec registry as the codec for {@code
       * BsonDocument.class}, the internal {@code BsonTypeCodecMap} maps {@code BsonType.DOCUMENT} to
       * this codec, so the optimization applies recursively to all nested document values.
       */
      static final class RawBsonDocumentAwareBsonDocumentCodec extends BsonDocumentCodec {
        private static final RawBsonDocumentCodec RAW_CODEC = new RawBsonDocumentCodec();
      
        RawBsonDocumentAwareBsonDocumentCodec(CodecRegistry registry) {
          super(registry);
        }
      
        @Override
        public void encode(BsonWriter writer, BsonDocument document, EncoderContext encoderContext) {
          if (document instanceof RawBsonDocument rawDoc) {
            // Delegate to RawBsonDocumentCodec which pipes the raw bytes directly.
            RAW_CODEC.encode(writer, rawDoc, encoderContext);
          } else {
            super.encode(writer, document, encoderContext);
          }
        }
      } 

            Assignee: Slav Babanin
            Reporter: Evan Darke