ByteArray field in Kotlin data class encodes as BSON Array of Int32 instead of BSON Binary since 4.11.3/5.1.3


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Unknown
    • Affects Version/s: 4.11.3, 5.1.3
    • Component/s: Kotlin
    • Java Drivers

      Summary

      Since mongodb-driver-kotlin-sync 4.11.3 and 5.1.3, a ByteArray field in a Kotlin data class is encoded by DataClassCodec as a BSON Array of Int32 elements (one element per byte) instead of BSON Binary. This causes an ~8-10x document size expansion, silently breaking code that stores binary content in a data class.

      A document holding 2 MB of binary data now encodes to ~20 MB, exceeding MongoDB's 16 MB document limit and throwing BsonMaximumSizeExceededException at runtime with no compile-time warning.

      Root cause

      PR #1457 (JAVA-5122) introduced an isArray() check at the top of DataClassCodec.getCodec().
      When the check is true, ArrayCodec.create() is called unconditionally, bypassing the codec registry entirely.

      Before 4.11.3 and 5.1.3 (on their respective release lines), a ByteArray field fell through to the codec registry, which returned ByteArrayCodec, encoding the value as compact BSON Binary. From those versions onward, the isArray() check intercepts ByteArray first and routes it through ArrayCodec, which iterates over the bytes and writes each one as a separate BSON Int32.

      JAVA-5122 was filed for Array<String>, i.e. object arrays that previously threw during codec construction. The fix correctly addresses that case but is too broad: it also captures primitive arrays such as ByteArray, which already had a correct and compact encoding via ByteArrayCodec.
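
      To illustrate why a bare isArray() check matches ByteArray, here is a minimal, standard-library-only sketch. It is not the driver's code: isObjectArray is a hypothetical helper that only shows one possible way to tell object arrays (the JAVA-5122 case) apart from primitive arrays.

```kotlin
// Illustrative only, not taken from the driver.
// isObjectArray is a hypothetical helper name.
fun isObjectArray(type: Class<*>): Boolean =
    type.isArray && !type.componentType.isPrimitive

fun main() {
    val byteArrayType = ByteArray::class.java        // JVM type byte[]
    val stringArrayType = Array<String>::class.java  // JVM type String[]

    // ByteArray is an array on the JVM, so a plain isArray() check matches it.
    println(byteArrayType.isArray)                    // true
    // Its component type is the primitive byte.
    println(byteArrayType.componentType.isPrimitive)  // true

    // Filtering on the component type separates the two cases.
    println(isObjectArray(byteArrayType))             // false
    println(isObjectArray(stringArrayType))           // true
}
```

      Whether a guard of this shape is the right fix is for the maintainers to decide; the sketch only demonstrates that the two array kinds are distinguishable at the point where the check runs.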

      Expected behavior

      A ByteArray field in a Kotlin data class should be encoded as BSON Binary, consistent with the behavior of ByteArrayCodec and with all prior versions of mongodb-driver-kotlin-sync.

      Actual behavior

      Since 4.11.3 & 5.1.3, the field is encoded as a BSON Array of Int32 elements, one per byte.
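
      The size cost of this encoding can be estimated from the BSON specification alone. The sketch below is a back-of-the-envelope calculation, not a measurement of the driver: it assumes array elements are keyed by their decimal index ("0", "1", ...) as the specification describes, and the function names are made up for illustration.

```kotlin
// Rough estimate based on the BSON specification, not on driver internals.
// A BSON array is stored as a document whose keys are decimal indexes.
// Each Int32 element costs: 1 type byte + index digits + 1 NUL + 4 value bytes.
fun int32ArraySize(elementCount: Int): Long {
    var total = 5L // 4-byte document length prefix + 1-byte terminator
    for (i in 0 until elementCount) {
        total += 1 + i.toString().length + 1 + 4
    }
    return total
}

// A BSON Binary value costs: 4-byte length + 1 subtype byte + the payload.
fun binarySize(byteCount: Int): Long = 4L + 1L + byteCount

fun main() {
    val n = 2_000_000
    println("as Int32 array: ${int32ArraySize(n)} bytes") // 24888895
    println("as Binary:      ${binarySize(n)} bytes")     // 2000005
}
```

      Under these assumptions, 2,000,000 bytes take about 24.9 MB as an Int32 array versus about 2 MB as Binary. That is the same order of magnitude as the rough figures in the summary, and if anything somewhat higher, because the index keys grow with the array length.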

      Reproducer

      No MongoDB instance is required to run this test.

      import com.mongodb.kotlin.client.MongoClient
      import org.bson.BsonBinaryWriter
      import org.bson.codecs.EncoderContext
      import org.bson.io.BasicOutputBuffer
      import org.junit.jupiter.api.Test
      import org.assertj.core.api.Assertions.assertThat
      
      data class Attachment(val name: String, val content: ByteArray)
      
      class ByteArrayDataClassEncodingTest {
      
          @Test
          fun `storing an attachment should not exceed MongoDB 16MB limit for reasonable content sizes`() {
              MongoClient.create("mongodb://localhost:27017").use { client ->
                  val codec = client.getDatabase("test")
                      .getCollection<Attachment>("test")
                      .codecRegistry
                      .get(Attachment::class.java)
      
                  val twoMegabytes = ByteArray(2_000_000)
                  val buffer = BasicOutputBuffer()
                  BsonBinaryWriter(buffer).use { writer ->
                      codec.encode(writer, Attachment("report.pdf", twoMegabytes), EncoderContext.builder().build())
                  }
      
                  // Expected: ~2 MB. Actual on mongodb-driver-kotlin-sync 4.11.3+ and 5.1.3+: ~20 MB
                  assertThat(buffer.size)
                      .describedAs("encoded document size")
                      .isLessThan(16_000_000)
              }
          }
      }
      

      The test:

      • passes on 4.11.2 and fails from 4.11.3 onward,
      • passes on 5.1.2 and fails from 5.1.3 onward.

      Impact

      • Silent regression: existing code that stores ByteArray fields in data classes breaks at runtime with no compile-time indication
      • Any binary content larger than ~1.6 MB stored via a data class field will throw
        BsonMaximumSizeExceededException
      • I fear the old encoding (BSON Binary) and the new encoding (BSON Array) are not round-trip compatible: data written before the upgrade may not be readable after upgrading

      Version information

       

            Assignee:
            Ross Lawley
            Reporter:
            Aurélien Minvielle