The recovery of chunk migrations may cause a server crash of the donor shard when nodes run mixed binaries


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • 8.2 Required
    • Affects Version/s: 8.2.0-rc0
    • Component/s: None
    • None
    • Catalog and Routing
    • ALL
      • a chunk migration starts from a donor node running an 8.2 binary
      • the donor node steps down before completing the migration
      • the newly elected primary is running an 8.0/8.1 binary
      • the new primary crashes
    • CAR Team 2025-07-07
    • None
    • 3
    • TBD
    • 🟥 DDL

      SERVER-74032 introduced the new transfersFirstCollectionChunkToRecipient optional field to the schema of MigrationCoordinatorDocument to generate the expected change stream events in case of donor step down.

      This schema is currently protected by a strict constraint, so persisting the new field must be FCV-gated, and in-flight migrations must be drained upon FCV downgrade.
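
      As an illustration only, FCV gating of the new field could look like the sketch below. The function name, the tuple representation of the FCV, the 8.2 threshold, and the `decision` field are assumptions made for this example; they are not the server's actual code.

```python
# Hypothetical sketch of FCV-gating an optional field; names are illustrative.

NEW_FIELD = "transfersFirstCollectionChunkToRecipient"
GATING_FCV = (8, 2)  # assumption: first FCV that may persist the new field


def build_decision_update(decision, transfers_first_chunk, fcv):
    """Build the update for the recovery document at decision time.

    The new field is included only when the cluster FCV allows it, so that
    older binaries never have to parse a document they do not understand.
    """
    fields = {"decision": decision}
    if fcv >= GATING_FCV:
        fields[NEW_FIELD] = transfers_first_chunk
    return {"$set": fields}


update_on_8_0_fcv = build_decision_update("committed", True, (8, 0))
update_on_8_2_fcv = build_decision_update("committed", True, (8, 2))
```

      With this shape, every code path that writes the recovery document has to go through the same gate, including the commit/abort decision update.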

      Test regressions revealed that the implementation of these protections is nevertheless flawed: when the donor sets a commit/abort decision on the migration, the recovery document receives an update that includes the backward-incompatible field, which opens the door to the following scenario:

      • At commit decision time, the primary node of the donor (running an 8.2 binary) persists an update of the recovery document that includes the transfersFirstCollectionChunkToRecipient field
      • An election occurs
      • The new primary (running an 8.0 binary) runs the migration recovery routine and crashes with an unhandled DBException because it cannot parse the recovery document
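
      The scenario above can be modeled with a minimal sketch, assuming the strict constraint means that a binary rejects any field it does not know at parse time. The field sets, `ParseError`, and both functions are hypothetical stand-ins, not the server's parser. The sketch catches the error only so that it runs to completion; in the reported scenario the exception is not handled.

```python
# Hypothetical model of strict parsing of the recovery document.
# Field sets and function names are illustrative, not MongoDB's actual code.

KNOWN_FIELDS_8_0 = {"_id", "decision"}
KNOWN_FIELDS_8_2 = KNOWN_FIELDS_8_0 | {"transfersFirstCollectionChunkToRecipient"}


class ParseError(Exception):
    """Stands in for the DBException raised on a parse failure."""


def parse_strict(doc, known_fields):
    """Reject any document carrying a field this binary does not know."""
    unknown = sorted(set(doc) - known_fields)
    if unknown:
        raise ParseError(f"unknown field(s): {unknown}")
    return dict(doc)


def recover_migration(doc, known_fields):
    """Run recovery and report the outcome instead of propagating the error."""
    try:
        return ("recovered", parse_strict(doc, known_fields))
    except ParseError as exc:
        return ("failed", str(exc))


# Recovery document as updated by the 8.2 donor primary at decision time.
recovery_doc = {
    "_id": "migration-1",
    "decision": "committed",
    "transfersFirstCollectionChunkToRecipient": True,
}

status_new, _ = recover_migration(recovery_doc, KNOWN_FIELDS_8_2)
status_old, reason = recover_migration(recovery_doc, KNOWN_FIELDS_8_0)
print(status_new)
print(status_old, reason)
```

      The same document parses on the newer field set and fails on the older one, which is why the field must never reach disk while an older binary can still become primary.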

            Assignee:
            Paolo Polato
            Reporter:
            Paolo Polato
            Votes:
            0
            Watchers:
            2

              Created:
              Updated: