Shard overwrites bulk write response payload when metadata refresh is interrupted

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 8.0.0, 8.2.0
    • Component/s: None
    • Catalog and Routing
    • 🟦 Shard Catalog

      When a shard executing a bulk write with multiple operations fails some of them, for example due to a StaleConfig error, it produces an ok: 1 response whose payload indicates which writes succeeded and which did not. This allows the operation outcome to be reported correctly and also lets mongos perform retries when it is safe to do so.

      Before responding to mongos, shards attempt to recover the sharding metadata. However, if that recovery fails due to an Interruption error, the shard overwrites the ok: 1 response and returns a top-level ok: 0 error instead. The detailed per-operation outcome is lost, so mongos cannot determine the appropriate retries. The error is then propagated to the driver without any information about which operations succeeded and which failed.
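
      The behaviour described above can be modelled with a small sketch. This is not server code: the function names, the `Interrupted` exception, and the response dictionaries are hypothetical simplifications, not the actual wire format or implementation. `respond_current` mirrors the reported behaviour, while `respond_preserving` shows one possible direction (an assumption, not a decided fix) in which the per-operation payload survives an interrupted refresh.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class Interrupted(Exception):
    """Hypothetical stand-in for an Interruption error during metadata refresh."""


@dataclass
class OpResult:
    idx: int                      # position of the write in the bulk request
    ok: bool                      # whether this individual write succeeded
    error: Optional[str] = None   # e.g. "StaleConfig" when ok is False


def respond_current(results: List[OpResult], refresh: Callable[[], None]) -> dict:
    """Models the reported bug: an interrupted refresh replaces the payload."""
    try:
        refresh()
    except Interrupted as exc:
        # The per-operation results are dropped here.
        return {"ok": 0, "error": str(exc)}
    return {"ok": 1, "results": results}


def respond_preserving(results: List[OpResult], refresh: Callable[[], None]) -> dict:
    """Assumed alternative: treat the refresh as best effort and keep the payload."""
    try:
        refresh()
    except Interrupted:
        pass  # the refresh did not complete; the write outcomes are still valid
    return {"ok": 1, "results": results}
```

      With a refresh callback that raises `Interrupted`, `respond_current` returns only the top-level error, whereas `respond_preserving` still returns every `OpResult`.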

      With retryableWrites=true, the driver can safely retry the whole operation, so this is transparent to the application, although it is somewhat inefficient because operations that had already succeeded are retried as well.

      With retryableWrites=false, the driver cannot retry, and the application simply gets a top-level error that does not report the outcomes of the individual writes.
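
      To make the difference between these cases concrete, here is a rough sketch of which writes end up being resent depending on the response shape. The `resend_plan` function and the dictionary layout are hypothetical and greatly simplified. In reality the decision is split between mongos and the driver, and this sketch only illustrates the reasoning.

```python
from typing import List, Optional


def resend_plan(num_ops: int, response: dict, retryable_writes: bool) -> Optional[List[int]]:
    """Return the indexes of writes that would be resent, or None if no retry is possible.

    `response` is a simplified model: {"ok": 1, "results": [{"idx", "ok", "error"}, ...]}
    for a detailed reply, or {"ok": 0, "error": ...} for a top-level error.
    """
    if response["ok"] == 1:
        # Per-operation outcomes are known: only writes that failed with
        # StaleConfig need to be retried.
        return [r["idx"] for r in response["results"]
                if not r["ok"] and r.get("error") == "StaleConfig"]
    if retryable_writes:
        # No per-operation detail: the whole batch is retried, including
        # writes that had already succeeded.
        return list(range(num_ops))
    # No detail and no retry: the application only sees the top-level error.
    return None


detailed = {"ok": 1, "results": [
    {"idx": 0, "ok": True},
    {"idx": 1, "ok": False, "error": "StaleConfig"},
    {"idx": 2, "ok": True},
]}
top_level = {"ok": 0, "error": "Interrupted"}

print(resend_plan(3, detailed, retryable_writes=False))   # [1]
print(resend_plan(3, top_level, retryable_writes=True))   # [0, 1, 2]
print(resend_plan(3, top_level, retryable_writes=False))  # None
```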

      Shards should avoid discarding the response payload indicating the individual write operation outcomes.

        1. 0001-repro.patch
          6 kB
          Jordi Serra Torrens

            Assignee:
            Unassigned
            Reporter:
            Jordi Serra Torrens
            Votes:
            0
            Watchers:
            3

              Created:
              Updated: