SERVER-61127 (Core Server)

Multi-writes may exhaust the number of retry attempts in the presence of ongoing chunk migrations

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 6.1.0-rc0, 6.0.8
    • Affects Version/s: None
    • Component/s: Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v6.0
    • Sprint: Sharding EMEA 2022-01-24, Sharding EMEA 2022-02-07, Sharding EMEA 2022-02-21, Sharding EMEA 2022-03-07, Sharding EMEA 2022-03-21, Sharding EMEA 2022-05-02, Sharding EMEA 2022-05-16, Sharding EMEA 2022-05-30, Sharding EMEA 2022-06-13
    • 1

      Multi-writes in a sharded cluster (updates with multi: true and deletes with justOne: false) do not perform version checking, because they are broadcast to all shards in the cluster. Such operations attach the special value ChunkVersion::IGNORED, which indicates that the operation comes from a router (as opposed to a direct connection to a shard) but that the shard must not perform version checking, on the assumption that the caller knows what it is doing.

      However, ChunkVersion::IGNORED still triggers a StaleShardVersion exception when the shard's shardVersion is UNKNOWN or when the shard is in a critical section.
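
The behaviour described above can be pictured with a small toy model. This is only a sketch for illustration, not the server's C++ code: the function name `check_shard_version`, its parameters, and the order of the checks are assumptions, and `None` is used here to stand for an UNKNOWN shard version.

```python
class StaleShardVersion(Exception):
    """Toy stand-in for the error a shard returns to the router."""


# Sentinel standing in for ChunkVersion::IGNORED (hypothetical representation).
IGNORED = object()


def check_shard_version(received, known_version, in_critical_section):
    """Toy model of the shard-side check.

    received:            version attached by the router, or IGNORED
    known_version:       the shard's own version, or None if UNKNOWN
    in_critical_section: True while a migration holds the critical section
    """
    # These two conditions raise even when the router sent IGNORED.
    if in_critical_section:
        raise StaleShardVersion("shard is in a critical section")
    if known_version is None:
        raise StaleShardVersion("shard version is UNKNOWN")

    # IGNORED skips the actual version comparison.
    if received is IGNORED:
        return

    if received != known_version:
        raise StaleShardVersion("version mismatch")
```

In this model a multi-write carrying `IGNORED` succeeds against any known version, but still fails while `in_critical_section` is true, which is the case the ticket is concerned with.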

      The former is not a big problem, since it happens only once in the lifetime of a shard's mongod process. The latter is problematic, since it may exhaust the 10 retry attempts that the router allows.
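
To make the exhaustion scenario concrete, here is a minimal simulation. It assumes the limit means 10 attempts in total; `ToyShard`, `router_multi_write`, and the `RuntimeError` used to signal exhaustion are hypothetical names for this sketch, not the router's real interfaces.

```python
MAX_ROUTER_ATTEMPTS = 10  # the limit mentioned in the ticket, modelled as total attempts


class StaleShardVersion(Exception):
    """Toy stand-in for the error a shard returns to the router."""


class ToyShard:
    """Shard that answers stale for a fixed number of requests.

    stale_responses models how many router attempts land while some
    chunk migration is holding the critical section.
    """

    def __init__(self, stale_responses):
        self.stale_responses = stale_responses

    def multi_write(self):
        if self.stale_responses > 0:
            self.stale_responses -= 1
            raise StaleShardVersion("shard is in a critical section")
        return "ok"


def router_multi_write(shard):
    """Retry the write until it succeeds or the attempt budget runs out."""
    for attempt in range(1, MAX_ROUTER_ATTEMPTS + 1):
        try:
            return shard.multi_write(), attempt
        except StaleShardVersion:
            continue  # a real router would refresh its routing info here
    raise RuntimeError("multi-write failed: retry attempts exhausted")
```

With `ToyShard(3)` the write succeeds on the fourth attempt. With `ToyShard(10)`, standing for back-to-back migrations, every attempt is rejected and the operation fails even though no individual critical section lasted long.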

      This ticket is to come up with a scheme in which StaleShardVersion exceptions for multi-writes are retried at the level of the shard and do not bubble all the way up to the router.
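
One possible shape for such a scheme is sketched below. It is not the fix that was eventually implemented, which this ticket does not describe. The helper `run_multi_write_on_shard`, its two callbacks, and the `max_local_attempts` bound are all hypothetical.

```python
class StaleShardVersion(Exception):
    """Toy stand-in for the error raised by the shard-side version check."""


def run_multi_write_on_shard(execute_write, wait_until_ready, max_local_attempts=100):
    """Retry a multi-write locally instead of returning the error to the router.

    execute_write:    callable that applies the write, or raises StaleShardVersion
    wait_until_ready: callable that blocks until the condition that caused the
                      error has cleared (for example, the critical section ends)
    """
    for _ in range(max_local_attempts):
        try:
            return execute_write()
        except StaleShardVersion:
            # Handle the stale condition on the shard, then try again;
            # the router's own retry budget is left untouched.
            wait_until_ready()
    # Give up eventually so that a stuck shard cannot block the caller forever.
    raise StaleShardVersion("still stale after local retries")
```

The point of the design is that the router sees a single, slower response rather than a series of errors counted against its retry limit.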

            Assignee: Jordi Serra Torrens (jordi.serra-torrens@mongodb.com)
            Reporter: Kaloian Manassiev (kaloian.manassiev@mongodb.com)
            Votes: 1
            Watchers: 13

              Created:
              Updated:
              Resolved: