Core Server / SERVER-36126

moveChunk fails on the source shard when receiving a NotMaster error.

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.6.6, 4.0.0, 4.0.1
    • Component/s: Sharding
    • Labels: None
    • Operating System: ALL

      A new test, change_stream_shard_failover.js, exposes problematic behavior involving moveChunk and stepdowns.

      In a chunk migration:
      When the recipient shard's primary steps down while receiving a chunk, it returns a NotMaster error to the source shard, and the source shard fails the entire operation. In the aforementioned test, this happens during a shardCollection: the ReplicaSetMonitor on the source shard has outdated information about which connection to target.
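      As a self-contained illustration of that pre-fix behavior, here is a minimal C++ sketch. The types and the sendRecvChunkStart function are stand-ins invented for this example, not the server's actual symbols (the real donor-to-recipient command is _recvChunkStart):

      #include <iostream>
      #include <string>

      // Stand-ins for server types; illustrative only.
      enum class ErrorCode { OK, NotMaster };
      struct Response { ErrorCode code; std::string reason; };

      // Simulated network call: the source shard asks the recipient's
      // primary to begin cloning, but that primary has just stepped down.
      Response sendRecvChunkStart() {
          return {ErrorCode::NotMaster, "recipient primary stepped down"};
      }

      int main() {
          // Pre-fix behavior: a single NotMaster aborts the whole
          // migration, even though no chunk data has moved yet.
          Response r = sendRecvChunkStart();
          if (r.code != ErrorCode::OK) {
              std::cout << "moveChunk failed: " << r.reason << "\n";
              return 1;
          }
          std::cout << "clone started\n";
          return 0;
      }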

      In discussions with esha.maharishi and kaloian.manassiev, we concluded that parts of moveChunk aren't retryable: the protocol acts like a state machine, and each state must follow a specific predecessor. For example, we can't transfer modifications before cloning has completed. Additionally, if an old recipient's cloning process were still running when a new recipient stepped in, both recipients could attempt to siphon data from the source shard.
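      To make the ordering constraint concrete, here is an illustrative sketch; the state names are ours, and the real protocol has more states than this:

      // Each migration state must follow a specific predecessor.
      enum class MigrationState {
          kCloning,   // recipient copies the chunk's documents from the source
          kCatchup,   // source transfers modifications made during cloning
          kCommit,    // the new chunk owner is recorded
          kDone,
      };

      // Modifications (kCatchup) cannot be transferred before cloning
      // completes, so naively resuming "wherever we were" after a
      // stepdown is unsafe.
      bool isLegalTransition(MigrationState from, MigrationState to) {
          switch (from) {
              case MigrationState::kCloning: return to == MigrationState::kCatchup;
              case MigrationState::kCatchup: return to == MigrationState::kCommit;
              case MigrationState::kCommit:  return to == MigrationState::kDone;
              case MigrationState::kDone:    return false;
          }
          return false;
      }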

      In the failures seen in the linked BF, we fail right after the cloning process begins, which is the first step of a moveChunk. We have identified a spot where we can retry in a loop on NotMaster failures. We also guarantee that the session ID is regenerated on each attempt, preventing an old recipient primary that has since stepped down from continuing with the protocol.
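      A sketch of that retry loop, again with stand-in types and a simulated recipient so the example runs on its own; in the server, the session ID would come from the migration session machinery rather than a random string:

      #include <iostream>
      #include <random>
      #include <string>

      enum class ErrorCode { OK, NotMaster };
      struct Response { ErrorCode code; std::string reason; };

      // Illustrative session ID: a fresh value per attempt means a stale
      // recipient (an old primary still cloning) no longer matches the
      // current session and cannot keep participating.
      std::string generateSessionId() {
          static std::mt19937_64 rng{std::random_device{}()};
          return "migration-" + std::to_string(rng());
      }

      // Stub for the clone-start call; it fails twice and then succeeds,
      // mimicking the ReplicaSetMonitor catching up to the new primary.
      Response sendRecvChunkStart(const std::string& sessionId) {
          static int calls = 0;
          if (++calls < 3)
              return {ErrorCode::NotMaster, "not master"};
          return {ErrorCode::OK, "clone started under " + sessionId};
      }

      int main() {
          const int kMaxAttempts = 5;
          for (int attempt = 1; attempt <= kMaxAttempts; ++attempt) {
              // Regenerate the session ID on every attempt.
              std::string sessionId = generateSessionId();
              Response r = sendRecvChunkStart(sessionId);
              if (r.code == ErrorCode::OK) {
                  std::cout << r.reason << " after " << attempt << " attempt(s)\n";
                  return 0;
              }
              if (r.code != ErrorCode::NotMaster)
                  break;  // only NotMaster is safely retryable at this step
          }
          std::cout << "moveChunk failed: retries exhausted\n";
          return 1;
      }

      Regenerating the session ID on each attempt is what makes the retry safe: a clone started by a stepped-down recipient no longer matches the current session, so it cannot continue siphoning data from the source shard.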

            Assignee:
            backlog-server-sharding [DO NOT USE] Backlog - Sharding Team
            Reporter:
            blake.oler@mongodb.com Blake Oler
            Votes:
            0
            Watchers:
            2

              Created:
              Updated:
              Resolved: