[SERVER-36126] moveChunk fails on the source shard when receiving a NotMaster error. Created: 13/Jul/18  Updated: 06/Dec/22  Resolved: 29/Jul/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.6, 4.0.0, 4.0.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Blake Oler Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-31922 Make the migration chunk cloner sourc... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:
Linked BF Score: 8

 Description   

A new test, change_stream_shard_failover.js, exposes problematic behavior when moveChunk operations coincide with stepdowns.

In a chunk migration:
When the primary node of the recipient shard steps down while receiving a chunk, it returns a NotMaster error to the source shard. The source shard then fails the entire operation because of that NotMaster error. In the aforementioned test, this happens during a shardCollection: the ReplicaSetMonitor on the source shard has outdated information about which connection to target.

In discussions with esha.maharishi and kaloian.manassiev, we concluded that parts of moveChunk aren't retryable: the protocol acts like a state machine, and each state must follow one specific preceding state. For example, we can't transfer modifications before cloning has completed. Additionally, if an old recipient's cloning process were still running when a new recipient stepped in, both recipients could attempt to siphon data from the source shard.
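To illustrate the ordering constraint described above, here is a minimal sketch of a state machine in which each state may only be entered from one specific predecessor. The state names, the `Migration` class, and the `advance` method are hypothetical and chosen for illustration; they are not the server's actual migration states or API.

```python
from enum import Enum, auto


class MigrationState(Enum):
    # Hypothetical state names, loosely following the steps named in this ticket.
    NOT_STARTED = auto()
    CLONING = auto()
    TRANSFERRING_MODS = auto()
    COMMITTED = auto()


# Each state may only be entered from exactly one preceding state.
ALLOWED_PREDECESSOR = {
    MigrationState.CLONING: MigrationState.NOT_STARTED,
    MigrationState.TRANSFERRING_MODS: MigrationState.CLONING,
    MigrationState.COMMITTED: MigrationState.TRANSFERRING_MODS,
}


class Migration:
    def __init__(self):
        self.state = MigrationState.NOT_STARTED

    def advance(self, new_state):
        """Move to new_state, rejecting any out-of-order transition."""
        if ALLOWED_PREDECESSOR.get(new_state) is not self.state:
            raise RuntimeError(
                f"cannot enter {new_state.name} from {self.state.name}"
            )
        self.state = new_state
```

Under this model, calling `advance(MigrationState.TRANSFERRING_MODS)` on a fresh `Migration` raises, which mirrors the rule that modifications can't be transferred before cloning. It also suggests why a blind retry from the middle of the protocol is unsafe: the retried step would arrive in a state that does not expect it.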

In the failures seen in the linked BF, we fail right after the cloning process begins, which is the first step in a moveChunk. We have identified an area where we can run a loop and retry on NotMaster failures. We will also guarantee that the session ID is regenerated on each attempt, preventing the old recipient, now a secondary, from continuing with the protocol.
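The proposed retry could look roughly like the following sketch. It is not the server's implementation: `NotMasterError`, `start_clone_with_retry`, the `send_start_clone` callback, and the `max_attempts` bound are all assumptions made for illustration. The one property taken from the description above is that every attempt uses a freshly generated session ID.

```python
import uuid


class NotMasterError(Exception):
    """Hypothetical stand-in for the NotMaster error returned by a stepped-down recipient."""


def start_clone_with_retry(send_start_clone, max_attempts=3):
    """Retry only the first migration step (starting the clone) on NotMaster.

    send_start_clone is a hypothetical callback that sends the start-clone
    request to the recipient using the given session ID. The attempt bound
    is an assumption; the ticket does not say how many retries are allowed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        # A new session ID per attempt, so a recipient still holding an
        # older ID can no longer take part in the protocol.
        session_id = uuid.uuid4().hex
        try:
            send_start_clone(session_id)
            return session_id
        except NotMasterError as exc:
            # A real implementation would presumably also need to re-target
            # the new primary before the next attempt.
            last_error = exc
    raise last_error
```

The loop wraps only the first step on purpose: later steps depend on the state built up by earlier ones, so retrying them in isolation would violate the ordering that the protocol requires.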

