Core Server / SERVER-36126

moveChunk fails on the source shard when receiving a NotMaster error.

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.6.6, 4.0.0, 4.0.1
    • Component/s: Sharding
    • Labels: None
    • Operating System: ALL

      A new test, change_stream_shard_failover.js, exposes problematic behavior involving moveChunk and stepdowns.

      In a chunk migration:
      When the recipient shard's primary steps down while receiving a chunk, it returns a NotMaster error to the source shard, and the source shard fails the entire operation. In the aforementioned test, this happens during a shardCollection: the ReplicaSetMonitor on the source shard has outdated information about which connection to target.
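      As a self-contained illustration of that pre-fix behavior, here is a minimal C++ sketch. The types and the sendRecvChunkStart function are stand-ins invented for this example, not the server's actual symbols (the real donor-to-recipient command is _recvChunkStart):

      #include <iostream>
      #include <string>

      // Stand-ins for server types; illustrative only.
      enum class ErrorCode { OK, NotMaster };
      struct Response { ErrorCode code; std::string reason; };

      // Simulated network call: the source shard asks the recipient's
      // primary to begin cloning, but that primary has just stepped down.
      Response sendRecvChunkStart() {
          return {ErrorCode::NotMaster, "recipient primary stepped down"};
      }

      int main() {
          // Pre-fix behavior: a single NotMaster aborts the whole
          // migration, even though no chunk data has moved yet.
          Response r = sendRecvChunkStart();
          if (r.code != ErrorCode::OK) {
              std::cout << "moveChunk failed: " << r.reason << "\n";
              return 1;
          }
          std::cout << "clone started\n";
          return 0;
      }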

      In discussions with esha.maharishi and kaloian.manassiev, we concluded that parts of moveChunk aren't retryable: the protocol acts like a state machine, and each state must follow a specific predecessor. For example, we can't transfer modifications before cloning has completed. Additionally, if an old recipient's cloning process were still running when a new recipient stepped in, both recipients could attempt to siphon data from the source shard.
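      To make the ordering constraint concrete, here is an illustrative sketch; the state names are ours, and the real protocol has more states than this:

      // Each migration state must follow a specific predecessor.
      enum class MigrationState {
          kCloning,   // recipient copies the chunk's documents from the source
          kCatchup,   // source transfers modifications made during cloning
          kCommit,    // the new chunk owner is recorded
          kDone,
      };

      // Modifications (kCatchup) cannot be transferred before cloning
      // completes, so naively resuming "wherever we were" after a
      // stepdown is unsafe.
      bool isLegalTransition(MigrationState from, MigrationState to) {
          switch (from) {
              case MigrationState::kCloning: return to == MigrationState::kCatchup;
              case MigrationState::kCatchup: return to == MigrationState::kCommit;
              case MigrationState::kCommit:  return to == MigrationState::kDone;
              case MigrationState::kDone:    return false;
          }
          return false;
      }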

      In the failures seen in the linked BF, we fail right after the cloning process begins, which is the first step of a moveChunk. We have identified a spot where we can retry in a loop on NotMaster failures. We also guarantee that the session ID is regenerated on each attempt, preventing an old recipient primary that has since stepped down from continuing with the protocol.
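      A sketch of that retry loop, again with stand-in types and a simulated recipient so the example runs on its own; in the server, the session ID would come from the migration session machinery rather than a random string:

      #include <iostream>
      #include <random>
      #include <string>

      enum class ErrorCode { OK, NotMaster };
      struct Response { ErrorCode code; std::string reason; };

      // Illustrative session ID: a fresh value per attempt means a stale
      // recipient (an old primary still cloning) no longer matches the
      // current session and cannot keep participating.
      std::string generateSessionId() {
          static std::mt19937_64 rng{std::random_device{}()};
          return "migration-" + std::to_string(rng());
      }

      // Stub for the clone-start call; it fails twice and then succeeds,
      // mimicking the ReplicaSetMonitor catching up to the new primary.
      Response sendRecvChunkStart(const std::string& sessionId) {
          static int calls = 0;
          if (++calls < 3)
              return {ErrorCode::NotMaster, "not master"};
          return {ErrorCode::OK, "clone started under " + sessionId};
      }

      int main() {
          const int kMaxAttempts = 5;
          for (int attempt = 1; attempt <= kMaxAttempts; ++attempt) {
              // Regenerate the session ID on every attempt.
              std::string sessionId = generateSessionId();
              Response r = sendRecvChunkStart(sessionId);
              if (r.code == ErrorCode::OK) {
                  std::cout << r.reason << " after " << attempt << " attempt(s)\n";
                  return 0;
              }
              if (r.code != ErrorCode::NotMaster)
                  break;  // only NotMaster is safely retryable at this step
          }
          std::cout << "moveChunk failed: retries exhausted\n";
          return 1;
      }

      Regenerating the session ID on each attempt is what makes the retry safe: a clone started by a stepped-down recipient no longer matches the current session, so it cannot continue siphoning data from the source shard.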

            Assignee:
            backlog-server-sharding [DO NOT USE] Backlog - Sharding Team
            Reporter:
            blake.oler@mongodb.com Blake Oler
            Votes:
            0
            Watchers:
            2

              Created:
              Updated:
              Resolved: