[SERVER-31398] _configsvrMovePrimary retries fail if clone from old primary completed in a previous attempt Created: 04/Oct/17 Updated: 06/Dec/22 Resolved: 09/Oct/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | PM-1017, sharding-causes-bfs-hard | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Assigned Teams: |
Sharding
|
| Operating System: | ALL | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 26 | ||||||||||||||||
| Description |
|
If the CSRS primary steps down while executing _configsvrMovePrimary, after it has already told the toShard to clone the unsharded collections from the old primary shard, the clone still completes because the shards do not know the command was interrupted. Meanwhile, mongos receives a retryable error and, because it uses RetryPolicy::kIdempotent, retries _configsvrMovePrimary once the next primary steps up. When the new primary sends the clone command to the toShard again, the clone fails because the namespaces to be cloned already exist on the toShard, and the whole command fails (unless there were no unsharded collections). |
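The sequence described above can be modelled with a toy simulation. This is a minimal sketch, not MongoDB code: `ToShard`, `clone_unsharded`, `config_move_primary`, `mongos_move_primary`, `RetryableError`, and `NamespaceExists` are all hypothetical names. The only behaviour carried over from the ticket is that the clone fails when a namespace already exists on the toShard, and that mongos retries after a retryable error.

```python
class RetryableError(Exception):
    """Stands in for the retryable error mongos sees on a CSRS stepdown."""


class NamespaceExists(Exception):
    """Stands in for the clone failure on an already-existing namespace."""


class ToShard:
    """Toy recipient shard: tracks which namespaces it already holds."""

    def __init__(self):
        self.namespaces = set()

    def clone_unsharded(self, namespaces):
        # Non-idempotent clone: any pre-existing namespace is an error.
        for ns in namespaces:
            if ns in self.namespaces:
                raise NamespaceExists(ns)
        self.namespaces.update(namespaces)


def config_move_primary(to_shard, namespaces, step_down_after_clone=False):
    """Toy config server handler: ask the toShard to clone, then finish."""
    to_shard.clone_unsharded(namespaces)
    if step_down_after_clone:
        # The shard finished cloning, but the caller only sees a stepdown.
        raise RetryableError("primary stepped down")
    return "ok"


def mongos_move_primary(to_shard, namespaces):
    """Toy mongos: retry once on a retryable error (idempotent retry policy)."""
    try:
        return config_move_primary(to_shard, namespaces, step_down_after_clone=True)
    except RetryableError:
        # The new primary re-sends the clone to the same toShard.
        return config_move_primary(to_shard, namespaces)
```

In this model, `mongos_move_primary(ToShard(), ["db.coll"])` raises `NamespaceExists` on the retry, while an empty namespace list succeeds, mirroring the "unless there were no unsharded collections" caveat.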
| Comments |
| Comment by Kaloian Manassiev [ 09/Oct/20 ] |
|
This is a deficiency of MovePrimary that we will not address; instead, we will make it use the moveChunk functionality for unsharded collections. |
| Comment by Kaloian Manassiev [ 30/Sep/20 ] |
|
This was marked as dependent on PM-1645, but there is nothing in that project that would have addressed it. Putting it in Needs Triage so we can officially decide to close it. |
| Comment by Esha Maharishi (Inactive) [ 11/May/18 ] |
|
Hm, I think this failure would still occur after Enable Safe Migrations... but the "silently skip new data" part would go away, so maybe that fix would make sense at that point. It could go in either the "Enable Safe Migrations" or "Sharding Task Queue" epics. |
| Comment by Kaloian Manassiev [ 11/May/18 ] |
This sounds like a bad idea |
| Comment by Esha Maharishi (Inactive) [ 11/May/18 ] |
|
kaloian.manassiev, yes, I suppose the clone command could be made idempotent, so that it succeeds if it already has all the collections? Once movePrimary does not move actual collection data, this might be reasonable. I'm not opposed to just blacklisting these tests as a temporary fix, though. |
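The idempotent clone suggested in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `ToShard` and `clone_unsharded_idempotent` are hypothetical names, not MongoDB APIs, and the check looks only at whether a namespace exists, not at whether its contents are complete.

```python
class ToShard:
    """Toy recipient shard: tracks which namespaces it already holds."""

    def __init__(self):
        self.namespaces = set()

    def clone_unsharded_idempotent(self, namespaces):
        """Clone only the missing namespaces; succeed if none are missing.

        Returns the list of namespaces actually cloned by this call.
        """
        missing = [ns for ns in namespaces if ns not in self.namespaces]
        for ns in missing:
            # Placeholder for copying the collection from the old primary.
            self.namespaces.add(ns)
        return missing
```

With this shape, a second call with the same namespaces returns an empty list instead of failing, so a retried command can proceed. Because existence is the only check, a collection that changed on the old primary between attempts would not be re-copied.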
| Comment by Kaloian Manassiev [ 11/May/18 ] |
|
Is this just a result of _configsvrMovePrimary not being idempotent, rather than a matter of whether or not it got interrupted on the original primary, which stepped down? I.e. - why is it related to esha.maharishi - how are we going to fix it? Making movePrimary idempotent without a resumable task queue is impossible. We should just blacklist these tests from the stepdown suite. |
| Comment by Esha Maharishi (Inactive) [ 01/May/18 ] |
|
If we don't plan to fix this soon, we may want to blacklist the failing tests from the config stepdown suite. |