[SERVER-31398] _configsvrMovePrimary retries fail if clone from old primary completed in a previous attempt Created: 04/Oct/17  Updated: 06/Dec/22  Resolved: 09/Oct/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Won't Fix Votes: 0
Labels: PM-1017, sharding-causes-bfs-hard
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-32142 `movePrimary` can leave orphaned data... Closed
is related to SERVER-31526 Replace use of ScopedDbConnection in ... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:
Linked BF Score: 26

 Description   

If the CSRS primary steps down while executing _configsvrMovePrimary, after it has already told the toShard to clone the unsharded collections from the old primary shard, the clone still runs to completion because the shards do not know the command was interrupted. Meanwhile, mongos receives a retryable error and, because it uses RetryPolicy::kIdempotent, retries _configsvrMovePrimary once the next primary steps up. When the new primary sends the clone command to the toShard again, the clone fails because the namespaces to be cloned already exist on the toShard. This makes the whole command fail (unless there were no unsharded collections to clone).
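
The sequence can be reproduced with a small Python model. This is only a sketch of the failure mode, not MongoDB code: `ToShard`, `config_move_primary`, `mongos_move_primary`, and both exception classes are hypothetical stand-ins, and the assumption that the clone rejects any namespace that already exists is taken from the description above.

```python
class NamespaceExists(Exception):
    """Hypothetical error: the clone target already holds the namespace."""


class RetryableError(Exception):
    """Hypothetical stand-in for the retryable error mongos sees on a stepdown."""


class ToShard:
    """Toy recipient shard: a dict of namespace -> list of documents."""

    def __init__(self):
        self.collections = {}

    def clone_from(self, source, namespaces):
        # Assumption from the ticket: the clone fails if a namespace already exists.
        for ns in namespaces:
            if ns in self.collections:
                raise NamespaceExists(ns)
        for ns in namespaces:
            self.collections[ns] = list(source[ns])


def config_move_primary(to_shard, source, namespaces, step_down_after_clone):
    """Toy config server command: ask the shard to clone, then maybe step down."""
    to_shard.clone_from(source, namespaces)  # the shard finishes regardless
    if step_down_after_clone:
        raise RetryableError("config primary stepped down")
    return "ok"


def mongos_move_primary(to_shard, source, namespaces, stepdowns=1, max_attempts=3):
    """Toy router: retries on retryable errors, as an idempotent retry policy would."""
    for attempt in range(max_attempts):
        try:
            return config_move_primary(
                to_shard, source, namespaces, step_down_after_clone=attempt < stepdowns
            )
        except RetryableError:
            continue  # the next attempt reaches the new config primary
    raise RetryableError("retries exhausted")
```

In this model, a call with one unsharded collection and a single stepdown copies the data on the first attempt and then raises `NamespaceExists` on the retry, while a call with an empty namespace list succeeds on the retry, matching the exception noted in the description.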



 Comments   
Comment by Kaloian Manassiev [ 09/Oct/20 ]

This is a deficiency of movePrimary that we will not address; instead, we will make it use the moveChunk functionality for unsharded collections.

Comment by Kaloian Manassiev [ 30/Sep/20 ]

This was marked as dependent on PM-1645, but there is nothing in that project that would have addressed it.

Putting in Needs Triage to decide officially to close SERVER-32142 and SERVER-31398 as won't fix because we will likely not fix movePrimary for older versions and these are not data-loss bugs, but more of a test nuisance.

Comment by Esha Maharishi (Inactive) [ 11/May/18 ]

Hm, I think this failure would still occur after Enable Safe Migrations... but the "silently skip new data" part would go away, so maybe that fix would make sense at that point. It could go in either the "Enable Safe Migrations" or "Sharding Task Queue" epics.

Comment by Kaloian Manassiev [ 11/May/18 ]

"I suppose the clone command could be made idempotent, so that it succeeds if it already has all the collections?"

This sounds like a bad idea. It means the clone can silently skip new data that was added to these collections, which is a total change in behaviour. With the project to enable migrations of unsharded collections, this is also totally unnecessary.
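
To make the concern concrete, here is a minimal Python sketch. It is an illustration only: `idempotent_clone` is a hypothetical helper, plain dicts stand in for the two shards, and "already has the collection" is assumed to mean a check on the namespace alone.

```python
def idempotent_clone(source, target, namespaces):
    """Copy each namespace unless the target already has it; return the skipped ones."""
    skipped = []
    for ns in namespaces:
        if ns in target:
            # Treated as "already cloned"; the contents are never compared.
            skipped.append(ns)
            continue
        target[ns] = list(source[ns])
    return skipped


source = {"test.a": [1, 2]}
target = {}

idempotent_clone(source, target, ["test.a"])  # first attempt copies both documents
source["test.a"].append(3)                    # a write lands on the old primary shard
skipped = idempotent_clone(source, target, ["test.a"])  # retry reports success

# Documents present on the source but absent from the target after the retry.
missing = [doc for doc in source["test.a"] if doc not in target["test.a"]]
```

The retry returns without error, yet `missing` is non-empty: the document written between the two attempts never reaches the target.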

Comment by Esha Maharishi (Inactive) [ 11/May/18 ]

kaloian.manassiev, yes, SERVER-32142 is the reason. SERVER-31526 is just loosely related because it also makes movePrimary fail on transient errors, except between the config and shard rather than mongos and config.

I suppose the clone command could be made idempotent, so that it succeeds if it already has all the collections? Once movePrimary does not move actual collection data, this might be reasonable. I'm not opposed to just blacklisting these tests as a temporary fix, though.

Comment by Kaloian Manassiev [ 11/May/18 ]

Is this just a result of _configsvrMovePrimary not being idempotent, rather than a matter of whether or not it got interrupted on the original primary, which stepped down? That is, why is it related to SERVER-31526 at all? It seems like SERVER-32142 is the reason.

esha.maharishi - how are we going to fix it? Making movePrimary idempotent without a resumable task queue is impossible.

We should just blacklist these tests from the stepdown suite.

Comment by Esha Maharishi (Inactive) [ 01/May/18 ]

If we don't plan to fix this soon, we may want to blacklist the failing tests from the config stepdown suite.

Generated at Thu Feb 08 04:26:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.