[SERVER-32142] `movePrimary` can leave orphaned data when it aborts after cloning
Created: 01/Dec/17  Updated: 06/Dec/22  Resolved: 09/Oct/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.18, 3.4.10, 3.6.0, 3.7.1
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Randolph Tan
Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Won't Fix
Votes: 0
Labels: pm-1051-legacy-tickets, sharding-causes-bfs-hard
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-46425 Consider increasing wtimeout for clon... Backlog
related to SERVER-46424 _cloneCatalogData remote call is labe... Closed
is related to SERVER-31398 _configsvrMovePrimary retries fail if... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:
Linked BF Score: 27

 Description   

If a movePrimary command clones the database but fails to complete (for example, because of a stepdown), it leaves the database and its collections on the original shard. Retrying the command does nothing, because the database's primary has already been officially moved to the other shard, so the unsharded collections remain orphaned on the old primary shard.

There is also a variant in which the command fails after it successfully clones but before it updates config.databases. In this scenario, retrying the command makes it attempt the clone again, which fails with a "collection already exists" error.
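
The two failure modes above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server's implementation: `ToyCluster`, `CollectionExists`, the `fail_after` parameter, and the shard and collection names are all hypothetical, and the three-step ordering (clone, update the primary entry, drop from the old primary) is inferred from this description rather than taken from the code.

```python
class CollectionExists(Exception):
    """Hypothetical stand-in for the 'collection already exists' error."""


class ToyCluster:
    """Toy model of a cluster with one database that has two unsharded collections."""

    def __init__(self):
        # Collections physically present on each (hypothetical) shard.
        self.shards = {"shardA": {"coll1", "coll2"}, "shardB": set()}
        # Stands in for the database's entry in config.databases.
        self.primary = "shardA"

    def move_primary(self, to, fail_after=None):
        src = self.primary
        if src == to:
            # The metadata already names `to` as primary, so there is nothing to do.
            return "no-op"

        # Step 1: clone every collection onto the recipient shard.
        for coll in sorted(self.shards[src]):
            if coll in self.shards[to]:
                raise CollectionExists(coll)
            self.shards[to].add(coll)
        if fail_after == "clone":
            raise RuntimeError("aborted after clone")

        # Step 2: record the new primary (the config.databases update).
        self.primary = to
        if fail_after == "commit":
            raise RuntimeError("aborted after commit")

        # Step 3: remove the now-redundant copies from the old primary.
        self.shards[src].clear()
        return "moved"
```

In this model, aborting with `fail_after="commit"` and then retrying returns `"no-op"` while `shardA` still holds both collections (the orphaned-data case). Aborting with `fail_after="clone"` and retrying raises `CollectionExists`, because the recipient already holds the cloned collections (the variant case).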



 Comments   
Comment by Kaloian Manassiev [ 09/Oct/20 ]

This is a deficiency of movePrimary that we will not address; instead, we will make it use the moveChunk functionality for unsharded collections.

Comment by Kaloian Manassiev [ 30/Sep/20 ]

Putting this in Needs Triage to decide officially whether to close SERVER-32142 and SERVER-31398 as Won't Fix, because we will likely not fix movePrimary for older versions and these are not data-loss bugs but more of a test nuisance.

Comment by Randolph Tan [ 21/Dec/17 ]

Just realized there's already an existing ticket for the variant failure - SERVER-31398.

Comment by Randolph Tan [ 01/Dec/17 ]

kaloian.manassiev No

Comment by Kaloian Manassiev [ 01/Dec/17 ]

renctan, is this any different than movePrimary failing in any version prior to 3.6?

Generated at Thu Feb 08 04:29:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.