TL;DR: if the primary node of the donor shard steps down while the cloning phase of a movePrimary operation is in progress, the cloning procedure on the recipient side is not aborted. This leaves orphaned collections on the recipient and causes any subsequent attempt to repeat the movePrimary operation to fail with a NamespaceExists error.
During the cloning phase of the movePrimary operation, the DDL coordinator invokes the _shardsvrCloneCatalogData command on the recipient, which creates all unsharded collections of the database on the recipient and copies their data from the donor. In the event of a failure (e.g. a step-down) during this phase, the coordinator drops any data that may already have been cloned on the recipient and aborts the movePrimary operation.
The bug is that the coordinator does not abort the cloning procedure that may still be running on the recipient. Cleaning up the data already cloned on the recipient does not resolve the problem, because the cloning procedure may still be running in the background and recreate collections after the cleanup.
As a result, the recipient shard can end up owning orphaned collections, which cause any attempt to repeat the movePrimary operation to fail. There is no evident business impact (the data remains consistent), but manual intervention on the recipient is required to drop the orphaned collections before a new movePrimary attempt can succeed.
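As a workaround, the orphaned collections can be dropped by connecting directly to the primary of the recipient shard before retrying the operation. A minimal mongosh sketch, assuming illustrative names ("testDB", "orphanedColl", "shard0001" are not taken from the original report):

```js
// On the primary of the recipient shard (direct connection, not via mongos):
// drop the collection left behind by the interrupted cloning phase.
// "testDB" and "orphanedColl" are purely illustrative names.
db.getSiblingDB("testDB").getCollection("orphanedColl").drop();

// Then, from a mongos, retry the movePrimary operation:
db.adminCommand({ movePrimary: "testDB", to: "shard0001" });
```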
The cloning phase of a movePrimary operation is very expensive in terms of execution time (in production it can take hours), so the cloning operation on the recipient side must be aborted rather than joined. One idea is to tag the _shardsvrCloneCatalogData operation and kill it (using the tag) when the movePrimary operation is recovered by the coordinator, before cleaning up any cloned data.
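Pending a server-side fix, the same effect can be approximated operationally: locate the in-flight _shardsvrCloneCatalogData operation on the recipient with the $currentOp aggregation stage and kill it by opid. A hedged mongosh sketch (the command-field match relies on currentOp reporting the command name as a field of `command`; run against the recipient's primary):

```js
// On the primary of the recipient shard: find the in-flight
// _shardsvrCloneCatalogData operation and kill it by opid.
db.getSiblingDB("admin")
  .aggregate([
    { $currentOp: { allUsers: true } },
    { $match: { "command._shardsvrCloneCatalogData": { $exists: true } } }
  ])
  .forEach(function (op) {
    db.killOp(op.opid);
  });
```

This is only an operational mitigation; the proper fix is for the coordinator itself to interrupt the cloning before performing its cleanup.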