Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.2.0-rc0, 8.0.12
Affects Version/s: None
Component/s: None
Labels:
None

Assigned Teams:

Replication
Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v8.0
Sprint:
Repl 2025-04-14
Linked BF Score:
200
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

SPM-2322 introduced a quiesced state for resharding operations, which follows the resharding committing and aborting states. In this state, metadata documents persist on a node for a default of 15 minutes so that users can query the database for the results of resharding operations. This only happens when the user passes a reshardingUUID parameter to the command.

In magic restore (and the restore procedure on cloud), we attempt to abort any in-progress resharding operations. We do this by searching the config.reshardingOperations collection for any document with state != committing and setting its state to aborting. However, this predicate also matches documents in the quiesced state. Since the quiesced state occurs only after a resharding operation has committed or aborted, it is not logically correct to move such a document to aborting. In the case of a successful resharding operation, the data has already been resharded.

Note that when a node receives an explicit abort command for a resharding operation in the quiesced state, the node just ends the quiesced state early. The resharding operation still succeeded.

We should modify the restore check so that it matches only documents whose state is not in ["committing", "aborting", "quiesced"]. We should also add test cases for quiesced and aborted metadata documents to the existing restore resharding tests.
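The difference between the current and proposed predicates can be sketched as follows. This is a minimal illustration, not the restore code itself: the helper names (`old_abort_filter`, `new_abort_filter`, `matches`, `select_ids`), the sample documents, and the `example-in-progress` state value are hypothetical, and the small matcher only mimics the `$ne` and `$nin` query operators for top-level fields.

```python
# States that the restore procedure should leave alone (from the ticket text).
SETTLED_STATES = ["committing", "aborting", "quiesced"]


def old_abort_filter():
    """Predicate as described for the current behavior: anything but 'committing'."""
    return {"state": {"$ne": "committing"}}


def new_abort_filter():
    """Proposed predicate: skip committing, aborting, and quiesced documents."""
    return {"state": {"$nin": SETTLED_STATES}}


def matches(doc, query):
    """Hypothetical in-memory matcher supporting only $ne and $nin."""
    for field, condition in query.items():
        value = doc.get(field)
        for op, operand in condition.items():
            if op == "$ne" and value == operand:
                return False
            if op == "$nin" and value in operand:
                return False
    return True


def select_ids(docs, query):
    """Return the _id of every document the query would mark as aborting."""
    return [d["_id"] for d in docs if matches(d, query)]


# Hypothetical coordinator documents; 'example-in-progress' is a placeholder
# for any state that precedes committing or aborting.
SAMPLE_DOCS = [
    {"_id": "a", "state": "example-in-progress"},
    {"_id": "b", "state": "committing"},
    {"_id": "c", "state": "aborting"},
    {"_id": "d", "state": "quiesced"},
]

if __name__ == "__main__":
    print(select_ids(SAMPLE_DOCS, old_abort_filter()))
    print(select_ids(SAMPLE_DOCS, new_abort_filter()))
```

With the old predicate the quiesced document is selected for abort along with the in-progress one; with the proposed predicate only the in-progress document is selected.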

Although this is rare, the impact is that a user querying a quiesced resharding coordinator document after a restore will see an aborted operation when in fact the data has already been resharded.

is caused by

SERVER-100421 Resharding failure leads to all values inserted as zeroes in atlas log ingestion

Closed

related to

SERVER-103191 getCoordinatorDoc May Fail If Called From Retry Loop Which Deletes It

Closed

SERVER-103794 Make resharding_perform_verification.js specify nss when finding resharding coordinator doc

Closed

Assignee: Ali Mir
Reporter: Ali Mir
Participants: Ali Mir
Votes: 0
Watchers: 5

Created: Apr 01 2025 07:13:20 PM UTC
Updated: Jun 25 2025 09:12:50 PM UTC
Resolved: Apr 04 2025 02:50:40 AM UTC
