Type: Bug
Resolution: Fixed
Priority: Critical - P2
Fix Version/s: 5.2.0, 5.0.5, 5.1.1
Affects Version/s: 5.0.0, 5.1.0
Component/s: Sharding
Labels: sharding-nyc-subteam1
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v5.1, v5.0
Sprint: Sharding 2021-11-29
Story Points: 2
Confidence Status: None
Work Order: 0
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The ReshardingCoordinator relies on an exception being thrown and its .onError() handler being called to trigger its _shardsvrAbortReshardCollection flow. However, the ReshardingCoordinator fails to read the current state of the coordinator document to trigger the _shardsvrAbortReshardCollection flow when an earlier config server primary had already decided the resharding operation must abort. The lack of the .onError() handler being called leads the ReshardingCoordinator to attempt to commit the resharding operation anyway. This is severely problematic because the resulting collection will be incomplete and inconsistent (i.e. lost writes).

- Shards which had already received the _shardsvrAbortReshardCollection command from the earlier config server primary's resharding coordinator may have dropped the temporary resharding collection already. These shards effectively ignore the _shardsvrCommitReshardCollection command.
- Other shards which erroneously receive the _shardsvrCommitReshardCollection command will rename the temporary resharding collection over the source collection.
  - Even shards which voted to abort the resharding operation (e.g. after an unrecoverable error during collection cloning or oplog application) can still rename the temporary resharding collection over the source collection.
  - However, shards which aren't in the "strict-consistency" state (recipient role) and aren't in the "blocking-writes" state (donor role) will reject the _shardsvrCommitReshardCollection command. The ReshardCollectionInProgress error response returned to the resharding coordinator will lead the config server primary to fassert(). While the fassert(5277000) is an indicator of this issue occurring, it isn't guaranteed that any shards will still be in a state to detect the resharding coordinator having delivered different decisions to different shards.
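
The failure mode described above can be modelled in a few lines. The following Python sketch is not the server's C++ implementation and is not taken from the actual fix; the names `CoordinatorState`, `CoordinatorDoc`, `next_action_buggy`, and `next_action_fixed` are hypothetical. It only illustrates the difference between deciding to abort solely because an error reached the handler in the current process, and also consulting the persisted coordinator document after a new config server primary steps up.

```python
from dataclasses import dataclass
from enum import Enum


class CoordinatorState(Enum):
    # Hypothetical subset of coordinator states, for illustration only.
    BLOCKING_WRITES = "blocking-writes"
    ABORTING = "aborting"


@dataclass
class CoordinatorDoc:
    # Stand-in for the persisted coordinator document.
    state: CoordinatorState


def next_action_buggy(doc: CoordinatorDoc, error_raised: bool) -> str:
    # Aborts only when an error was raised in this process; an abort
    # decision persisted by an earlier primary is never looked at.
    return "abort" if error_raised else "commit"


def next_action_fixed(doc: CoordinatorDoc, error_raised: bool) -> str:
    # Also honors an abort decision that is already recorded in the
    # persisted document, even if no error was raised in this process.
    if error_raised or doc.state is CoordinatorState.ABORTING:
        return "abort"
    return "commit"
```

With a document already in the aborting state and no new error, the first function would proceed to commit while the second would abort, which is the divergence this ticket describes.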

{"t":{"$date":"2021-11-14T16:37:49.291+00:00"},"s":"E",  "c":"ASSERT",   "id":4457000, "ctx":"conn84","msg":"Tripwire assertion","attr":{"error":{"code":338,"codeName":"ReshardCollectionInProgress","errmsg":"Attempted to commit the resharding operation in an incorrect state"},"location":"{fileName:\"src/mongo/db/s/resharding/resharding_recipient_service.cpp\", line:918, functionName:\"operator()\"}"}}
{"t":{"$date":"2021-11-14T16:38:00.557+00:00"},"s":"F",  "c":"RESHARD",  "id":5277000, "ctx":"ReshardingCoordinatorService-1","msg":"Unrecoverable error past the point resharding was guaranteed to succeed","attr":{"error":"ReshardCollectionInProgress: Failed command { _shardsvrCommitReshardCollection: \"reshardingDb.coll\", reshardingUUID: UUID(\"4755e8fb-35ab-4306-b832-c3a81b44b8d1\"), writeConcern: { w: \"majority\" }, $audit: { $impersonatedUsers: [ { user: \"__system\", db: \"local\" } ], $impersonatedRoles: [] } } for database 'admin' on shard 'shard1-recipient0' :: caused by :: Attempted to commit the resharding operation in an incorrect state"}}
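
Because fassert(5277000) is an indicator of this issue (though, as noted, not a guaranteed one), it may be useful to scan config server logs for it. The sketch below assumes the log contains one JSON object per line, shaped like the two entries above; the function name `find_fatal_resharding_errors` is hypothetical, and lines that are not valid JSON are skipped.

```python
import json

FASSERT_ID = 5277000  # id of the fatal RESHARD log entry quoted in this ticket


def find_fatal_resharding_errors(lines):
    """Return parsed log entries with severity "F" and id 5277000."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Not a structured log line; ignore it.
            continue
        if not isinstance(entry, dict):
            continue
        if entry.get("s") == "F" and entry.get("id") == FASSERT_ID:
            hits.append(entry)
    return hits
```

An empty result does not rule out the bug, since the shards may no longer be in a state that rejects the commit command.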

Thank you to chuck.zhang for discovering this issue while working on the automation restore procedure (which has the config server being started up in the aborting state for the resharding operation).

Is related to:
- SERVER-61482 Updates to config.reshardingOperations wait for PrimaryOnlyService to be rebuilt while holding oplog slot, stalling replication on config server indefinitely (Closed)
- SERVER-61473 Resharding coordinator calls ReshardingMetrics::onCompletion() multiple times on transient errors, leading to config server crash (Closed)
- SERVER-50937 Make resharding coordinator support recovery (Closed)
- SERVER-52770 Add abortReshardCollection command for users to cancel the resharding operation (Closed)

Assignee: Max Hirschhorn
Reporter: Max Hirschhorn
Participants: Githook User, Max Hirschhorn
Votes: 0
Watchers: 7

Created: Nov 15 2021 03:29:04 PM UTC
Updated: Oct 29 2023 09:46:02 PM UTC
Resolved: Nov 17 2021 12:40:25 PM UTC
Confidence Status Last Update: 15/Nov/21 3:30 PM
