[SERVER-57953] _flushReshardingStateChange attempts to refresh shard version while another refresh already pending, leading to invariant failure Created: 22/Jun/21  Updated: 29/Oct/23  Resolved: 08/Jul/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.3, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Blake Oler
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-lifecycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-58343 Re-enable reshard_collection_failover... Closed
Related
related to SERVER-58063 Alias "flushReshardingStateChanges" a... Closed
is related to SERVER-56638 Fix flushReshardingStateChanges criti... Closed
is related to SERVER-57952 Resharding donor shards cannot comple... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Sprint: Sharding 2021-07-12
Participants:
Linked BF Score: 134
Story Points: 2

 Description   

The _flushReshardingStateChange command doesn't attempt to join any shard version refresh that is already running. This was believed to be safe because shard version refreshes are not possible while the critical section is held. However, an earlier _flushReshardingStateChange command that was interrupted by stepdown may already have called CollectionShardingRuntime::setShardVersionRecoverRefreshFuture(), and the RecoverRefreshThread may not yet have finished running the shard version refresh, after which it would have called CollectionShardingRuntime::resetShardVersionRecoverRefreshFuture().

As a result, the later _flushReshardingStateChange command hits the invariant in CollectionShardingRuntime::setShardVersionRecoverRefreshFuture().
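
The failure mode can be modelled with a single slot that holds at most one pending refresh future. The sketch below is a simplified stand-in and not the server code: RefreshSlot, its method names, and the use of an exception in place of a process-aborting invariant are all assumptions made for illustration.

```cpp
#include <cassert>
#include <future>
#include <optional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the per-collection refresh slot; this is NOT the
// real CollectionShardingRuntime.
class RefreshSlot {
public:
    // Models the invariant: registering a refresh while another one is still
    // registered is treated as a programming error. The real server would
    // abort; this model throws so the failure can be observed.
    void setRefreshFuture(std::shared_future<void> f) {
        if (_future) {
            throw std::logic_error("a shard version refresh is already pending");
        }
        _future = std::move(f);
    }

    // Called by the refresh thread once the refresh has finished.
    void resetRefreshFuture() { _future.reset(); }

    std::optional<std::shared_future<void>> getRefreshFuture() const { return _future; }

private:
    std::optional<std::shared_future<void>> _future;
};

// Replays the sequence from the description: a first command registers a
// refresh and is interrupted, then a second command registers another one
// before the refresh thread has reset the slot.
inline bool secondFlushHitsInvariant() {
    RefreshSlot slot;

    std::promise<void> firstRefresh;  // never fulfilled: the refresh is still running
    slot.setRefreshFuture(firstRefresh.get_future().share());
    assert(slot.getRefreshFuture().has_value());

    std::promise<void> secondRefresh;
    try {
        slot.setRefreshFuture(secondRefresh.get_future().share());
    } catch (const std::logic_error&) {
        return true;  // the second registration tripped the modelled invariant
    }
    return false;
}
```

In this model the second registration only succeeds after resetRefreshFuture() has run, which is the window the interrupted command leaves open.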

Proposed solution: Rather than making the _flushReshardingStateChange command join a shard version refresh triggered by an earlier instance of the command, we could instead introduce a new _shardsvrCommitReshardCollection command analogous to the _shardsvrAbortReshardCollection command introduced in SERVER-56638. The _shardsvrCommitReshardCollection command would:

  • Call _coordinatorHasDecisionPersisted.emplaceValue().
  • Wait on DonorStateMachine::getCompletionFuture() and RecipientStateMachine::getCompletionFuture().
  • Wait for the latest optime to become majority-committed.
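
The three steps above can be sketched with standard futures. This is a minimal model under stated assumptions, not the actual command: ParticipantMachine, signalCommit(), isDone(), and the waitForMajorityCommit callback are hypothetical names, and std::launch::deferred is used only to keep the sketch single-threaded (the real state machines run asynchronously).

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>

// Hypothetical model of a donor or recipient state machine.
class ParticipantMachine {
public:
    ParticipantMachine()
        : _decisionFuture(_decision.get_future()),
          // Deferred: the remaining work runs when someone first waits on
          // the completion future. It needs the coordinator's decision.
          _completion(std::async(std::launch::deferred, [this] {
                          _decisionFuture.wait();
                          _done.store(true);
                      }).share()) {}

    // Stands in for _coordinatorHasDecisionPersisted.emplaceValue().
    // Safe to call more than once, so a retried command does not fail.
    void signalCommit() {
        if (!_signalled.exchange(true)) {
            _decision.set_value();
        }
    }

    std::shared_future<void> getCompletionFuture() const { return _completion; }
    bool isDone() const { return _done.load(); }

private:
    // Declaration order matters: _completion's task uses the members above it.
    std::promise<void> _decision;
    std::future<void> _decisionFuture;
    std::atomic<bool> _signalled{false};
    std::atomic<bool> _done{false};
    std::shared_future<void> _completion;
};

// Sketch of the proposed command body. waitForMajorityCommit is an assumed
// hook standing in for waiting on the latest optime.
inline void commitReshardCollection(ParticipantMachine& donor,
                                    ParticipantMachine& recipient,
                                    const std::function<void()>& waitForMajorityCommit) {
    // Step 1: tell both machines that the decision has been persisted.
    donor.signalCommit();
    recipient.signalCommit();

    // Step 2: wait for both machines to finish.
    donor.getCompletionFuture().wait();
    recipient.getCompletionFuture().wait();
    assert(donor.isDone() && recipient.isDone());

    // Step 3: wait for the resulting writes to become majority-committed.
    waitForMajorityCommit();
}
```

Because the command drives the machines directly and waits on their completion futures, it never needs to register a shard version refresh of its own, which is the step that tripped the invariant.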

With the proposed _shardsvrCommitReshardCollection command, DonorStateMachine and RecipientStateMachine would additionally need to be changed to call CollectionShardingRuntime::clearFilteringMetadata() prior to releasing the critical section. This guarantees that, after the donor shard has dropped the original collection, a stale mongos is told to refresh its shard version instead of getting a response of "no documents". DonorStateMachine and RecipientStateMachine should additionally call onShardVersionMismatch() after releasing the critical section to eagerly refresh their shard version and learn of the new collection epoch before the first operation for the namespace being resharded comes in.
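
Why the ordering matters can be shown with a toy model of how a shard answers a router whose version predates resharding. Everything here is an assumption made for illustration (ShardModel, Reply, and the decision rules in answerStaleRouter); the real version checks in the server are considerably more involved.

```cpp
#include <cassert>

// Possible answers to a stale router in this toy model.
enum class Reply { Blocked, StaleRefresh, NoDocuments };

// Hypothetical shard state after the donor has dropped the original collection.
struct ShardModel {
    bool criticalSectionHeld = true;
    // True while the shard still holds the pre-resharding filtering metadata,
    // which happens to match what the stale router sends.
    bool filteringMetadataKnown = true;
};

inline Reply answerStaleRouter(const ShardModel& shard) {
    if (shard.criticalSectionHeld) {
        return Reply::Blocked;  // operations wait for the critical section
    }
    if (!shard.filteringMetadataKnown) {
        return Reply::StaleRefresh;  // the shard cannot validate the version
    }
    // The old metadata matches the router's old version, so the query runs
    // against the dropped collection and finds nothing.
    return Reply::NoDocuments;
}

// Releases the critical section while the old metadata is still installed.
inline Reply finishWithoutClearing() {
    ShardModel shard;
    shard.criticalSectionHeld = false;
    return answerStaleRouter(shard);
}

// Clears the filtering metadata first, then releases the critical section.
inline Reply finishWithClearing() {
    ShardModel shard;
    shard.filteringMetadataKnown = false;
    assert(shard.criticalSectionHeld);  // metadata is cleared while still held
    shard.criticalSectionHeld = false;
    return answerStaleRouter(shard);
}
```

In this model only the second ordering sends the stale router back to refresh; the eager onShardVersionMismatch() call described above would then repopulate the metadata with the new collection epoch.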



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 06/Aug/21 ]

Author:

{'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}

Message: SERVER-57953 Change existing participant machines to allow directly calling commit

(cherry picked from commit 7f54d6c00e647eac55e70debf4240a17f6eabb7a)

SERVER-57953 Thread calls to setFilteringMetadata inside the resharding participant machines

(cherry picked from commit a29714ffc0ae3b70242a3665121748da360686ba)

SERVER-57953 Call _shardsvrCommitReshardCollection command

(cherry picked from commit 1bbe9d4fba13374a7fe017bd4d5853c81ee39340)
Branch: v5.0
https://github.com/mongodb/mongo/commit/d2e7cbc02e618889f2d56bffbc021bbb78573d72

Comment by Githook User [ 08/Jul/21 ]

Author:

{'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}

Message: SERVER-57953 Call _shardsvrCommitReshardCollection command
Branch: master
https://github.com/mongodb/mongo/commit/1bbe9d4fba13374a7fe017bd4d5853c81ee39340

Comment by Githook User [ 06/Jul/21 ]

Author:

{'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}

Message: SERVER-57953 Thread calls to setFilteringMetadata inside the resharding participant machines
Branch: master
https://github.com/mongodb/mongo/commit/a29714ffc0ae3b70242a3665121748da360686ba

Comment by Githook User [ 28/Jun/21 ]

Author:

{'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}

Message: SERVER-57953 Change existing participant machines to allow directly calling commit
Branch: master
https://github.com/mongodb/mongo/commit/7f54d6c00e647eac55e70debf4240a17f6eabb7a

Generated at Thu Feb 08 05:43:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.