[SERVER-59719] shardsvr{Commit, Abort}ReshardCollection may return unrecoverable error on stepdown, leading to fassert() on config server Created: 01/Sep/21 Updated: 29/Oct/23 Resolved: 10/Nov/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 5.0.2 |
| Fix Version/s: | 5.2.0, 5.0.5, 5.1.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Luis Osta (Inactive) | Assignee: | Brett Nawrocki |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LFR-BUG | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Backport Requested: | v5.1, v5.0 |
||||||||||||||||
| Sprint: | Sharding 2021-09-06, Sharding 2021-10-04, Sharding 2021-10-18, Sharding 2021-11-01, Sharding 2021-11-15 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 151 | ||||||||||||||||
| Story Points: | 1 | ||||||||||||||||
| Description |
|
Background
If a stepdown occurs while the resharding operation is in progress, the opCtx doing the commit may be killed before the opCtx handling the command is. This means, for instance, that the ShardsvrCommitReshardCollectionCommand could reach the final uassert even though the Resharding(Recipient/Donor)Service was not able to finish committing (because it was interrupted).
Problem
Proposed Solution
Do a no-op write using doNoopWrite before performing the sanity check that asserts the state document has been deleted. This ensures that the operation has not been interrupted before asserting that no state documents are left. |
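The ordering in the proposed solution can be illustrated with a small self-contained model. This is only a sketch under stated assumptions, not the server code: `Node`, `serviceCommit`, `commitCommand`, `outcome`, and both exception types are hypothetical stand-ins, and the `doNoopWrite` here only models the assumed behavior that a no-op write fails with a retryable error once the node is no longer primary. The real helper's signature is not shown in this ticket.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical error types: one the caller may retry, one that models the
// unrecoverable failure of the final sanity check.
struct RetryableError : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct FatalError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical model of a shard: whether it is still primary and whether the
// resharding state document is still present.
struct Node {
    bool isPrimary = true;
    bool stateDocExists = true;
};

// Models the service's commit. If a stepdown interrupted it, the state
// document is left behind and the caller is not told about it.
void serviceCommit(Node& node) {
    if (!node.isPrimary) {
        return;  // interrupted: nothing was deleted
    }
    node.stateDocExists = false;
}

// Models the assumed behavior of a no-op write: it succeeds only on a primary.
void doNoopWrite(const Node& node) {
    if (!node.isPrimary) {
        throw RetryableError("not primary");
    }
}

// Models the command handler. With noopWriteFirst set, an interruption shows
// up as a retryable error before the sanity check has a chance to fail.
void commitCommand(Node& node, bool noopWriteFirst) {
    serviceCommit(node);
    if (noopWriteFirst) {
        doNoopWrite(node);
    }
    if (node.stateDocExists) {
        throw FatalError("state document still exists");
    }
}

// Runs one scenario and reports how the command ended.
std::string outcome(bool steppedDown, bool noopWriteFirst) {
    Node node;
    node.isPrimary = !steppedDown;
    try {
        commitCommand(node, noopWriteFirst);
        return "ok";
    } catch (const RetryableError&) {
        return "retryable";
    } catch (const FatalError&) {
        return "fatal";
    }
}
```

In this model, `outcome(true, false)` ends in the fatal sanity-check failure, while `outcome(true, true)` ends in a retryable error instead. Without a stepdown, both orderings succeed.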
| Comments |
| Comment by Githook User [ 11/Nov/21 ] |
|
Author: Brett Nawrocki (brett.nawrocki@mongodb.com, username: brettnawrocki)
Message: ShardsvrCommitReshardCollectionCommand's commit() call and (cherry picked from commit cca75006b85690faa641a15dfc9940d2a2add52d) |
| Comment by Githook User [ 11/Nov/21 ] |
|
Author: Brett Nawrocki (brett.nawrocki@mongodb.com, username: brettnawrocki)
Message: ShardsvrCommitReshardCollectionCommand's commit() call and (cherry picked from commit cca75006b85690faa641a15dfc9940d2a2add52d) |
| Comment by Githook User [ 10/Nov/21 ] |
|
Author: Brett Nawrocki (brett.nawrocki@mongodb.com, username: brettnawrocki)
Message: ShardsvrCommitReshardCollectionCommand's commit() call and |
| Comment by Luis Osta (Inactive) [ 07/Sep/21 ] |
|
dianna.hohensee After talking with max.hirschhorn, I think it's probably for the best if we leave supportsLockFreeRead alone and instead add a no-op write before the reads in:
|
| Comment by Dianna Hohensee (Inactive) [ 02/Sep/21 ] |
|
I haven't yet looked at what _alwaysInterruptAtStepDownOrUp does or how it works, but I have some initial thoughts. 1) One of the goals of the lock-free project was specifically to allow reads to run concurrently with stepdown/up instead of stalling. 2) We eventually want to move away entirely from locked reads, so I don't think falling back to them is a sustainable solution. |
| Comment by Luis Osta (Inactive) [ 02/Sep/21 ] |
|
dianna.hohensee henrik.edin What do y'all think about the proposed fix? |