[SERVER-61483] Resharding coordinator fails to recover abort decision on step-up, attempts to commit operation as success, leading to data inconsistency
| Created: | 15/Nov/21 | Updated: | 29/Oct/23 | Resolved: | 17/Nov/21 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 5.0.0, 5.1.0 |
| Fix Version/s: | 5.2.0, 5.0.5, 5.1.1 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Max Hirschhorn | Assignee: | Max Hirschhorn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam1 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v5.1, v5.0 |
| Sprint: | Sharding 2021-11-29 |
| Participants: | |
| Story Points: | 2 |
| Description |
|
The ReshardingCoordinator relies on an exception being thrown, and its .onError() handler being called, to trigger its _shardsvrAbortReshardCollection flow. On step-up, however, the ReshardingCoordinator fails to consult the current state of the coordinator document, so it does not trigger the _shardsvrAbortReshardCollection flow when an earlier config server primary had already decided that the resharding operation must abort. Because the .onError() handler is never called, the ReshardingCoordinator attempts to commit the resharding operation anyway. This is severely problematic because the resulting collection will be incomplete and inconsistent (i.e. lost writes).
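The failure mode described above can be illustrated with a minimal sketch. This is not the server's C++ code or the actual patch; the names CoordinatorState, CoordinatorDoc, next_action_buggy and next_action_fixed are hypothetical, and the state list is an illustrative subset. It only contrasts a decision driven solely by an in-memory error with one that also consults the persisted coordinator document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CoordinatorState(Enum):
    # Illustrative subset of states; not the server's actual enum.
    CLONING = "cloning"
    BLOCKING_WRITES = "blocking-writes"
    ABORTING = "aborting"


@dataclass
class CoordinatorDoc:
    # Hypothetical stand-in for the persisted coordinator document.
    state: CoordinatorState
    abort_reason: Optional[str] = None


def next_action_buggy(doc: CoordinatorDoc, error: Optional[Exception]) -> str:
    # Aborts only if an exception reached the error handler in this term.
    # An abort decision persisted by an earlier primary is never consulted.
    return "abort" if error is not None else "commit"


def next_action_fixed(doc: CoordinatorDoc, error: Optional[Exception]) -> str:
    # Checks the recovered document first, then falls back to the error path.
    if doc.state is CoordinatorState.ABORTING or error is not None:
        return "abort"
    return "commit"


# A new primary steps up and recovers a document already marked as aborting.
recovered = CoordinatorDoc(CoordinatorState.ABORTING, abort_reason="decided by earlier primary")
print(next_action_buggy(recovered, error=None))  # commit
print(next_action_fixed(recovered, error=None))  # abort
```

In the sketch, the buggy variant commits because no exception is raised in the new primary's term, which mirrors the missing .onError() call; the fixed variant treats the persisted aborting state as authoritative.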
Thank you to chuck.zhang for discovering this issue while working on the automation restore procedure (in which the config server is started up with the resharding operation in the aborting state). |
| Comments |
| Comment by Max Hirschhorn [ 17/Nov/21 ] |
|
The 5.0 backport was split into two commits to enable the changes from 963c540 as part of |
| Comment by Githook User [ 17/Nov/21 ] |
|
Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>
Message: (cherry picked from commit d9fcd9f124ece9ab0b3a3c46cb6d7052b7282dd2) |
| Comment by Githook User [ 17/Nov/21 ] |
|
Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>
Message: (cherry picked from commit d9fcd9f124ece9ab0b3a3c46cb6d7052b7282dd2) |
| Comment by Githook User [ 16/Nov/21 ] |
|
Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>
Message: (cherry picked from commit d9fcd9f124ece9ab0b3a3c46cb6d7052b7282dd2) |
| Comment by Githook User [ 15/Nov/21 ] |
|
Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>
Message: |