[SERVER-67650] Resharding recipient can return remainingOperationTimeEstimatedSecs=0 when the oplog applier hasn't caught up with the oplog fetcher Created: 29/Jun/22 Updated: 29/Oct/23 Resolved: 11/Aug/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 5.3.0, 5.0.0, 6.0.0-rc12 |
| Fix Version/s: | 5.0.13, 6.0.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Cheahuychou Mao | Assignee: | Andrew Witten (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam2 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Backport Requested: | v5.0 |
| Participants: | |||||||||||||||||
| Story Points: | 3 | ||||||||||||||||
| Description |
|
If a recipient receives a _shardsvrReshardingOperationTime command right after it has transitioned to the "applying" state (i.e. oplogEntriesApplied = 0), 'remainingOperationTimeEstimatedSecs' is calculated using 'bytesCopied', 'bytesToCopy', and the elapsed time of the "cloning" state. It turns out that the start time of the "cloning" state only gets initialized when a recipient transitions from the "create-collection" state to the "cloning" state. If the ReshardingRecipientService instance is created while the recipient is already in the "cloning" state (i.e. on restart or stepup), we skip the state transition, so the start time is left uninitialized. Consequently, 'remainingOperationTimeEstimatedSecs' would be 0, since the elapsed time for the "cloning" state would be 0. A further issue is that there isn't a mechanism for persisting the start time and recovering it on stepup. Returning remainingOperationTimeEstimatedSecs=0 would cause the coordinator to think that it can start the critical section, and the resharding operation would then fail with ReshardingCriticalSectionTimeout if the recipient doesn't manage to enter the "strict-consistency" state within the timeout. The same bug exists for the start time of the "applying" state. |
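The failure mode described above can be illustrated with a minimal Python sketch. It is not the server code: the proportional-extrapolation formula and the helper names `estimate_remaining_secs` and `cloning_elapsed_secs` are assumptions made for illustration. The only behavior taken from the description is that an uninitialized "cloning" start time yields an elapsed time of 0, which in turn yields an estimate of 0.

```python
from typing import Optional


def estimate_remaining_secs(elapsed_secs: float, bytes_copied: int,
                            bytes_to_copy: int) -> Optional[float]:
    """Extrapolate remaining time from progress so far.

    Assumed formula (not taken from the server source):
    remaining = elapsed * remaining_bytes / bytes_copied.
    """
    if bytes_copied <= 0:
        return None  # no progress yet, so no estimate can be made
    remaining_bytes = max(bytes_to_copy - bytes_copied, 0)
    return elapsed_secs * remaining_bytes / bytes_copied


def cloning_elapsed_secs(start_time: Optional[float], now: float) -> float:
    """Hypothetical stand-in for the metrics lookup.

    An uninitialized start time (None) contributes 0 elapsed seconds,
    mirroring the behavior described in the ticket.
    """
    if start_time is None:
        return 0.0
    return now - start_time


# Normal path: the start time was set on the transition into "cloning".
healthy = estimate_remaining_secs(cloning_elapsed_secs(0.0, 100.0), 50, 200)

# Restart/stepup path: the instance was created while already in "cloning",
# so the start time was never set and the estimate collapses to 0.
after_stepup = estimate_remaining_secs(cloning_elapsed_secs(None, 100.0), 50, 200)

print(healthy, after_stepup)
```

Under these assumptions, the second call reports 0 seconds remaining even though only a quarter of the data has been copied, which is the kind of value that would lead the coordinator to start the critical section too early.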
| Comments |
| Comment by Githook User [ 12/Sep/22 ] |
|
Author: {'name': 'Andrew Witten', 'email': 'andrew.witten@mongodb.com', 'username': 'awitten1'}Message: (cherry picked from commit bbc0bb5e9878e48b6bc3b666affbdf102b379450) fix comilation errors after cherry-pick |
| Comment by Andrew Witten (Inactive) [ 12/Sep/22 ] |
|
Code review: BACKPORT-13391 https://github.com/10gen/mongo/pull/7283 Base branch: v5.0. |
| Comment by Max Hirschhorn [ 01/Sep/22 ] |
|
Author: {'name': 'Andrew Witten', 'email': 'andrew.witten@mongodb.com', 'username': 'awitten1'}Message: |