[SERVER-68783] Recipient shard may incorrectly return 0 milliseconds remaining in resharding Created: 12/Aug/22 Updated: 29/Oct/23 Resolved: 01/Sep/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 6.1.0-rc2, 6.2.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Andrew Witten (Inactive) | Assignee: | Brett Nawrocki |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam1 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Minor Change |
| Operating System: | ALL |
| Backport Requested: | v6.1 |
| Sprint: | Sharding 2022-08-22, Sharding 2022-09-05 |
| Participants: | |
| Story Points: | 3 |
| Description |
|
When the resharding coordinator sends a _shardsvrReshardingOperationTime command (used to query the estimated remaining time of a resharding operation), the recipient shard calls ReshardingMetrics::getRecipientHighEstimateRemainingTimeMillis to compute the estimate. That function may incorrectly return 0 if the shard has just had a failover and has not yet restored all of its metrics. This can happen because the metrics are restored in two separate places in the code rather than all at once.
As a result, if a _shardsvrReshardingOperationTime command enters the system at the wrong time, it may observe only partly restored metrics, and the coordinator would be misled into believing that it can begin the critical section.
This is related to |
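The race described above can be illustrated with a minimal sketch. It is not the server's code: `RecipientMetricsSketch`, its two restoration flags, and `highEstimateRemaining` are hypothetical names, and `std::optional` with `std::chrono::milliseconds` stand in for the server's own types. The sketch only shows one way an estimate could decline to answer while restoration is incomplete, instead of reporting 0.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical stand-in for the recipient's metrics; not the real ReshardingMetrics.
struct RecipientMetricsSketch {
    bool firstPartRestored = false;   // hypothetical: set by the first restoration site
    bool secondPartRestored = false;  // hypothetical: set by the second restoration site
    Milliseconds remainingWork{0};    // default value before restoration completes

    // Returns no value while restoration is incomplete, so a caller cannot
    // mistake "not yet known" for "0 milliseconds remaining".
    std::optional<Milliseconds> highEstimateRemaining() const {
        if (!firstPartRestored || !secondPartRestored) {
            return std::nullopt;
        }
        return remainingWork;
    }
};
```

With only one flag set, `highEstimateRemaining()` yields an empty optional; once both are set it yields the stored estimate, including a genuine 0.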
| Comments |
| Comment by Githook User [ 13/Sep/22 ] |
|
Author: {'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}
Message: (cherry picked from commit 54dfa66ba84af002a0b43d2b7f49e0a8119f6c55) |
| Comment by Githook User [ 01/Sep/22 ] |
|
Author: {'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}Message: |
| Comment by Andrew Witten (Inactive) [ 12/Aug/22 ] |
|
I don't think we should return Milliseconds(0) from this function, because 0 is a valid value for remainingMillis and the coordinator will interpret it as no time remaining.
The question is this: when we return boost::none from this call, did we mean to return Milliseconds(0) (that is currently the effect of the code)? The failover case is a case where we did not mean to return 0. If there are other cases where we did not mean to return 0, those are bugs as currently implemented. |
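The distinction drawn in this comment can be sketched as follows. This is an illustration under assumptions, not the coordinator's actual logic: both function names and the `threshold` parameter are hypothetical, and `std::optional` stands in for `boost::optional` (so `std::nullopt` plays the role of `boost::none`).

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical check that collapses "no estimate" into 0 milliseconds.
// An absent estimate then looks identical to "nothing left to do".
bool mayEnterCriticalSectionLossy(std::optional<Milliseconds> estimate,
                                  Milliseconds threshold) {
    return estimate.value_or(Milliseconds(0)) <= threshold;
}

// Hypothetical check that keeps "no estimate" distinct from 0:
// the critical section is allowed only when an estimate is actually present.
bool mayEnterCriticalSectionStrict(std::optional<Milliseconds> estimate,
                                   Milliseconds threshold) {
    return estimate.has_value() && *estimate <= threshold;
}
```

Given an empty estimate, the lossy variant permits the critical section while the strict variant does not; a real estimate of 0 passes both.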