[SERVER-68783] Recipient shard may incorrectly return 0 milliseconds remaining in resharding Created: 12/Aug/22  Updated: 29/Oct/23  Resolved: 01/Sep/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 6.1.0-rc2, 6.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Andrew Witten (Inactive) Assignee: Brett Nawrocki
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by COMPASS-6094 Investigate changes in SERVER-68783: ... Closed
Documented
is documented by DOCS-15602 [Server] Investigate changes in SERVE... Closed
Related
related to SERVER-67653 Resharding coordinator can incorrectl... Closed
related to SERVER-70079 remove optional_util::setOrAdd Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Backport Requested:
v6.1
Sprint: Sharding 2022-08-22, Sharding 2022-09-05
Participants:
Story Points: 3

 Description   

In response to a _shardsvrReshardingOperationTime command (which the resharding coordinator uses to query the estimated time remaining in a resharding operation), a recipient shard calls ReshardingMetrics::getRecipientHighEstimateRemainingTimeMillis to compute the estimate. That function may incorrectly return 0 if the shard has just had a failover and has not yet restored all of its metrics. This can happen because the metrics are restored in two separate places rather than in a single step.

As a result, a _shardsvrReshardingOperationTime command that arrives while the metrics are only partly restored may observe them in that state, and the resulting 0 estimate would mislead the coordinator into believing that it can begin the critical section.

This is related to SERVER-67653, but is not the same because in that ticket the coordinator incorrectly treats an omitted remainingMillis field as 0 remainingMillis.  In this ticket, the recipient incorrectly returns 0 remainingMillis.
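
The partial-restore window described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: RecipientMetricsSketch, naiveRemaining, restoreStepOne and restoreStepTwo are hypothetical names, and the proportional formula is only a stand-in for whatever ReshardingMetrics actually computes. The point is only that an estimator which falls back to 0 on missing inputs reports 0 between the two restore steps.

```cpp
#include <chrono>
#include <cstdint>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical stand-in for a recipient's in-memory metrics.
// This is NOT the real ReshardingMetrics class.
struct RecipientMetricsSketch {
    std::int64_t approxBytesToCopy = 0;  // assumed to be restored in step one
    std::int64_t bytesCopied = 0;        // assumed to be restored in step two
    Milliseconds copyElapsed{0};         // assumed to be restored in step two
};

// Assumed estimator: remaining = elapsed * remainingBytes / copiedBytes.
// When the inputs are missing it falls back to 0, which is the problem.
Milliseconds naiveRemaining(const RecipientMetricsSketch& m) {
    if (m.bytesCopied <= 0) {
        return Milliseconds{0};  // "no data" is reported as "no time left"
    }
    const std::int64_t remainingBytes = m.approxBytesToCopy - m.bytesCopied;
    if (remainingBytes <= 0) {
        return Milliseconds{0};  // genuinely finished
    }
    return Milliseconds{m.copyElapsed.count() * remainingBytes / m.bytesCopied};
}

// Two separate restore steps after a failover (values are made up).
void restoreStepOne(RecipientMetricsSketch& m) {
    m.approxBytesToCopy = 1000;
}

void restoreStepTwo(RecipientMetricsSketch& m) {
    m.bytesCopied = 250;
    m.copyElapsed = Milliseconds{5000};
}
```

In this sketch, a query that lands after restoreStepOne but before restoreStepTwo gets 0, while the same query after both steps gets a non-zero estimate.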



 Comments   
Comment by Githook User [ 13/Sep/22 ]

Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki)

Message: SERVER-68783 Disambiguate 0 time estimate from no estimate in resharding

(cherry picked from commit 54dfa66ba84af002a0b43d2b7f49e0a8119f6c55)
Branch: v6.1
https://github.com/mongodb/mongo/commit/f0aa22b7dd7277b5d2546574e6d9ac4fd27cc7b4

Comment by Githook User [ 01/Sep/22 ]

Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki)

Message: SERVER-68783 Disambiguate 0 time estimate from no estimate in resharding
Branch: master
https://github.com/mongodb/mongo/commit/54dfa66ba84af002a0b43d2b7f49e0a8119f6c55

Comment by Andrew Witten (Inactive) [ 12/Aug/22 ]

I don't think we should return Milliseconds(0) from this function, because 0 is a valid value for remainingMillis and the coordinator will take it at face value, as no time remaining.

The question is this: when we return boost::none from this call, did we mean to return Milliseconds(0) (that is currently the effect of the code)?  The failover case is a case where we did not mean to return 0.  If there are other cases where we did not mean to return 0, those are bugs as currently implemented.
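
One way to picture the distinction being asked about is a sketch in which "no estimate" and "0 remaining" stay separate all the way to the coordinator. This is an illustration only, not the committed fix: it uses std::optional in place of boost::optional, and estimateRemaining, canEnterCriticalSection, metricsFullyRestored and threshold are hypothetical names.

```cpp
#include <chrono>
#include <optional>
#include <vector>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical recipient-side helper: report "no estimate" (nullopt) while
// metrics are still being restored, instead of collapsing it to 0.
std::optional<Milliseconds> estimateRemaining(bool metricsFullyRestored,
                                              Milliseconds computed) {
    if (!metricsFullyRestored) {
        return std::nullopt;  // unknown, which is different from 0
    }
    return computed;  // may legitimately be 0
}

// Hypothetical coordinator-side check: proceed only if every recipient
// reported an actual estimate and each estimate is within the threshold.
bool canEnterCriticalSection(
    const std::vector<std::optional<Milliseconds>>& replies,
    Milliseconds threshold) {
    if (replies.empty()) {
        return false;
    }
    for (const auto& reply : replies) {
        if (!reply.has_value() || *reply > threshold) {
            return false;  // a missing estimate blocks the transition
        }
    }
    return true;
}
```

Under these assumptions, a recipient that has just failed over contributes std::nullopt, and the coordinator keeps waiting rather than treating the reply as 0.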
