Core Server / SERVER-67653

Resharding coordinator can incorrectly conclude that it can start the critical section although on one recipient the oplog applier hasn't caught up with the oplog fetcher

    • Fully Compatible
    • ALL
    • v6.1, v6.0, v5.0
    • 3

      In CoordinatorCommitMonitor::queryRemainingOperationTimeForRecipients, the resharding coordinator sends the _shardsvrReshardingOperationTime command to all recipient shards to obtain each one's estimate of the time remaining in the active resharding operation.

       

      When a recipient shard handles that command, it calls ReshardingMetrics::getOperationRemainingTime, which can return boost::none. In that case, the "remainingMillis" field is omitted from the recipient's response to the coordinator.
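To make the recipient-side behavior concrete, here is a minimal sketch under stated assumptions. `Reply` and `buildOperationTimeReply` are hypothetical names, a `std::map` stands in for the real BSON reply, `std::optional` stands in for `boost::optional`, and "remainingMillis" is one of the field names this description uses. It is not the server's handler.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <optional>
#include <string>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical stand-in for the recipient's reply document.
using Reply = std::map<std::string, long long>;

// Models the recipient side: the field is written only when an estimate exists.
Reply buildOperationTimeReply(std::optional<Milliseconds> remaining) {
    Reply reply;
    if (remaining) {
        reply["remainingMillis"] = remaining->count();
    }
    return reply;
}

// Usage: no estimate yields a reply without the field at all.
inline void demoOmittedField() {
    const Reply reply = buildOperationTimeReply(std::nullopt);
    assert(reply.count("remainingMillis") == 0);
}
```

In this sketch the coordinator cannot tell "no estimate yet" from "nothing left to do" by looking at the reply alone, since both produce a reply without the field.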

       

      If every recipient omits this field, the if statements that update the minimum and maximum are never entered. When the coordinator then reads the maximum remaining time, remainingTimes.max is still 0 and remainingTimes.min is still Milliseconds::max, the values they were initialized with. In particular, the invariant in this code path would fail.
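The aggregation described above can be sketched as follows. This is an illustration only: `RemainingTimes` and `aggregateRemainingTimes` are hypothetical stand-ins rather than the server's code, and `std::optional` replaces `boost::optional` so the example is self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <optional>
#include <vector>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical stand-in for the coordinator's min/max pair,
// initialized the way the description states.
struct RemainingTimes {
    Milliseconds min = Milliseconds::max();
    Milliseconds max = Milliseconds{0};
};

// Each element models one recipient's response; std::nullopt models
// a response in which the remaining-time field was omitted.
RemainingTimes aggregateRemainingTimes(
    const std::vector<std::optional<Milliseconds>>& responses) {
    RemainingTimes times;
    for (const auto& remaining : responses) {
        if (remaining) {  // skipped entirely when the field is missing
            times.min = std::min(times.min, *remaining);
            times.max = std::max(times.max, *remaining);
        }
    }
    return times;
}

// Usage: when every recipient omits the field, both values keep their
// initial state, so max reads as 0 and min is greater than max.
inline void demoAllOmitted() {
    const std::vector<std::optional<Milliseconds>> none{std::nullopt, std::nullopt};
    const RemainingTimes times = aggregateRemainingTimes(none);
    assert(times.max == Milliseconds{0});
    assert(times.min == Milliseconds::max());
}
```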

       

      The effect is that a recipient omitting the "remainingMillis" field is equivalent to it returning "remainingMillis: 0". This is a bug in at least one case: when _shardsvrReshardingOperationTime is run against a recipient shard that has not yet restored its metrics (during a step-up).

       

      As a result, the coordinator, believing that remainingMillis is under the threshold for all recipients, would begin the critical section prematurely. The resharding operation would then fail with ReshardingCriticalSectionTimeout if the recipient above does not reach the "strict-consistency" state within the timeout.
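The decision problem can be illustrated with a small sketch. Everything here is hypothetical: the function names, the strict `<` comparison against the threshold, and the guarded variant, which is only one possible way to treat a missing estimate as "not ready" and is not presented as the fix committed for this ticket.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <optional>
#include <vector>

using Milliseconds = std::chrono::milliseconds;
using Responses = std::vector<std::optional<Milliseconds>>;

// Behavior as described in the ticket: omitted estimates are skipped,
// so a recipient that reports nothing contributes an effective 0.
bool canStartCriticalSectionAsDescribed(const Responses& responses,
                                        Milliseconds threshold) {
    Milliseconds maxRemaining{0};
    for (const auto& remaining : responses) {
        if (remaining) {
            maxRemaining = std::max(maxRemaining, *remaining);
        }
    }
    return maxRemaining < threshold;
}

// One possible guard (an assumption): any missing estimate, or no
// responses at all, means the coordinator should keep waiting.
bool canStartCriticalSectionGuarded(const Responses& responses,
                                    Milliseconds threshold) {
    if (responses.empty()) {
        return false;
    }
    Milliseconds maxRemaining{0};
    for (const auto& remaining : responses) {
        if (!remaining) {
            return false;
        }
        maxRemaining = std::max(maxRemaining, *remaining);
    }
    return maxRemaining < threshold;
}

// Usage: one recipient is nearly done, the other has not restored
// its metrics yet and therefore omits its estimate.
inline void demoRecoveringRecipient() {
    const Responses responses{Milliseconds{1000}, std::nullopt};
    const Milliseconds threshold{5000};
    assert(canStartCriticalSectionAsDescribed(responses, threshold));  // premature
    assert(!canStartCriticalSectionGuarded(responses, threshold));     // keeps waiting
}
```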

            Assignee:
            andrew.witten@mongodb.com Andrew Witten (Inactive)
            Reporter:
            cheahuychou.mao@mongodb.com Cheahuychou Mao