  Core Server / SERVER-67650

Resharding recipient can return remainingOperationTimeEstimatedSecs=0 when the oplog applier hasn't caught up with the oplog fetcher

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 5.0.13, 6.0.2
    • Affects Version/s: 5.3.0, 5.0.0, 6.0.0-rc12
    • Component/s: None
    • Fully Compatible
    • ALL
    • v5.0
    • 3

      If a recipient receives a _shardsvrReshardingOperationTime command right after it has transitioned to the "applying" state (i.e. oplogEntriesApplied = 0), 'remainingOperationTimeEstimatedSecs' would be calculated using 'bytesCopied' and 'bytesToCopy', and the elapsed time of the "cloning" state.
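
A minimal sketch of how such an estimate could be derived from 'bytesCopied', 'bytesToCopy', and the elapsed time of the "cloning" state. The function name and the linear-extrapolation formula are assumptions made for illustration; they are not taken from the server code.

```python
from typing import Optional


def estimate_remaining_secs(
    elapsed_secs: float, bytes_copied: int, bytes_to_copy: int
) -> Optional[float]:
    """Hypothetical helper: linearly extrapolate the remaining time.

    Assumes the copy rate observed so far (bytes_copied / elapsed_secs)
    holds for the bytes that are still outstanding.
    """
    if bytes_copied <= 0 or bytes_to_copy <= 0:
        return None  # no progress recorded, so no estimate can be made
    remaining_bytes = max(bytes_to_copy - bytes_copied, 0)
    return elapsed_secs * remaining_bytes / bytes_copied


# A quarter of the data copied in 60 s: three more quarters remain.
print(estimate_remaining_secs(60.0, 250, 1000))
# With an elapsed time of 0 the estimate collapses to 0,
# no matter how much data is still outstanding.
print(estimate_remaining_secs(0.0, 250, 1000))
```

Under this assumed formula the estimate is proportional to the elapsed time, so an elapsed time of 0 always produces an estimate of 0.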

      It turns out that the start time of the "cloning" state only gets initialized when a recipient transitions from the "create-collection" state to the "cloning" state. If the ReshardingRecipientService instance is created while the recipient is already in the "cloning" state (i.e. on restart or stepup), we would skip the state transition, so the start time would be uninitialized. Consequently, 'remainingOperationTimeEstimatedSecs' would be 0 since the elapsed time for the "cloning" state would be 0. The issue here is also that there isn't a mechanism for persisting the start time and recovering it on stepup. Returning remainingOperationTimeEstimatedSecs=0 would cause the coordinator to think that it can start the critical section, and the resharding operation would then fail with ReshardingCriticalSectionTimeout if the recipient doesn't manage to enter the "strict-consistency" state within the timeout.
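
The failure mode can be sketched with a toy model. The class, its fields, and the `persisted_cloning_start` parameter below are hypothetical and only mirror the behaviour described above; they are not the ReshardingRecipientService implementation, and the persisted start time is shown only as one possible remedy.

```python
import time
from typing import Callable, Optional


class FakeClock:
    """Controllable clock so the example is deterministic."""

    def __init__(self, now: float) -> None:
        self.now = now

    def __call__(self) -> float:
        return self.now


class RecipientMetrics:
    """Hypothetical stand-in for the recipient's in-memory timing state."""

    def __init__(
        self,
        state: str,
        clock: Callable[[], float] = time.time,
        persisted_cloning_start: Optional[float] = None,  # assumed remedy
    ) -> None:
        self.state = state
        self._clock = clock
        # Without a persisted value, an instance created directly in the
        # "cloning" state has no start time at all.
        self.cloning_start = persisted_cloning_start

    def transition_to(self, new_state: str) -> None:
        # The start time is recorded only as part of the state transition.
        if new_state == "cloning":
            self.cloning_start = self._clock()
        self.state = new_state

    def cloning_elapsed_secs(self) -> float:
        if self.cloning_start is None:
            return 0.0  # uninitialized start time reads as "no time elapsed"
        return self._clock() - self.cloning_start


clock = FakeClock(100.0)

# Normal path: the transition into "cloning" records the start time.
fresh = RecipientMetrics("create-collection", clock)
fresh.transition_to("cloning")

clock.now = 160.0

# Restart or stepup: the instance is created already in "cloning".
recovered = RecipientMetrics("cloning", clock)

# Assumed remedy: recover a start time that was persisted earlier.
restored = RecipientMetrics("cloning", clock, persisted_cloning_start=100.0)

print(fresh.cloning_elapsed_secs())      # elapsed time since the transition
print(recovered.cloning_elapsed_secs())  # 0.0, which drives the estimate to 0
print(restored.cloning_elapsed_secs())   # matches the normal path
```

In this model only the recovered instance reports an elapsed time of 0, which is the condition that leads to remainingOperationTimeEstimatedSecs=0.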

      The same bug exists for the start time for the "applying" state. 

            andrew.witten@mongodb.com Andrew Witten (Inactive)
            cheahuychou.mao@mongodb.com Cheahuychou Mao