Resharding hang: prevent critical section from never engaging due to oplogEntriesFetched miscount


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • ALL

      There is a resharding bug where the critical section can never be engaged because the in-memory oplogEntriesFetched count is larger than the actual number of oplog entries in the resharding oplog buffer (the two are expected to match). This happens through the following mechanism:

      1. The fetcher increments the oplogEntriesFetched count before inserting an oplog entry batch into the buffer collection.
      2. The insert into the buffer can fail for a variety of reasons, such as a WriteConflict exception.
      3. The fetcher retries inserting the oplog entry batch, but does not roll back the count it erroneously added. This causes the oplogEntriesFetched count to be permanently higher than the number of entries actually in the buffer.
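
      The mechanism above can be sketched as a small simulation. This is a toy model, not the server code: `Buffer`, `WriteConflict`, `fetch_buggy`, and `fetch_fixed` are hypothetical names, and it assumes the count is added again on each retry attempt. Counting only after a successful insert is one possible way to avoid the drift, not necessarily the fix that was adopted.

```python
class WriteConflict(Exception):
    """Hypothetical stand-in for a transient insert failure."""


class Buffer:
    """Toy oplog buffer whose first `failures` inserts raise WriteConflict."""

    def __init__(self, failures):
        self.failures = failures
        self.entries = []

    def insert(self, batch):
        if self.failures > 0:
            self.failures -= 1
            raise WriteConflict()
        self.entries.extend(batch)


def fetch_buggy(buffer, batch):
    """Increment the count before each insert attempt (steps 1 to 3)."""
    fetched = 0
    while True:
        fetched += len(batch)  # counted even if the insert below fails
        try:
            buffer.insert(batch)
            return fetched
        except WriteConflict:
            continue  # retry without undoing the increment


def fetch_fixed(buffer, batch):
    """Increment the count only after the insert has succeeded."""
    fetched = 0
    while True:
        try:
            buffer.insert(batch)
        except WriteConflict:
            continue  # nothing was counted, so nothing to undo
        fetched += len(batch)
        return fetched


batch = ["op1", "op2", "op3"]

buggy_buffer = Buffer(failures=1)
buggy_count = fetch_buggy(buggy_buffer, batch)

fixed_buffer = Buffer(failures=1)
fixed_count = fetch_fixed(fixed_buffer, batch)

print("buggy:", buggy_count, "counted,", len(buggy_buffer.entries), "buffered")
print("fixed:", fixed_count, "counted,", len(fixed_buffer.entries), "buffered")
```

      With one failed insert, the buggy variant counts the batch twice while the buffer holds it once, so the gap never closes.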

      If oplogEntriesFetched is permanently higher than what exists in the buffer, then oplogEntriesApplied can never catch up (since oplogEntriesApplied reflects how many buffered entries have been applied and is therefore bounded by the number of entries in the oplog buffer).

      When the oplog application phase runs long enough, this incorrect "remaining work" (fetched - applied) can push the estimated time to apply the remaining work above the critical section entry threshold (estimatedTime = timeInApplying * (fetched / applied - 1)), so the critical section is never engaged.
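
      A worked example of the estimate formula, as a rough sketch: the function names, the entry counts, and the 5-second threshold are all made up for illustration and are not taken from the server. It shows that a fixed overcount of 10 entries produces an estimate that grows linearly with timeInApplying, even after every buffered entry has been applied.

```python
def estimated_remaining_secs(time_in_applying_secs, fetched, applied):
    """estimatedTime = timeInApplying * (fetched / applied - 1)."""
    return time_in_applying_secs * (fetched / applied - 1)


def may_enter_critical_section(estimate_secs, threshold_secs):
    """The critical section is entered only once the estimate is under the threshold."""
    return estimate_secs <= threshold_secs


THRESHOLD_SECS = 5.0  # hypothetical threshold, for illustration only

buffered = 1000          # entries actually in the oplog buffer
applied = buffered       # the applier has caught up with the buffer
fetched = buffered + 10  # counter inflated by retried inserts

for time_in_applying in (100, 3600, 36000):
    estimate = estimated_remaining_secs(time_in_applying, fetched, applied)
    allowed = may_enter_critical_section(estimate, THRESHOLD_SECS)
    print(f"applying for {time_in_applying}s: "
          f"estimate ~{estimate:.1f}s, may enter critical section: {allowed}")
```

      Because applied is capped at the buffer size, the ratio fetched / applied stays above 1, and the estimate only increases the longer the application phase runs.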

            Assignee:
            Unassigned
            Reporter:
            Wenqin Ye
