ReshardingOplogFetcher Will Not Be Restarted If Failing Over While Building Indexes


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.3.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • Fully Compatible
    • ALL
    • ClusterScalability Jul21-Aug3, ClusterScalability Aug4-18

      The resharding oplog fetchers are started in _ensureDataReplicationStarted, which is called only in the cloning and applying phases. SPM-2322 added a new building-index phase between cloning and applying, but that phase does not call _ensureDataReplicationStarted because index building does not itself depend on data replication. As a result, if a failover occurs while indexes are being built, we may go a long time (as long as the index builds take) without any oplog fetcher running at all. When the oplog fetcher is finally started as part of the transition to applying, it has to resume scanning the oplog from a point far in the past, which can cause the fetcher to stall.
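
The gap described above can be illustrated with a toy model. This is only a sketch, not server code: the Phase enum, the Recipient class, and both phase sets are hypothetical names introduced here, and the second phase set shows one possible remedy assumed for illustration, not necessarily the fix that was committed.

```python
from enum import Enum, auto


class Phase(Enum):
    CLONING = auto()
    BUILDING_INDEX = auto()
    APPLYING = auto()


# Hypothetical model of which phases call _ensureDataReplicationStarted.
# As reported in this ticket: only cloning and applying do.
PHASES_STARTING_FETCHERS_AS_REPORTED = {Phase.CLONING, Phase.APPLYING}

# Assumption for illustration only: the building-index phase also calls it.
PHASES_STARTING_FETCHERS_WITH_INDEX_PHASE = {
    Phase.CLONING,
    Phase.BUILDING_INDEX,
    Phase.APPLYING,
}


class Recipient:
    """Toy stand-in for a resharding recipient resuming after a failover."""

    def __init__(self, resumed_phase, phases_starting_fetchers):
        self.phase = resumed_phase
        self.phases_starting_fetchers = phases_starting_fetchers
        # In-memory fetchers do not survive a failover, so start with none.
        self.fetcher_running = False

    def run_current_phase(self):
        # The fetcher is started only if this phase ensures data replication.
        if self.phase in self.phases_starting_fetchers:
            self.fetcher_running = True


def fetcher_running_after_failover(resumed_phase, phases_starting_fetchers):
    recipient = Recipient(resumed_phase, phases_starting_fetchers)
    recipient.run_current_phase()
    return recipient.fetcher_running
```

In this model, resuming in BUILDING_INDEX with the first phase set leaves fetcher_running false until the recipient moves on to APPLYING, which matches the window described in this ticket.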

      In the worst case, these stalls can be severe enough to mislead the legacy time estimation algorithm, which compares fetched and applied oplog entries, into concluding that we are caught up, which in turn leads to a critical section timeout. An example of this can be seen in HELP-78622.
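
To show why a stalled fetcher can look like being caught up, here is a deliberately simplified sketch. It is not the server's estimation formula: looks_caught_up and true_backlog are hypothetical helpers, and the entry counts are made-up inputs. The only assumption carried over from the ticket is that the legacy estimate compares what was fetched against what was applied.

```python
def looks_caught_up(entries_fetched, entries_applied):
    """Toy version of a fetched-vs-applied comparison (hypothetical)."""
    # If everything fetched so far has been applied, nothing appears to remain.
    return entries_applied >= entries_fetched


def true_backlog(entries_written_on_donor, entries_applied):
    """Entries the recipient still has to apply, including unfetched ones."""
    return entries_written_on_donor - entries_applied


# Made-up numbers: the fetcher stalled early while the donor kept writing.
entries_written_on_donor = 10_000
entries_fetched = 100
entries_applied = 100

stalled_looks_caught_up = looks_caught_up(entries_fetched, entries_applied)
stalled_backlog = true_backlog(entries_written_on_donor, entries_applied)
```

With these inputs the comparison reports that the recipient is caught up, while 9,900 entries are still outstanding. An estimate that only looks at fetched versus applied cannot see entries the fetcher has not reached yet.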

            Assignee:
            Brett Nawrocki
            Reporter:
            Brett Nawrocki
            Votes:
            0
            Watchers:
            8