Resharding Recipient That Owns No Data Never Marks Progress When Cloning

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • ALL

      Resharding recipients persist the resume token for the $natural scan on the source collection in the same transaction that writes the latest batch returned by the aggregation. That token is what allows the scan to be resumed.

      However, we obtain these batches by running getMore commands, which return only once the batch size is reached.

      In the case where we added a database's primary shard as a recipient despite it owning no data (see SERVER-54279 for why we do this), that recipient's getMore never finds any matching documents, never returns, and therefore never marks progress. If a failover then occurs on that recipient, or one of the donors it is reading from is restarted, the aggregation fails and must be restarted from the beginning. This can block resharding from making forward progress, especially if the source collection is very large and the collection scan takes a long time.
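
      To illustrate the failure mode, here is a minimal toy model in Python. It is not the server's cloning code: `Recipient`, `scan_until_failover`, and the list-based `source` are hypothetical names invented for this sketch, and it assumes a batch (with its resume token) is committed only when the batch fills up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recipient:
    """Hypothetical stand-in for a resharding recipient's durable state."""
    shard: str
    resume_token: Optional[int] = None  # persisted together with each batch
    docs_written: int = 0


def scan_until_failover(recipient: Recipient, source: List[str],
                        batch_size: int, docs_before_failover: int) -> int:
    """Scan `source` from the last persisted token until a simulated failover.

    Each entry of `source` names the shard that will own that document.
    Progress is recorded only when a full batch is returned, mirroring a
    getMore that does not return until the batch size is reached.
    Returns the position the scan had reached when the failover hit.
    """
    pos = recipient.resume_token or 0
    batch: List[str] = []
    scanned = 0
    while pos < len(source) and scanned < docs_before_failover:
        if source[pos] == recipient.shard:
            batch.append(source[pos])
        pos += 1
        scanned += 1
        if len(batch) == batch_size:
            # Batch and resume token are committed together.
            recipient.docs_written += len(batch)
            recipient.resume_token = pos
            batch = []
    # Anything still in `batch` is lost: the failover hit before it was full.
    return pos


# Every document belongs to shard "A"; shard "B" is a recipient with no data.
source = ["A"] * 1000
owner = Recipient("A")
empty = Recipient("B")

owner_reached = scan_until_failover(owner, source, batch_size=100,
                                    docs_before_failover=950)
empty_reached = scan_until_failover(empty, source, batch_size=100,
                                    docs_before_failover=950)
```

      Both scans reach the same position before the failover, but only the recipient that owns data has a persisted token to resume from. The recipient that owns nothing still has no token, so its next attempt starts again from the beginning of the collection.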

            Assignee:
            Unassigned
            Reporter:
            Brett Nawrocki