Resharding hangs if there is a repl state change when establishing participants

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • ALL
    • 200

      To establish participants for resharding, the resharding coordinator sends the FlushRoutingTableCacheUpdates command to all donors and recipients. If a resharding state machine does not exist, we eventually attempt to insert a new state document, but we swallow any NotPrimary errors from that insert.

      In a recent BF, the insert of the new state document failed because it threw InterruptedDueToReplStateChange. The FlushRoutingTableCacheUpdates command still succeeded, so the recipient shard was never established as a participant. As a result, resharding hangs and the test times out.
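
      The failure mode above can be illustrated with a small toy model. This is only a sketch and not the server code: the Shard class, the ReplStateChangeError type, and the establish_participants helper are hypothetical names, and the sketch simply assumes that the insert failure is not surfaced in the command reply (the ticket does not say exactly where the error is dropped).

```python
class ReplStateChangeError(Exception):
    """Hypothetical stand-in for InterruptedDueToReplStateChange."""


class Shard:
    """Toy shard that may fail to persist its resharding state document."""

    def __init__(self, name, fail_insert=False):
        self.name = name
        self.fail_insert = fail_insert
        self.state_docs = []  # state documents that were actually persisted

    def _insert_state_document(self, doc):
        # Simulate a replication state change interrupting the insert.
        if self.fail_insert:
            raise ReplStateChangeError(self.name)
        self.state_docs.append(doc)

    def flush_routing_table_cache_updates(self, doc):
        """Toy flush: the insert failure is not propagated to the caller."""
        try:
            self._insert_state_document(doc)
        except ReplStateChangeError:
            pass  # assumption: the error never reaches the command reply
        return {"ok": 1}


def establish_participants(shards, doc):
    """Send the flush to every shard; return shards left without a state document."""
    replies = [s.flush_routing_table_cache_updates(doc) for s in shards]
    assert all(r["ok"] == 1 for r in replies)  # every command reports success
    return [s.name for s in shards if not s.state_docs]


shards = [Shard("donor0"), Shard("recipient0", fail_insert=True)]
missing = establish_participants(shards, {"reshardingUUID": "example"})
print(missing)
```

      In this model every reply reports success, yet recipient0 has no state document, so a coordinator that only looks at the replies has no signal that it should retry and would wait indefinitely.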

      This could be resolved by SPM-4126, but we should consider whether it is worth fixing before we start that project.

            Assignee:
            Unassigned
            Reporter:
            Ben Gawel
            Votes:
            0
            Watchers:
            1

              Created:
              Updated: