Fix resharding hang when FlushReshardingStateChangeCmd fails


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.3.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • None
    • Cluster Scalability
    • Fully Compatible
    • ALL
      See the changes here in the flush_resharding_state_change_command.cpp and the concurrency_sharded_replication_with_balancer_and_config_transitions.yml files.
    • ClusterScalability Jul21-Aug3
    • None
    • 3
    • TBD

      If FlushReshardingStateChangeCmd fails due to a write concern timeout and never refreshes the resharding state, it logs the failure but does not return a status indicating that it failed.

      Because the resharding coordinator never sees the failure, it cannot retry the command, and resharding hangs: if the command is called during cloning, the resharding participants cannot make progress and the recipients are never established.

      See this patch build for a reproducer and logs that show this failure mode.

      The easiest fix is to have this command return a status instead of being void.
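
      A minimal sketch of why the return type matters, assuming simplified stand-ins rather than the server's real types: the Status struct, flushVoid, flushWithStatus, and retryUntilOk below are hypothetical names for illustration and are not taken from flush_resharding_state_change_command.cpp. With a void command the caller has nothing to act on; with a returned status, a coordinator-style loop can retry until the refresh succeeds.

```cpp
#include <functional>
#include <iostream>
#include <string>

// Hypothetical stand-in for a status type; not MongoDB's actual Status class.
struct Status {
    bool ok;
    std::string reason;
};

// Old shape (illustrative): the failure is logged, but the caller never learns of it.
void flushVoid(const std::function<bool()>& refresh) {
    if (!refresh()) {
        std::cerr << "flush failed: resharding state was not refreshed\n";
    }
}

// Proposed shape (illustrative): the failure is returned to the caller.
Status flushWithStatus(const std::function<bool()>& refresh) {
    if (!refresh()) {
        return {false, "write concern timeout"};
    }
    return {true, ""};
}

// Hypothetical coordinator-side loop: retries while the command reports failure.
// Returns the attempt number that succeeded, or -1 if every attempt failed.
int retryUntilOk(const std::function<Status()>& cmd, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (cmd().ok) {
            return attempt;
        }
    }
    return -1;
}
```

      In this sketch, passing flushVoid to a retry loop is not even possible because it yields no result to test, which mirrors the hang described above: the coordinator has no signal telling it to retry.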

              Assignee:
              Cheahuychou Mao
              Reporter:
              Ben Gawel
              Votes:
              0
              Watchers:
              5

                Created:
                Updated:
                Resolved: