Core Server / SERVER-42602

Guarantee that unconditional step down will not happen due to slow node restarts in rollback_fuzzer_[un]clean_shutdowns suites.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2.1, 4.3.1
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.2
    • Sprint:
      Repl 2019-08-26, Repl 2019-09-09
    • Linked BF Score:
      19

      Description

      The rollback fuzzer test suites run in two phases.

      1. State Transition Phase - RollbackTest transitions to a predefined state before the rollback fuzzer enters the workload execution phase.
      2. Workload Execution Phase - the rollback fuzzer executes a list of random commands on the replica set (including the restartNode command, which can change the primary).

      After RollbackTest transitions to the "transitionToSyncSourceOperationsDuringRollback" state, the assumption mentioned here in the rollback fuzzer no longer holds. In that state, the topology looks like this:

      [CurSecondary/Node to be rolled back]
              |
              |
              |
      [CurPrimary]-------- [TieBreakerNode]

      Once curSecondary has rolled back successfully (i.e., caught up to curPrimary), restarting curPrimary can cause curSecondary to become the new primary. As a result, during the workload execution phase, an unconditional step down can happen because of slow planned node restarts (i.e., node restarts that take a long time), which leads to undesired behavior in the rollback_fuzzer_[un]clean_shutdowns suites. To fix the issue, the following two contracts should hold.
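
The failure mode can be illustrated with a small model of the topology above. This is only a sketch under simplifying assumptions, not MongoDB's election or step-down logic: a node "sees a majority" if it can directly reach more than half of the three nodes, counting itself, and a primary that cannot see a majority is assumed to step down. The helper names `visible_nodes` and `sees_majority` are hypothetical.

```python
# Undirected links in the topology after the state transition.
# curSecondary and tieBreaker are NOT connected to each other.
NODES = {"curSecondary", "curPrimary", "tieBreaker"}
LINKS = {
    frozenset({"curSecondary", "curPrimary"}),
    frozenset({"curPrimary", "tieBreaker"}),
}

def visible_nodes(node, down=frozenset()):
    """Nodes that `node` reaches directly (itself included) while `down` nodes restart."""
    if node in down:
        return set()
    seen = {node}
    for other in NODES - {node} - set(down):
        if frozenset({node, other}) in LINKS:
            seen.add(other)
    return seen

def sees_majority(node, down=frozenset()):
    """True when `node` can reach a strict majority of the replica set."""
    return len(visible_nodes(node, down)) > len(NODES) // 2

# Original primary: a slow restart of curSecondary leaves curPrimary + tieBreaker.
print(sees_majority("curPrimary", down={"curSecondary"}))

# If curSecondary became primary, a slow restart of curPrimary isolates it.
print(sees_majority("curSecondary", down={"curPrimary"}))
```

In this model the original primary keeps a majority while either other node restarts, whereas curSecondary as primary depends entirely on curPrimary staying up, so a slow restart of curPrimary would cost it its majority.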

      1) During the workload execution phase, an unconditional step down can happen only because of transient network issues, not because of slow planned node restarts (i.e., node restarts that take a long time).

      2) Node restarts issued by the rollback fuzzer can change the original primary only if all 3 nodes are connected.
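
As a rough illustration of contract 2, the check below treats the replica set as a set of undirected links and allows a primary change after a restart only when every pair of nodes is connected. It is a hypothetical sketch, not the fix that landed for this ticket; the function names and the link representation are assumptions made for the example.

```python
from itertools import combinations

def fully_connected(nodes, links):
    """True when every pair of nodes has a link between them."""
    return all(frozenset(pair) in links for pair in combinations(nodes, 2))

def restart_respects_contract(primary_before, primary_after, nodes, links):
    """Contract 2: a restart may change the primary only on a fully connected set."""
    if primary_before == primary_after:
        return True
    return fully_connected(nodes, links)

nodes = ["curSecondary", "curPrimary", "tieBreaker"]
partial = {
    frozenset({"curSecondary", "curPrimary"}),
    frozenset({"curPrimary", "tieBreaker"}),
}
full = partial | {frozenset({"curSecondary", "tieBreaker"})}

# A primary change while curSecondary and tieBreaker are disconnected violates the contract.
print(restart_respects_contract("curPrimary", "curSecondary", nodes, partial))
# The same change is acceptable once all three nodes are connected.
print(restart_respects_contract("curPrimary", "curSecondary", nodes, full))
```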

              People

              Assignee:
              suganthi.mani Suganthi Mani
              Reporter:
              suganthi.mani Suganthi Mani
              Votes:
              1
              Watchers:
              5
