  1. Core Server
  2. SERVER-47343

Add/update failpoints to make cancelCommand integration tests more robust

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4.0-rc0, 4.7.0
    • Component/s: Internal Code
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4
    • Sprint:
      Service arch 2020-04-20
    • Linked BF Score:
      23

      Description

      • CancelRemotelyTimeout should not wait for the _killOperations command issued by cancelCommand to be killed when it times out, because we don't kill operations without operation keys (i.e. without hedged options). The test rarely fails or hangs because the currentOp check and the _killOperations command both have a timeout of 1 second. However, the start of the _killOperations operation races with the start of the currentOp check: the currentOp check can return without waiting because no _killOperations operation is running yet, and the test then fails because the number of timed-out commands is zero. To fix this, we should remove the currentOp check and instead use a failpoint to wait for the _killOperations command to time out. That failpoint should be placed at the end of NetworkInterfaceTL::tryFinish and have a predicate that returns true only when the request's command name and the response status's error code match the command name and error code given to the failpoint.
      • The failpoint "networkInterfaceAfterAcquireConn" was added to ensure that cancelCommand does not start running until the command acquires a connection (otherwise, no _killOperations command will be issued). However, it does not check the command name, so other commands running in the background can also enter this failpoint if they happen to run while the test is in this block. To avoid this, we should replace the failpoint with a function similar to waitForCommand that runs currentOp repeatedly until a matching operation is running.
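
The two proposals above can be illustrated with a minimal, self-contained C++ sketch. It is not the actual server code: FailPointFilter, FinishedCommand, OpInfo, shouldEnterFailPoint and waitForMatchingOp are hypothetical names introduced here for illustration, and the listOps callback stands in for running currentOp against the server. The first function shows a predicate that matches on both command name and error code; the second shows a polling helper that returns only once an operation with the wanted command name is visible.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not MongoDB types.
struct FailPointFilter {
    std::string cmdName;  // command name the test configured on the failpoint
    int errorCode;        // error code the test expects in the response status
};

struct FinishedCommand {
    std::string cmdName;  // command name taken from the request
    int statusCode;       // error code taken from the response status
};

// Predicate for a failpoint at the end of a tryFinish-like function.
// It returns true only when BOTH the command name and the error code match,
// so unrelated background commands never enter the failpoint.
bool shouldEnterFailPoint(const FailPointFilter& filter, const FinishedCommand& cmd) {
    return filter.cmdName == cmd.cmdName && filter.errorCode == cmd.statusCode;
}

struct OpInfo {
    std::string cmdName;  // command name of one in-progress operation
};

// Polls listOps (a stand-in for currentOp) until an operation with the given
// command name is running. Returns false if none shows up within maxAttempts.
bool waitForMatchingOp(const std::function<std::vector<OpInfo>()>& listOps,
                       const std::string& cmdName,
                       int maxAttempts,
                       std::chrono::milliseconds interval) {
    assert(maxAttempts > 0);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        for (const auto& op : listOps()) {
            if (op.cmdName == cmdName) {
                return true;  // a matching operation is running
            }
        }
        std::this_thread::sleep_for(interval);  // back off before polling again
    }
    return false;
}
```

Filtering by command name in both places is what removes the dependence on timing: the test proceeds only when the specific command it cares about has reached the expected state, regardless of what else is running in the background.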

            People

            Assignee:
            cheahuychou.mao Cheahuychou Mao
            Reporter:
            cheahuychou.mao Cheahuychou Mao
