  1. Core Server
  2. SERVER-58855

Improve/Fix the Race Condition in out_max_time_ms.js

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Trivial - P5
    • Fix Version/s: 5.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Fully Compatible
    • QE 2021-08-09
    • 127

      Two lines in out_max_time_ms.js have timing problems that occasionally cause build failures (BFs). They are marked L2 and L3 in the excerpt below; L1 marks the line that starts the parallel shell:

      /* >>> L1: */ const awaitShell = startParallelShell(shellStr, conn.port);
      
      /* >>> L2: */ waitForCurOpByFailPointNoNS(failPointConn.getDB("admin"), failPointName);
      
      /* >>> L3: */ assert.commandWorked(maxTimeMsConn.getDB("admin").runCommand(
          {configureFailPoint: "maxTimeNeverTimeOut", mode: "off"}));
      
      // The aggregation running in the parallel shell will hang on the failpoint, burning
      // its time. Wait until the maxTimeMS has definitely expired.
      sleep(maxTimeMS + 2000);
      
      // Now drop the failpoint, allowing the aggregation to proceed. It should hit an
      // interrupt check and terminate immediately.
      assert.commandWorked(
          failPointConn.getDB("admin").runCommand({configureFailPoint: failPointName, mode: "off"}));
      
      // Wait for the parallel shell to finish.
      assert.eq(awaitShell(), 0);
      

      L2 and L3 race with the parallel shell started at L1. The race is rarely lost, so the failure is intermittent.

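      Suggested solution #1, which lowers the probability of hitting this BF again: double maxTimeMS from 2 to 4 seconds, sleep for 1 second after starting the parallel shell, and wait maxTimeMS + 4000 ms instead of maxTimeMS + 2000 ms before dropping the failpoint: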

      diff --git a/jstests/noPassthrough/out_max_time_ms.js b/jstests/noPassthrough/out_max_time_ms.js
      index 36268ff645..0212c30a7e 100644
      --- a/jstests/noPassthrough/out_max_time_ms.js
      +++ b/jstests/noPassthrough/out_max_time_ms.js
      @@ -34,7 +34,7 @@ function forceAggregationToHangAndCheckMaxTimeMsExpires(
           // Use a short maxTimeMS so that the test completes in a reasonable amount of time. We will
           // use the 'maxTimeNeverTimeOut' failpoint to ensure that the operation does not prematurely
           // time out.
      -    const maxTimeMS = 1000 * 2;
      +    const maxTimeMS = 1000 * 4;
       
           // Enable a failPoint so that the write will hang.
           const failpointCommand = {
      @@ -66,6 +66,8 @@ function forceAggregationToHangAndCheckMaxTimeMsExpires(
           shellStr += `(${runAggregate.toString()})();`;
           const awaitShell = startParallelShell(shellStr, conn.port);
       
      +    sleep(1000);
      +
           waitForCurOpByFailPointNoNS(failPointConn.getDB("admin"), failPointName);
       
           assert.commandWorked(maxTimeMsConn.getDB("admin").runCommand(
      @@ -73,7 +75,7 @@ function forceAggregationToHangAndCheckMaxTimeMsExpires(
       
           // The aggregation running in the parallel shell will hang on the failpoint, burning
           // its time. Wait until the maxTimeMS has definitely expired.
      -    sleep(maxTimeMS + 2000);
      +    sleep(maxTimeMS + 4000);
       
           // Now drop the failpoint, allowing the aggregation to proceed. It should hit an
           // interrupt check and terminate immediately.
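
The cost of this patch can be read off the constants in the diff. The sketch below only does that arithmetic; the object and field names (`before`, `after`, `startupSleepMS`, `extraWaitMS`, `timingBudget`) are illustrative and do not appear in the test file.

```javascript
// Values copied from the diff above; names are hypothetical, chosen for this sketch.
const before = {maxTimeMS: 1000 * 2, startupSleepMS: 0, extraWaitMS: 2000};
const after = {maxTimeMS: 1000 * 4, startupSleepMS: 1000, extraWaitMS: 4000};

// Summarize the fixed sleeps performed by one call of the test helper.
function timingBudget({maxTimeMS, startupSleepMS, extraWaitMS}) {
    // Corresponds to sleep(maxTimeMS + N) in the test.
    const expirySleepMS = maxTimeMS + extraWaitMS;
    return {
        expirySleepMS,
        // Time left over after maxTimeMS has nominally elapsed.
        marginMS: expirySleepMS - maxTimeMS,
        // All unconditional sleeping, including the new sleep(1000).
        totalFixedSleepMS: startupSleepMS + expirySleepMS,
    };
}

console.log(timingBudget(before));
console.log(timingBudget(after));
```

Under these numbers the margin doubles from 2000 ms to 4000 ms, while the unconditional sleeping per call grows from 4000 ms to 9000 ms. The patch therefore trades test runtime for a wider window; it does not remove the race.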
      

      Suggested solution #2: rework L2 and L3 so that they cannot race with the parallel shell at all.
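
One general way to remove a timing race is to replace fixed sleeps with polling on an observable condition. The following is a minimal sketch of that idea only, not the fix that was applied for this ticket. `waitUntil`, `busySleep`, and `makeFakeClock` are hypothetical helpers written for this example, and the 1500 ms startup delay is an invented number. In the real jstest, the shell's own polling helper `assert.soon` would be the more likely building block.

```javascript
// Hypothetical fallback sleep for environments without a built-in sleep().
function busySleep(ms) {
    const end = Date.now() + ms;
    while (Date.now() < end) {
        // Spin until the deadline passes.
    }
}

// Poll `predicate` until it returns true or `timeoutMs` elapses.
// `now` and `sleep` are injectable so the helper can run against a fake clock.
function waitUntil(predicate,
                   {timeoutMs = 10000, intervalMs = 100, now = Date.now, sleep = busySleep} = {}) {
    const deadline = now() + timeoutMs;
    while (true) {
        if (predicate()) {
            return true;
        }
        if (now() >= deadline) {
            return false;
        }
        sleep(intervalMs);
    }
}

// Deterministic clock for the demonstration: sleeping just advances a counter.
function makeFakeClock() {
    let t = 0;
    return {now: () => t, sleep: (ms) => { t += ms; }};
}

// Assumed scenario: the parallel shell's operation reaches the failpoint after 1500 ms,
// which is later than a fixed sleep(1000) would have waited.
const clock = makeFakeClock();
const opReachesFailpointAtMS = 1500;
const opIsHanging = () => clock.now() >= opReachesFailpointAtMS;

const reached = waitUntil(opIsHanging, {timeoutMs: 5000, now: clock.now, sleep: clock.sleep});
console.log(reached, clock.now());
```

Polling waits only as long as the condition takes to become true and fails with a clear timeout otherwise, so its behavior does not depend on guessing how slow the parallel shell might be.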

            Assignee:
            mohammad.dashti@mongodb.com Mohammad Dashti (Inactive)
            Reporter:
            mohammad.dashti@mongodb.com Mohammad Dashti (Inactive)
            Votes:
            0
            Watchers:
            2
