[SERVER-58855] Improve/Fix the Race Condition in out_max_time_ms.js Created: 26/Jul/21  Updated: 29/Oct/23  Resolved: 04/Aug/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.1.0-rc0

Type: Improvement Priority: Trivial - P5
Reporter: Mohammad Dashti (Inactive) Assignee: Mohammad Dashti (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-60586 out_max_time_ms.js does not correctly... Closed
Backwards Compatibility: Fully Compatible
Sprint: QE 2021-08-09
Participants:
Linked BF Score: 127

 Description   

There are two timing errors in out_max_time_ms.js that can occasionally lead to BFs. These two lines are highlighted here and marked with L2 and L3 below:

/* >>> L1: */const awaitShell = startParallelShell(shellStr, conn.port);
 
/* >>> L2: */ waitForCurOpByFailPointNoNS(failPointConn.getDB("admin"), failPointName);
 
/* >>> L3: */ assert.commandWorked(maxTimeMsConn.getDB("admin").runCommand(
    {configureFailPoint: "maxTimeNeverTimeOut", mode: "off"}));
 
// The aggregation running in the parallel shell will hang on the failpoint, burning
// its time. Wait until the maxTimeMS has definitely expired.
sleep(maxTimeMS + 2000);
 
// Now drop the failpoint, allowing the aggregation to proceed. It should hit an
// interrupt check and terminate immediately.
assert.commandWorked(
    failPointConn.getDB("admin").runCommand({configureFailPoint: failPointName, mode: "off"}));
 
// Wait for the parallel shell to finish.
assert.eq(awaitShell(), 0);

L2 and L3 have a race condition with L1, which occurs rarely.

Suggested solution #1 to decrease the probability of getting into this BF again:

diff --git a/jstests/noPassthrough/out_max_time_ms.js b/jstests/noPassthrough/out_max_time_ms.js
index 36268ff645..0212c30a7e 100644
--- a/jstests/noPassthrough/out_max_time_ms.js
+++ b/jstests/noPassthrough/out_max_time_ms.js
@@ -34,7 +34,7 @@ function forceAggregationToHangAndCheckMaxTimeMsExpires(
     // Use a short maxTimeMS so that the test completes in a reasonable amount of time. We will
     // use the 'maxTimeNeverTimeOut' failpoint to ensure that the operation does not prematurely
     // time out.
-    const maxTimeMS = 1000 * 2;
+    const maxTimeMS = 1000 * 4;
 
     // Enable a failPoint so that the write will hang.
     const failpointCommand = {
@@ -66,6 +66,8 @@ function forceAggregationToHangAndCheckMaxTimeMsExpires(
     shellStr += `(${runAggregate.toString()})();`;
     const awaitShell = startParallelShell(shellStr, conn.port);
 
+    sleep(1000);
+
     waitForCurOpByFailPointNoNS(failPointConn.getDB("admin"), failPointName);
 
     assert.commandWorked(maxTimeMsConn.getDB("admin").runCommand(
@@ -73,7 +75,7 @@ function forceAggregationToHangAndCheckMaxTimeMsExpires(
 
     // The aggregation running in the parallel shell will hang on the failpoint, burning
     // its time. Wait until the maxTimeMS has definitely expired.
-    sleep(maxTimeMS + 2000);
+    sleep(maxTimeMS + 4000);
 
     // Now drop the failpoint, allowing the aggregation to proceed. It should hit an
     // interrupt check and terminate immediately.

Suggested solution #2: improve L2 and L3 to make sure there wouldn't be any race conditions.



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 04/Aug/21 ]

Author:

{'name': 'Mohammad Dashti', 'email': 'mdashti@gmail.com', 'username': 'mdashti'}

Message: SERVER-58855 Increase the number of inserted documents in `out_max_time_ms.js`
Branch: master
https://github.com/mongodb/mongo/commit/d091d1167a5c86b3ccc005a14d4d48bcc3890dc3

Generated at Thu Feb 08 05:45:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.