- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
-
Fully Compatible
-
ALL
-
Query 2020-09-21, Query 2020-10-05
-
22
As described in BF-18570, jstests/noPassthrough/read_concern_snapshot_yielding.js randomly hangs and times out with the following message:
assert.soon failed: () => { results = adminDB.aggregate([{$currentOp: options}, {$match: filter}]).toArray(); return results.length > 0; } : Failed to find a matching op for filter: { "$and" : [ { "ns" : "test.coll" }, { "op" : "update" }, { "$or" : [ { "failpointMsg" : "setInterruptOnlyPlansCheckForInterruptHang" }, { "msg" : "setInterruptOnlyPlansCheckForInterruptHang" } ] } ] }in currentOp output: [ ]
The root cause is that the trapping mechanism of the setInterruptOnlyPlansCheckForInterruptHang fail point is vulnerable to uneven CPU time scheduling. The test jstests/noPassthrough/read_concern_snapshot_yielding.js (https://github.com/mongodb/mongo/blob/fd8e132ebe4d544a5c99d81fffa2ffb8fcb3f841/jstests/noPassthrough/read_concern_snapshot_yielding.js#L31) assumes that commands will yield on the second try, but a command can actually yield on the first try if the 10ms time window has already closed (https://github.com/mongodb/mongo/blob/fd8e132ebe4d544a5c99d81fffa2ffb8fcb3f841/src/mongo/util/elapsed_tracker.cpp#L47). When that happens, the transaction start command blocks on the setInterruptOnlyPlansCheckForInterruptHang fail point, which is not expected. The thread then never reaches the point where it is expected to block on the fail point, and the main test thread times out.
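The race can be illustrated with a minimal sketch of a yield tracker that fires either after a fixed number of checks or once a time window has passed. This is not the actual code in elapsed_tracker.cpp: the class name `SketchElapsedTracker`, the injected `nowMillis` parameter, and the hit threshold of 2 are assumptions made for illustration. Only the 10ms window comes from the description above.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for a yield tracker (not MongoDB's real class).
// It reports "interval elapsed" when EITHER enough checks have been made
// OR more than millisBetweenMarks have passed since the last mark.
class SketchElapsedTracker {
public:
    SketchElapsedTracker(int hitsBetweenMarks, std::int64_t millisBetweenMarks)
        : _hitsBetweenMarks(hitsBetweenMarks), _millisBetweenMarks(millisBetweenMarks) {}

    // nowMillis is passed in explicitly so the behavior is deterministic;
    // a real implementation would read a clock instead.
    bool intervalHasElapsed(std::int64_t nowMillis) {
        // Count-based trigger: the path the test relies on.
        if (++_hits >= _hitsBetweenMarks) {
            mark(nowMillis);
            return true;
        }
        // Time-based trigger: can fire on the very first check if the
        // thread was descheduled for longer than the window.
        if (nowMillis - _lastMarkMillis > _millisBetweenMarks) {
            mark(nowMillis);
            return true;
        }
        return false;
    }

private:
    void mark(std::int64_t nowMillis) {
        _hits = 0;
        _lastMarkMillis = nowMillis;
    }

    int _hitsBetweenMarks;
    std::int64_t _millisBetweenMarks;
    int _hits = 0;
    std::int64_t _lastMarkMillis = 0;
};
```

With a threshold of 2 hits and a 10ms window, checks at 1ms and 2ms yield on the second try, as the test expects. If the first check is delayed until 15ms, the time-based branch fires and the command yields on the first try, which would correspond to the unexpected blocking described above.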