Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.1.0-rc0, 8.0.0-rc9, 7.0.13, 6.0.17
Affects Version/s: 8.0.0-rc0, 8.1.0-rc0
Component/s: None
Labels: None
Assigned Teams: Catalog and Routing
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v8.0, v7.3, v7.0, v6.0
Sprint: CAR Team 2024-06-24
Linked BF Score: 0
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Context:
The test checks that running a transaction with a snapshot time == lastOpTime causes the primary to write a no-op entry to the oplog, instead of waiting for the internal clock to reach the requested time.

To do so, it runs the following steps on a 2-shard cluster:

1. Inserts on shard0 at time T
2. Checks that shard1 has a lastOpTime < T
3. Runs a find on shard1 so that lastOpTime == T
4. Runs a transaction on shard1 with snapshot time T

The problem:

If any write to the oplog occurs between points 3 and 4, the no-op write is skipped and the test fails.
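The race can be illustrated with a toy model. This is only a sketch, not server code: `ToyShard`, `unrelatedWrite`, and `runSnapshotTxn` are hypothetical names, and the rule "append a no-op only when the oplog is still behind the snapshot time" is an assumption made to mirror the description above.

```javascript
// Toy model only: ToyShard and its methods are hypothetical, not server code.
class ToyShard {
    constructor(lastOplogTime) {
        this.oplog = [{ts: lastOplogTime, op: "i"}];
    }

    lastOplogTime() {
        return this.oplog[this.oplog.length - 1].ts;
    }

    // Any unrelated oplog write, for example one caused by a background refresh.
    unrelatedWrite(ts) {
        this.oplog.push({ts: ts, op: "u"});
    }

    // Assumed rule: a no-op is appended only if the oplog is still behind
    // the requested snapshot time.
    runSnapshotTxn(snapshotTime) {
        if (this.lastOplogTime() < snapshotTime) {
            this.oplog.push({ts: snapshotTime, op: "n"});
        }
        return "committed";
    }

    noopCount() {
        return this.oplog.filter((entry) => entry.op === "n").length;
    }
}

const T = 10;

// No interfering write: the transaction has to append the no-op.
const quiet = new ToyShard(5);
quiet.runSnapshotTxn(T);

// An unrelated write lands between points 3 and 4: no no-op is needed,
// so a test that asserts on the no-op entry fails.
const racy = new ToyShard(5);
racy.unrelatedWrite(T);
racy.runSnapshotTxn(T);
```

In this model the transaction commits in both cases; only the presence of the no-op entry differs, which is exactly what the strict check depends on.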

The way the test is currently written is very fragile.

The test now fails more often because, in the “all-feature-flag” variant, collections are tracked on creation, which triggers an asynchronous refresh at the end of the creation.

The test was first banned because this refresh caused constant failures, and was re-enabled by SERVER-87461, which forces the test to wait for any ongoing collection refresh before starting.

However, the solution did not take into account database refreshes.

In BF-28563 the test fails because of a database refresh, which performs a no-op write to the oplog to verify that the node is still primary before persisting the refreshed data to disk.

Possible Solution:

The check seems too strict, and any future development of the database might interfere with the test.

The no-op write exists to ensure the transaction does not hang waiting for the snapshot time. Once the transaction completes, we can assume the no-op has been written.

Therefore we can relax the check: verify that the transaction completes and stop checking for the no-op write. We can use an assert.soon to prevent the test from hanging too long in case of error.
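A minimal sketch of the relaxed check follows. It is not the actual patch: `soon` is a local stand-in for the shell's `assert.soon` so the snippet runs outside the mongo shell, and `makeTxnAttempt` is a hypothetical placeholder for running the snapshot transaction.

```javascript
// Local stand-in for the mongo shell's assert.soon:
// poll until fn() returns a truthy value, or throw after timeoutMs.
function soon(fn, msg, timeoutMs = 1000) {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        if (fn()) {
            return;
        }
    }
    throw new Error("soon() timed out: " + msg);
}

// Hypothetical transaction attempt: reports success once the shard has
// caught up to the snapshot time (modelled here as a fixed attempt count).
function makeTxnAttempt(attemptsNeeded) {
    let attempts = 0;
    return function tryTxn() {
        attempts++;
        return attempts >= attemptsNeeded;
    };
}

const tryTxn = makeTxnAttempt(3);
let completed = false;

// Relaxed check: only require that the transaction completes in bounded time.
// Nothing here inspects the oplog for a no-op entry.
soon(() => {
    completed = tryTxn();
    return completed;
}, "snapshot transaction did not complete");
```

The design choice is that completion is the observable outcome the no-op write is meant to guarantee, so asserting on it stays valid even when some other oplog write makes the no-op unnecessary.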

Note that this solution assumes the read concern is working properly. However, we have several tests that provide those guarantees, such as:

jstests/noPassthrough/readConcern_atClusterTime.js

jstests/noPassthrough/readConcern_atClusterTime_snapshot_selection.js

Bonus:

The test is also very hard to debug. It would be nice to add some extra logs and rename some variables.

Assignee: Enrico Golfieri
Reporter: Enrico Golfieri
Participants: Enrico Golfieri, Githook User
Votes: 0
Watchers: 1

Created: Jun 10 2024 04:37:09 PM UTC
Updated: Aug 02 2024 07:41:15 AM UTC
Resolved: Jun 13 2024 03:42:35 PM UTC
Confidence Status Last Update: 11/Jun/24 3:12 PM
