Context:
The test checks that running a transaction with a snapshot time == lastOpTime causes the primary to write a no-op entry in the oplog, instead of waiting for the internal clock to reach the requested time.
It runs as follows:
Given a 2-shard cluster:
1. Insert on shard0 at time T
2. Check that shard1 has a lastOpTime < T
3. Run a find on shard1 so that lastOpTime == T
4. Run a transaction on shard1 with snapshot time T
The problem:
If any write to the oplog occurs between point 3 (the find) and point 4 (the transaction), the no-op insertion is skipped, causing the test to fail.
The way the test is currently written is very fragile.
The test now fails more often because, in the “all-feature-flag” variant, we track the collection on creation, which triggers an asynchronous refresh at the end of the creation.
The test was first banned because this refresh caused constant failures, then re-enabled by SERVER-87461, which makes the test wait for any ongoing collection refresh before starting.
However, that fix did not take database refreshes into account.
In BF-28563 the test fails because of a database refresh, which performs a no-op write in the oplog to verify that the node is still primary before storing the refreshed data on disk.
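The race can be illustrated with a toy model. This is only a sketch that takes the description in this ticket literally (a no-op is written when snapshot time == lastOpTime); it is not server code, and `ToyShard`, `backgroundWrite` and `runScenario` are hypothetical names made up for the illustration.

```javascript
// Toy model of shard1. It encodes only the rule stated in this ticket and
// ignores everything else the real server does.
class ToyShard {
    constructor(lastOpTime) {
        this.lastOpTime = lastOpTime;
        this.oplog = [];
    }

    // Point 3: the find brings lastOpTime up to T (as described in the ticket).
    find(clusterTime) {
        this.lastOpTime = Math.max(this.lastOpTime, clusterTime);
    }

    // Any unrelated oplog write (for example the no-op written by a database
    // refresh) advances lastOpTime past T.
    backgroundWrite(kind) {
        this.lastOpTime += 1;
        this.oplog.push({kind: kind, ts: this.lastOpTime});
    }

    // Point 4: the transaction writes its no-op only when
    // snapshot time == lastOpTime. It completes in both cases.
    runSnapshotTxn(snapshotTime) {
        if (snapshotTime === this.lastOpTime) {
            this.oplog.push({kind: "txnNoop", ts: snapshotTime});
        }
        return "committed";
    }
}

function runScenario(interferingWrite) {
    const T = 10;
    const shard1 = new ToyShard(5);  // point 2: lastOpTime < T
    shard1.find(T);                  // point 3
    if (interferingWrite) {
        shard1.backgroundWrite("databaseRefresh");
    }
    const result = shard1.runSnapshotTxn(T);  // point 4
    const noopWritten = shard1.oplog.some((e) => e.kind === "txnNoop");
    return {result: result, noopWritten: noopWritten};
}

console.log(runScenario(false));  // noopWritten is true: the current check passes
console.log(runScenario(true));   // noopWritten is false: the current check fails
```

In both scenarios the transaction itself completes; only the assertion on the no-op entry is sensitive to the interleaving.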
Possible Solution:
The check seems too strict, and any future development of the database might interfere with the test.
The no-op write exists to ensure the transaction does not hang waiting for the snapshot time. Once the transaction completes, we can assume the no-op has been written.
Therefore we can relax the check: verify that the transaction completes and stop checking for the no-op write. We can use an assert.soon to prevent the test from hanging too long in case of error.
Note that this solution assumes the read concern is working properly. However, several existing tests provide those guarantees, such as:
jstests/noPassthrough/readConcern_atClusterTime.js
jstests/noPassthrough/readConcern_atClusterTime_snapshot_selection.js
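The relaxed check proposed above could look roughly like the sketch below. It is written so that it runs outside the mongo shell: `soon` is a small stand-in for the shell's `assert.soon`, and `assertSnapshotTxnCompletes` and `runTxnOnce` are hypothetical names. It also assumes that each attempt returns or throws in bounded time (for instance through a per-command time limit), which this ticket does not specify.

```javascript
// Stand-in for the mongo shell's assert.soon: poll a predicate until it
// returns true or the timeout expires. In the real jstest the shell's own
// assert.soon would be used instead.
function soon(predicate, msg, timeoutMs = 5000, intervalMs = 50) {
    const deadline = Date.now() + timeoutMs;
    const sleeper = new Int32Array(new SharedArrayBuffer(4));
    while (true) {
        if (predicate()) {
            return;
        }
        if (Date.now() >= deadline) {
            throw new Error("assert.soon failed: " + msg);
        }
        // Synchronous sleep between attempts.
        Atomics.wait(sleeper, 0, 0, intervalMs);
    }
}

// Relaxed check: only require that the transaction eventually completes.
// runTxnOnce is a hypothetical callback that runs the snapshot transaction
// on shard1 once and throws if that attempt fails or times out.
function assertSnapshotTxnCompletes(runTxnOnce, timeoutMs) {
    soon(() => {
        try {
            runTxnOnce();
            return true;
        } catch (e) {
            return false;  // retry until the deadline
        }
    }, "transaction with snapshot time T did not complete", timeoutMs);
}
```

With this shape the test no longer inspects the oplog, so an unrelated write such as a refresh cannot make it fail; it fails only if the transaction never completes before the timeout.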
Bonus:
The test is also very hard to debug. It would be nice to add some extra logs and rename some variables.