- Type: Improvement
- Resolution: Unresolved
- Priority: Unknown
- None
- Component/s: Unified Test Runner
- None
- Needed - No Spec Changes
Summary
The aggregate-write-readPreference.json test case "Aggregate with $out includes read preference for 5.0+ server" is racy in replica set environments, especially with 8.0+ servers. MongoDB 8.0 changed how writeConcern works: an operation is now considered done once the oplog entry has been successfully replicated to the required number of secondaries. The data might therefore not be available on a secondary yet, because the oplog entry is applied slightly later. See SPM-3489 for details.
In this particular test case, this means that initialData may not yet be available when we run the aggregate operation with the $out stage on a secondary.
On CSharp Driver's CI we see the following error from time to time:
Command aggregate failed: Executor error during aggregate command on namespace: db0.coll0 :: caused by :: collection dropped. UUID 82b204fc-dcc1-4782-847e-38ca43c284d6.
The error is caused by a race condition: the Unified Test Runner executes the aggregate on a secondary node after the collection drop has been applied there, but before the insert operation from initialData has been applied.
The Java team confirmed they see the same behavior.
I've prototyped a mitigation: insert the initialData using a causally consistent session, then use the same session to execute any operation that supports afterClusterTime on each secondary. This forces each secondary to wait until its local clusterTime reaches the value from the last insert operation on the primary. This has to be done as part of initialData population because clusterTime tracking is encapsulated in the session, so there is no way to "gossip" the last clusterTime to an operation on another mongoClient. However, doing this for every test case might be relatively expensive, so I suggest adding a new setting to initialData that enables this validation for tests that expect to work with secondary nodes.
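To illustrate why the afterClusterTime barrier removes the race, here is a minimal, driver-free sketch in Python. It is a toy model only, not the prototype and not any driver's API: the ToySecondary class, its methods, and the integer cluster times are all hypothetical, and the "wait" is simulated by applying pending oplog entries. A stale read in this model simply returns no documents rather than raising the "collection dropped" error.

```python
class ToySecondary:
    """Hypothetical model: oplog entries are replicated first and applied later."""

    def __init__(self):
        self.pending = []      # replicated but not yet applied: (cluster_time, op, docs)
        self.applied_time = 0  # cluster time of the last applied entry
        self.docs = []         # documents currently visible to reads

    def replicate(self, cluster_time, op, docs=()):
        # The write can be acknowledged at this point, before it is applied.
        self.pending.append((cluster_time, op, list(docs)))

    def apply_next(self):
        cluster_time, op, docs = self.pending.pop(0)
        if op == "drop":
            self.docs = []
        elif op == "insert":
            self.docs = self.docs + docs
        self.applied_time = cluster_time

    def read(self, after_cluster_time=None):
        # Stand-in for afterClusterTime: the read does not proceed until the
        # node has applied everything up to the requested cluster time.
        if after_cluster_time is not None:
            while self.applied_time < after_cluster_time and self.pending:
                self.apply_next()
        return list(self.docs)


initial_data = [{"_id": 1}, {"_id": 2}]
secondary = ToySecondary()
secondary.replicate(1, "drop")                  # drop from initialData setup
secondary.replicate(2, "insert", initial_data)  # insert from initialData setup
last_insert_time = 2  # what the causally consistent session would track

secondary.apply_next()  # only the drop has been applied so far

racy_read = secondary.read()  # no barrier: initialData is not visible yet
safe_read = secondary.read(after_cluster_time=last_insert_time)
```

In this model the unguarded read observes the dropped collection without the inserted documents, while the read carrying the cluster time of the last insert observes the full initialData.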
Motivation
How does this affect the end user?
End users are not affected.
How likely is it that this problem or use case will occur?
We see 1-2 variants fail with this error per test run on EG.
Is this issue urgent?
Nope
Is this ticket required by a downstream team?
Nope
Is this ticket only for tests?
Yes