[SERVER-69068] Further investigate random failures in multi-tenant-passthrough test cases Created: 23/Aug/22  Updated: 16/Oct/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Rishab Joshi (Inactive) Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-68341 Implement enable/disable command for ... Backlog
Assigned Teams:
Query Execution
Participants:

 Description   

After SERVER-66631, change_stream_multitenant_sharded_cluster_passthrough started failing randomly in different test cases. The failures were related to ChangeStreamHistoryLost.
Evergreen Link here.

To mitigate this issue, a sleep was added.

The current explanation for this problem is as follows:

We create the change collection on every primary node explicitly and independently by issuing the enablement command.
Each node's latest oplog timestamp might be different. For example, the latest oplog timestamp for node 1 might be Timestamp(133456788, 1), and for node 2 it could be Timestamp(123456789, 1).

As such, when we create the change collection on these nodes, the corresponding oplog entries would be Timestamp(133456788, 2) on node 1 and Timestamp(123456789, 2) on node 2. These entries also define the start timestamp of each change collection.

Since the start timestamps differ between the two nodes, a getMore with timestamp Timestamp(133456788, 1) on node 2 will cause the change stream to fail.
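
The mismatch described above can be illustrated with a small model. This is only a sketch, not server code: it assumes that creating the change collection writes the next oplog entry on that node, and that a read fails when it asks for a point earlier than the change collection's first entry. The Timestamp class, both helper functions, and the timestamp values are hypothetical and chosen for illustration.

```python
from typing import NamedTuple


class Timestamp(NamedTuple):
    """Simplified stand-in for an oplog timestamp: (seconds, increment)."""
    t: int
    i: int


def change_collection_start(latest_oplog: Timestamp) -> Timestamp:
    # Assumption: creating the change collection writes the next oplog entry
    # on that node, so the start is the latest timestamp with increment + 1.
    return Timestamp(latest_oplog.t, latest_oplog.i + 1)


def history_lost(requested: Timestamp, start: Timestamp) -> bool:
    # Assumption: a read fails when it asks for a point in time that is
    # earlier than the first entry of the node's change collection.
    return requested < start


# Hypothetical values, not taken from a real test run.
starts = {
    "node1": change_collection_start(Timestamp(100, 1)),
    "node2": change_collection_start(Timestamp(105, 1)),
}

# A stream opened at node1's start is too early for node2.
requested = starts["node1"]
result = {node: history_lost(requested, start) for node, start in starts.items()}
print(result)
```

Under these assumptions, a timestamp that is valid on one node can still precede the start of the change collection on another node, because each node picked its start independently.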

Since there is no entity (like configSvr in the case of mongoS) that orchestrates the creation process, the differences in the timestamps on different nodes seem to be causing this situation.

Because the differences in timestamps between nodes are small (the test fixture spins up nodes quickly), the sleep gives the periodic no-op writer time to write a few oplog entries and bump up the timestamp. The latest oplog timestamp is then later than the first entry of every change collection, which prevents the failures.
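
One possible shape for a workaround that does not rely on a fixed sleep is to wait on the condition itself: poll until the latest timestamp is later than the start of every change collection. The sketch below only shows the idea. The names safe_read_time and wait_until_past are hypothetical, timestamps are modelled as plain (seconds, increment) tuples, and get_latest stands in for however a test would obtain the latest oplog timestamp.

```python
import time
from typing import Callable, Iterable, Tuple

Timestamp = Tuple[int, int]  # (seconds, increment), a simplified model


def safe_read_time(starts: Iterable[Timestamp]) -> Timestamp:
    # The latest start among all change collections is the point that the
    # oplog has to move past before reads are safe on every node.
    return max(starts)


def wait_until_past(
    get_latest: Callable[[], Timestamp],
    target: Timestamp,
    timeout_s: float = 5.0,
    interval_s: float = 0.01,
) -> Timestamp:
    # Poll the (hypothetical) get_latest callback until it reports a
    # timestamp strictly later than target, or give up after timeout_s.
    deadline = time.monotonic() + timeout_s
    while True:
        latest = get_latest()
        if latest > target:
            return latest
        if time.monotonic() >= deadline:
            raise TimeoutError(f"latest timestamp {latest} never passed {target}")
        time.sleep(interval_s)


# Hypothetical values: two change collections and a fake clock that
# advances on every poll, as the periodic no-op writer would.
starts = [(100, 2), (105, 2)]
ticks = iter([(100, 3), (104, 1), (106, 1)])
reached = wait_until_past(lambda: next(ticks), safe_read_time(starts), interval_s=0.0)
print(reached)
```

Compared with a fixed sleep, a condition-based wait returns as soon as the timestamps line up and fails loudly with a timeout if they never do.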

 

Note that there is already a ticket to enable change streams in mongoQ, SERVER-68341, and that should solve this problem. This ticket is about investigating the issue further and coming up with a better workaround (one that does not use sleep) for the time being.


Generated at Thu Feb 08 06:12:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.