[SERVER-69068] Further investigate random failures in multi-tenant-passthrough test cases Created: 23/Aug/22 Updated: 16/Oct/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Rishab Joshi (Inactive) | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Assigned Teams: |
Query Execution
|
||||||||||||
| Participants: | |||||||||||||
| Description |
|
To mitigate this issue, a sleep was added. The current explanation for the problem is as follows. We create the change collection on every primary node explicitly and independently by issuing the enablement command. As a result, the corresponding oplog entries get different timestamps on each node, for example Timestamp(133456788, 2) on node 1 and Timestamp(123456789, 2) on node 2. These timestamps also define the start timestamp of each change collection. Because the timestamps differ between the nodes, a getMore with timestamp Timestamp(133456788, 1) on node 2 will cause the change stream to fail. Since no entity (like configSvr in the case of mongoS) orchestrates the creation process, the differing timestamps on different nodes seem to be causing this situation. And since the differences in timestamps between nodes are small (the test fixture spins up nodes quickly), the sleep gives the periodic noop writer time to write a few oplog entries and bump up the timestamp. The latest oplog timestamp is then later than the first entry of every change collection, which prevents the failures.
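The suspected failure condition can be illustrated with a small model. This is only a sketch of the explanation above, not server code: `Timestamp`, `can_serve`, and the numeric values are hypothetical, and it assumes a node can only serve a read whose timestamp is at or after the first entry of its own change collection.

```python
from typing import NamedTuple


class Timestamp(NamedTuple):
    """Hypothetical stand-in for an oplog timestamp: (seconds, increment)."""
    seconds: int
    increment: int


def can_serve(change_collection_start: Timestamp, requested: Timestamp) -> bool:
    """Assumed rule: a read succeeds only if the requested timestamp is not
    earlier than the first entry of the node's change collection."""
    # NamedTuple instances compare lexicographically: seconds, then increment.
    return requested >= change_collection_start


# Hypothetical start timestamps; the nodes enabled the change collection
# independently, about one second apart.
node1_start = Timestamp(100, 2)
node2_start = Timestamp(101, 2)

# A timestamp obtained while talking to node 1.
requested = Timestamp(100, 3)

print(can_serve(node1_start, requested))
print(can_serve(node2_start, requested))
```

Under these assumptions the same timestamp is valid on node 1 but predates the change collection on node 2, which matches the failure described above.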
Note that there is already a ticket to enable change streams in mongoQ, SERVER-68341, which should solve this problem. This ticket is about investigating the issue further and coming up with a better workaround (one that does not use sleep) for the time being. |
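One possible direction for a workaround that avoids a fixed sleep is to poll until the latest oplog timestamp has passed the start of every change collection. The sketch below is an assumption, not an agreed design: `wait_until_past_all_starts` and the `get_latest_oplog_ts` callable are hypothetical names, timestamps are modelled as plain `(seconds, increment)` tuples, and how a test would actually read those values from each node is left open.

```python
import time
from typing import Callable, Iterable, Tuple

# Hypothetical representation of an oplog timestamp: (seconds, increment).
Ts = Tuple[int, int]


def wait_until_past_all_starts(
    get_latest_oplog_ts: Callable[[], Ts],
    change_collection_starts: Iterable[Ts],
    timeout_secs: float = 30.0,
    poll_interval_secs: float = 0.1,
) -> Ts:
    """Poll until the latest oplog timestamp is later than the first entry
    of every change collection, then return that timestamp."""
    # The latest start timestamp is the barrier every node must have passed.
    barrier = max(change_collection_starts)
    deadline = time.monotonic() + timeout_secs
    while True:
        latest = get_latest_oplog_ts()
        if latest > barrier:
            return latest
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"latest oplog timestamp {latest} did not pass {barrier}"
            )
        time.sleep(poll_interval_secs)
```

Compared with a fixed sleep, this waits only as long as needed and fails loudly with a timeout if the timestamp never advances.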