[SERVER-77296] Ban fsm_workloads/agg_sort.js in concurrency_sharded_causal_consistency_and_balancer Created: 18/May/23 Updated: 29/Oct/23 Resolved: 22/May/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0, 7.0.0-rc3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Adi Agrawal | Assignee: | Adi Agrawal |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: | v7.0, v6.3 |
||||||||
| Steps To Reproduce: | Increase the number of threads (something like 50 should work) for the workload and run the test. |
||||||||
| Sprint: | QE 2023-05-29 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 38 | ||||||||
| Description |
|
When running in the concurrency_sharded_causal_consistency_and_balancer suite, we hit an assertion failure because the workload does a find on a secondary that is not yet fully up to date with the primary. |
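For illustration, a ban of this kind is usually expressed as an exclusion in the suite's resmoke YAML definition. The fragment below is a minimal sketch, assuming the suite file follows the usual `selector` / `exclude_files` layout and that the workload lives under `jstests/concurrency/fsm_workloads/`. The `roots` glob and the comment wording are assumptions and are not taken from the commit attached to this ticket.

```yaml
# Hypothetical excerpt of the suite definition for
# concurrency_sharded_causal_consistency_and_balancer (layout assumed).
test_kind: fsm_workload_test

selector:
  roots:
  # Assumed glob for the FSM workloads this suite picks up.
  - jstests/concurrency/fsm_workloads/**/*.js
  exclude_files:
  # SERVER-77296: the workload asserts on results read from a secondary
  # that may not yet reflect the primary's writes in this suite.
  - jstests/concurrency/fsm_workloads/agg_sort.js
```

Excluding the file at the suite level leaves the workload running in the other concurrency suites, so only this suite loses its coverage.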
| Comments |
| Comment by Githook User [ 25/May/23 ] |
|
Author: Adityavardhan Agrawal <adi.agrawal@mongodb.com> (Adityav369) Message: (cherry picked from commit 12f7188f226c4ff47a0f89f04904e8a407cd8cd3) |
| Comment by Githook User [ 22/May/23 ] |
|
Author: Adityavardhan Agrawal <adi.agrawal@mongodb.com> (Adityav369) Message: |
| Comment by Max Hirschhorn [ 18/May/23 ] |
|
I would appreciate an explanation for why the {readConcern: {afterClusterTime: <timestamp of write on primary>}} being issued to the secondary isn't sufficient for providing causal consistency to the agg_sort.js FSM workload. As stated, it sounds like a violation of causal consistency in the server, which would be bad. Is it because documents that became unowned after a chunk migration commit are being quickly deleted from the secondary due to the much lower orphanCleanupDelaySecs setting of 1 second in our testing infrastructure? If so, then this sounds related to |
| Comment by Adi Agrawal [ 18/May/23 ] |
|
Requesting backport on v7.0 and v6.3 since we see failures on those variants regularly. |