[SERVER-77296] Ban fsm_workloads/agg_sort.js in concurrency_sharded_causal_consistency_and_balancer
Created: 18/May/23  Updated: 29/Oct/23  Resolved: 22/May/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.0-rc3

Type: Bug
Priority: Major - P3
Reporter: Adi Agrawal
Assignee: Adi Agrawal
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v7.0, v6.3
Steps To Reproduce:

Increase the workload's thread count (around 50 should be enough) and run the test in the concurrency_sharded_causal_consistency_and_balancer suite.
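
As a rough illustration of the thread-count change, the sketch below bumps the count on a stand-in config object. The baseConfig object and the withThreadCount helper are hypothetical and exist only for this example; the threadCount and iterations field names follow the FSM workload convention, and the starting values are placeholders rather than the real settings in agg_sort.js.

```javascript
// Hypothetical stand-in for the $config object an FSM workload exports.
// Field names follow the FSM convention; the values are placeholders.
const baseConfig = {
    threadCount: 5,
    iterations: 10,
};

// Return a copy of the config with a different thread count, leaving the
// original object untouched so a default run is unaffected.
function withThreadCount(config, threadCount) {
    if (!Number.isInteger(threadCount) || threadCount < 1) {
        throw new RangeError("threadCount must be a positive integer");
    }
    return Object.assign({}, config, {threadCount: threadCount});
}

// Around 50 threads, per the reproduction steps.
const stressConfig = withThreadCount(baseConfig, 50);
console.log(stressConfig.threadCount);
```

In the actual workload file the equivalent change would be made directly on its $config before running the suite.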

Sprint: QE 2023-05-29
Participants:
Linked BF Score: 38

 Description   

The workload hits an assertion failure when run in the concurrency_sharded_causal_consistency_and_balancer suite because it issues a find against a secondary that has not yet caught up with the primary.
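
For context, a causally consistent read sent to a secondary has roughly the shape sketched below: the find carries a readConcern with afterClusterTime set to the time of the preceding write. This is an illustrative sketch only. The buildCausalFind helper, the collection name, and the filter are hypothetical, and the plain {t, i} object stands in for a real BSON Timestamp; in practice the session machinery attaches these fields rather than the workload building them by hand.

```javascript
// Build a find command that asks the target node to wait until it has
// applied the given cluster time before serving the read.
function buildCausalFind(collName, filter, afterClusterTime) {
    return {
        find: collName,
        filter: filter,
        // The node should not answer until its data reflects this time.
        readConcern: {afterClusterTime: afterClusterTime},
        // Route the read to a secondary instead of the primary.
        $readPreference: {mode: "secondary"},
    };
}

// Placeholder for the operation time of the write acknowledged by the
// primary (a real driver would supply a BSON Timestamp here).
const writeOperationTime = {t: 1684400000, i: 1};

const cmd = buildCausalFind("example_coll", {flag: true}, writeOperationTime);
console.log(JSON.stringify(cmd, null, 2));
```

With a command like this, the secondary is expected to reflect the earlier write before it answers, which is why a stale result in this suite needs an explanation beyond ordinary replication lag.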



 Comments   
Comment by Githook User [ 25/May/23 ]

Author: Adityavardhan Agrawal <adi.agrawal@mongodb.com> (Adityav369)

Message: SERVER-77296 Ban fsm_workloads/agg_sort.js in concurrency_sharded_causal_consistency_and_balancer.yml

(cherry picked from commit 12f7188f226c4ff47a0f89f04904e8a407cd8cd3)
Branch: v7.0
https://github.com/mongodb/mongo/commit/7745e4a65ae9f88053abc97f229dc6b4125e75f2

Comment by Githook User [ 22/May/23 ]

Author: Adityavardhan Agrawal <adi.agrawal@mongodb.com> (Adityav369)

Message: SERVER-77296 Ban fsm_workloads/agg_sort.js in concurrency_sharded_causal_consistency_and_balancer.yml
Branch: master
https://github.com/mongodb/mongo/commit/12f7188f226c4ff47a0f89f04904e8a407cd8cd3

Comment by Max Hirschhorn [ 18/May/23 ]

I would appreciate an explanation for why the {readConcern: {afterClusterTime: <timestamp of write on primary>}} being issued to the secondary isn't sufficient for providing causal consistency to the agg_sort.js FSM workload. As stated, it sounds like a violation of causal consistency in the server, which would be bad.

Is it because documents which became unowned after a chunk migration commit are being quickly deleted from the secondary due to the much lower orphanCleanupDelaySecs setting of 1 second in our testing infrastructure? If so, then this sounds related to SERVER-30984. I could also imagine the increase in frequency of chunk migration commits during the agg_sort.js FSM workload is a byproduct of SERVER-43099. CC jordi.serra-torrens@mongodb.com

Comment by Adi Agrawal [ 18/May/23 ]

Requesting a backport to v7.0 and v6.3, since we regularly see failures on those variants.
