[SERVER-78528] Background threads in mongos cause mongos_large_catalog_workloads test case to measure performance incorrectly Created: 28/Jun/23  Updated: 10/Jul/23  Resolved: 10/Jul/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Abdul Qadeer Assignee: Backlog - Security Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screenshot 2023-06-28 at 1.47.20 PM.png    
Issue Links:
Duplicate
duplicates SERVER-77642 Fix fresh_mongos case of mongos_large... Closed
Related
related to SERVER-77642 Fix fresh_mongos case of mongos_large... Closed
Assigned Teams:
Server Security
Operating System: ALL
Participants:
Linked BF Score: 5

 Description   

It was observed from analysis of BF-28472 that the introduction of transactions in the ClusterServerParameterRefresher thread and the AuditSyncJob thread as part of SERVER-74107 causes timing issues when measuring performance in mongos_large_catalog_workloads, specifically in the 1000_colls_fresh_mongos test case. It appears that the scheduling of these threads affects the latency of some operations enough to cause a regression. A couple of patches (patch1, patch2) were run to confirm this; the results are in the attached screenshot.

One possible fix is to disable these threads from running when measuring performance.
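
A minimal sketch of what that could look like, assuming a mongos started with --setParameter enableTestCommands=1 and a failpoint that skips the cluster parameter refresh (the failpoint name used here, skipClusterParameterRefresh, is an assumption; see the discussion of failpoint availability in the comments below):

{code:python}
from pymongo import MongoClient

# Connect to the mongos under test. configureFailPoint is a test-only command,
# so the mongos must have been started with enableTestCommands=1.
client = MongoClient("mongodb://localhost:27017")

# Hypothetical failpoint name: enabling it would keep the
# ClusterServerParameterRefresher from issuing its periodic transactions
# while the workload measures refresh latency.
client.admin.command({
    "configureFailPoint": "skipClusterParameterRefresh",
    "mode": "alwaysOn",
})

# ... run the 1000_colls_fresh_mongos measurement here ...

# Restore normal behaviour once the measurement is done.
client.admin.command({
    "configureFailPoint": "skipClusterParameterRefresh",
    "mode": "off",
})
{code}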



 Comments   
Comment by Gabriel Marks [ 10/Jul/23 ]

Closing this as a duplicate because these two tickets are pointing at the same issue.

Comment by Enrico Golfieri [ 07/Jul/23 ]

The proper way to fix this issue would be to disable the clusterParameterRefresh. However, we only have one failpoint to do so, and it is only implemented in master. The workload project is independent of the version of mongod running, meaning that triggering the failpoint would fail for versions <= 7.0.
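
A version gate in the workload would be one way to work around that limitation; a rough sketch, assuming the same hypothetical skipClusterParameterRefresh failpoint and using buildInfo to detect the binary version:

{code:python}
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# buildInfo reports the binary version; only newer binaries carry the
# failpoint, so skip the configureFailPoint call on 7.0 and older.
major, minor = (int(p) for p in client.server_info()["version"].split(".")[:2])
if (major, minor) > (7, 0):
    client.admin.command({
        "configureFailPoint": "skipClusterParameterRefresh",  # assumed name
        "mode": "alwaysOn",
    })
# On <= 7.0 the failpoint does not exist, so the workload has to run with the
# refresher enabled, which is exactly the situation described above.
{code}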

I am thinking the only way to properly fix this would be to implement an outlier detector. The 60% regression detected in BF-28894 is on the average of 5 runs. Looking at the logs, the actual issue is only in 1 run (the first), which shows a 400% regression, followed by 4 runs at roughly the same speed and in line with pre-regression runs. This is probably due to overlap with the clusterParameterRefresh, which before the revert used to serialise with the refresh for the fresh mongos. However, excluding outliers might in the future prevent us from spotting real issues, which should never happen. I would be curious to know what the Performance team thinks about that.
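
For illustration, a sketch of the kind of outlier filtering suggested here, applied to the per-run latencies of the 5 runs (the threshold and the numbers are made up):

{code:python}
from statistics import median

def drop_outlier_runs(latencies_ms, threshold=2.0):
    """Exclude runs whose latency exceeds `threshold` times the median run.

    A single first-run spike (e.g. ~400% of steady state) is filtered out
    before averaging, while a regression that affects every run would still
    shift the median and therefore the reported average.
    """
    m = median(latencies_ms)
    return [x for x in latencies_ms if x <= threshold * m]

# Example: one ~400% outlier (the first run) followed by four steady runs.
runs = [40.0, 10.0, 10.2, 9.8, 10.1]
kept = drop_outlier_runs(runs)
print(sum(kept) / len(kept))   # average over the non-outlier runs only
{code}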

Another option is to simply accept the result. It depends on what the result means and what we want to measure. In a real scenario the background process will be enabled and might overlap with the refresh at any time, so the average is actually correct.

In general, we are talking about a refresh that is 2ms slower over 100k collections, which is still a negligible drop in performance. The refresh remains very fast. From the Sharding side I would personally close the ticket as "Won't Fix"; however, I will assign it back to Security so they can evaluate whether it is worth reintroducing the failpoint for 7.0.
