[SERVER-78528] Background threads in mongos cause mongos_large_catalog_workloads test case to measure performance incorrectly Created: 28/Jun/23 Updated: 10/Jul/23 Resolved: 10/Jul/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Abdul Qadeer | Assignee: | Backlog - Security Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Server Security |
| Operating System: | ALL |
| Participants: | |
| Linked BF Score: | 5 |
| Description |
|
Analysis of BF-28472 showed that the use of transactions in the ClusterServerParameterRefresher thread and the AuditSyncJob thread, introduced as part of SERVER-74107, causes timing issues when measuring performance in mongos_large_catalog_workloads, specifically for the 1000_colls_fresh_mongos test case. It appears that the scheduling of these threads affects the latency of some operations enough to register as a regression. A couple of patches (patch1, patch2) were run to confirm this, with results in the attached screenshot. One possible fix is to disable these threads when measuring performance. |
| Comments |
| Comment by Gabriel Marks [ 10/Jul/23 ] |
|
Closing this as a duplicate because these two tickets are pointing at the same issue. |
| Comment by Enrico Golfieri [ 07/Jul/23 ] |
|
The proper way to fix this issue would be to disable the clusterParameterRefresh. However, we only have one failpoint to do so, and it is only implemented in master. The workload project is independent of the version of mongod running, so triggering the failpoint would fail for versions <=7.0. I am thinking the only way to properly fix this would be to implement an outlier detector. The 60% regression detected in BF-28894 is on the average of 5 runs. Looking at the logs, the actual issue is only in 1 run (the first), with a 400% regression, followed by 4 runs at around the same speed and in line with pre-regression runs. This is probably due to overlap with the clusterParameterRefresh, which before the revert used to serialise with the refresh for the fresh mongos. However, excluding outliers might in future prevent us from spotting real issues, which should never happen. I would be curious to know what Performance thinks about that. Another option is to simply accept the result. It depends on what the result means and what we want to measure. In a real scenario the background process will be enabled and might overlap with the refresh at any time, so the average is actually correct. In general, we are talking about a 2ms slower refresh over 100k collections, which is still a negligible drop in performance. The refresh is still very fast. From the sharding side I would personally close the ticket as "won't fix"; however, I will assign it back to Security so they can evaluate whether it is worth reintroducing the failpoint for 7.0. |
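
To make the outlier-detector idea above concrete, here is a minimal sketch of how a single slow first run can dominate a 5-run average, and how a median-based check would flag it. The latency values are hypothetical and are not taken from BF-28894; the function name `flag_outliers` and the threshold of 3.5 on the modified z-score are assumptions chosen for illustration, not part of any existing performance tooling.

```python
from statistics import mean, median


def flag_outliers(samples, threshold=3.5):
    """Return indices of samples whose modified z-score exceeds threshold.

    Uses the median absolute deviation (MAD), which is far less sensitive
    to a single extreme run than the mean and standard deviation.
    """
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    if mad == 0:
        # All typical runs are identical: anything different is an outlier.
        return [i for i, x in enumerate(samples) if x != med]
    return [
        i for i, x in enumerate(samples)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical latencies (ms) for 5 runs: one slow first run, four normal runs.
runs = [50, 10, 11, 9, 10]

outliers = flag_outliers(runs)
kept = [x for i, x in enumerate(runs) if i not in outliers]

print("mean of all runs:", mean(runs))
print("median of all runs:", median(runs))
print("outlier indices:", outliers)
print("mean without outliers:", mean(kept))
```

With these made-up numbers the mean is pulled well above the typical run while the median is not. This does not settle the concern raised in the comment: a detector like this would also hide a genuine regression that only affects the first run, so it would probably need to report flagged runs rather than silently drop them.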