[SERVER-77642] Fix fresh_mongos case of mongos_large_catalog_workloads Created: 31/May/23  Updated: 02/Oct/23  Resolved: 02/Oct/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Gabriel Marks
Assignee: Enrico Golfieri
Resolution: Won't Fix
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-78528 Background threads in mongos cause mo... Closed
Related
is related to SERVER-78528 Background threads in mongos cause mo... Closed
Assigned Teams:
Server Security
Operating System: ALL
Sprint: Sharding EMEA 2023-06-26, Sharding EMEA 2023-07-24, Sharding EMEA 2023-08-07, Sharding EMEA 2023-08-21, Sharding EMEA 2023-10-16
Participants:
Linked BF Score: 5
Story Points: 2

 Description   

As the fresh_mongos case does not wait for stable performance, this test can be heavily thrown off whenever new startup tasks are added (see linked BF for an example of a change in a startup task causing a regression due to timing overlap with this test). Having a wider tolerance for such cases, reworking this case, or removing it could all be plausible solutions.
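One way to read the "wider tolerance" option is as a regression gate that only fires when a slowdown is large both in relative and in absolute terms. The following is a minimal sketch under that assumption, not the actual logic of the perf tooling; the function name, parameters, and threshold values are hypothetical and chosen only for illustration.

```python
def is_regression(baseline_ms, current_ms, rel_tolerance=0.5, abs_floor_ms=5.0):
    """Return True only if the slowdown exceeds BOTH thresholds.

    All names and default values here are hypothetical, for illustration only.
    rel_tolerance: allowed relative slowdown (0.5 means 50% slower than baseline).
    abs_floor_ms:  slowdowns of this many milliseconds or fewer are ignored.
    """
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    delta_ms = current_ms - baseline_ms      # positive means slower
    rel_change = delta_ms / baseline_ms      # slowdown as a fraction of baseline
    return rel_change > rel_tolerance and delta_ms > abs_floor_ms


# A large relative change that is tiny in absolute terms is not flagged.
print(is_regression(3.0, 5.0))
# A change that is large in both senses is flagged.
print(is_regression(10.0, 20.0))
```

Requiring both conditions keeps a noisy, short-running case from raising a failure over a difference of a couple of milliseconds, while still catching slowdowns that are large in both senses.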



 Comments   
Comment by Enrico Golfieri [ 02/Oct/23 ]

Closing the ticket: after an offline discussion, we decided that the related BF-28894 won't be fixed. More information is in the BF's comments. To summarize:

  • SERVER-74107 made the mongos_large_catalog test more stable and improved its speed by 38%.
  • The only reason we experienced a drop in performance was the revert of that ticket on 7.0.
  • The loss/gain in performance is still negligible in absolute terms (only 2 ms).
Comment by Adam Rayner [ 29/Sep/23 ]

Hi sergi.mateo-bellido@mongodb.com and enrico.golfieri@mongodb.com, do you think we can close this ticket, or assign to backlog-server-sharding-emea for further investigation? If I understand correctly:

  1. gabriel.marks@mongodb.com introduced a change that caused a perf improvement of 38%, and then reverted this change which caused a 60% performance "regression" in the same perf test due to the absence of the introduced change
  2. We aren't conclusively sure at this point what caused the improvement, nor what caused the -22% net performance regression when it was reverted, but the code in question was not included in 7.0 in the first place
  3. The perf test is stable and at (presumably) the previous baseline established before gabriel.marks@mongodb.com's change

Given these points, if the continued discussion is now about whether we can recover some of the 38% improvement that we gained by introducing the patch, it seems likely that some additional work will be required, per enrico.golfieri@mongodb.com's recent comment. I think it makes sense to have Sharding EMEA / Catalog and Routing take over any such investigation going forward. Any further discussion is likely to be speculative without additional engineering work, IMO, and I think our perspective aligns with sergi.mateo-bellido@mongodb.com's comment as well:

From our side, going back to the previous values, as if SERVER-74107 had never been added to 7.0, is not a problem.

Happy to discuss further of course, thanks!

Comment by Sergi Mateo Bellido [ 24/Jul/23 ]

Summary

It looks like the refresh of the setClusterParameter is not the reason for the drop. Overall, this benchmark was stable until the introduction of SERVER-74107, which added a performance improvement. For some reason, the same ticket was reverted in 7.0, which in this case introduced a performance degradation that was just a return to the previous performance results (i.e. without SERVER-74107).

I was wondering whether this specific early exit in the synchronize function would be the reason for the performance improvement, and for the performance degradation when it is missing.

From our side, going back to the previous values, as if SERVER-74107 had never been added to 7.0, is not a problem.

Passing it to security and cc: gabriel.marks@mongodb.com  to verify our hypothesis.

Comment by Gabriel Marks [ 10/Jul/23 ]

enrico.golfieri@mongodb.com, on our side, we are not going to reintroduce the failpoint for 7.0. I hesitate to close the ticket as won't fix, however, because the core issue, of the test producing inconsistent results that don't necessarily indicate a bug, is not fixed. I would say that if sharding is not planning to rework this test, it would be ideal if the test was flagged so that it doesn't produce BFs in the future. Reassigning to you to evaluate what you want to do with the test.

Comment by Gabriel Marks [ 30/Jun/23 ]

lamont.nelson@mongodb.com I think we already have "stable_mongos" tests. This case is supposed to be explicitly for a fresh mongos, which is why it produces inconsistent results.

Comment by Lamont Nelson [ 30/Jun/23 ]

Maybe a warmup/cooldown period would help?

Generated at Thu Feb 08 06:36:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.