[SERVER-46051] Change stream sys-perf regression due to SERVER-45691 Created: 08/Feb/20  Updated: 27/Oct/23  Resolved: 14/Apr/20

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Bernard Gorman Assignee: Bernard Gorman
Resolution: Gone away Votes: 0
Labels: KP42, qexec-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: SERVER-45691-regression.png, SERVER-45691-throughput.png
Issue Links:
Depends:
    depends on SERVER-46821 NetworkingBaton should pass tasks to ... (Closed)
Problem/Incident:
    is caused by SERVER-45691 Change Mutex::LockListeners to use a ... (Closed)
Sprint: Query 2020-02-24, Query 2020-03-09, Query 2020-03-23, Query 2020-04-06, Query 2020-04-20
Participants:
Linked BF Score: 0

 Description   

While running a performance test for the fix to SERVER-44356, I discovered that a far more significant regression had been introduced by SERVER-45691, which changed Mutex::LockListeners to use a vector.

Attached is a graph of the regression; the Evergreen task can be found here.
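To illustrate why a change to Mutex::LockListeners can be performance-sensitive, here is a deliberately generic sketch (not the actual server code; the class and listener interface below are hypothetical) of a vector-backed listener registry that is walked on the contended-lock path, which is the kind of structure SERVER-45691 touched:

{code:cpp}
// Generic illustration only -- not the MongoDB Mutex/LockListener implementation.
#include <iostream>
#include <mutex>
#include <string>
#include <vector>

struct LockListener {
    virtual void onContended(const std::string& name) = 0;
    virtual ~LockListener() = default;
};

class ListeningMutex {
public:
    explicit ListeningMutex(std::string name) : _name(std::move(name)) {}

    // Listeners live in a std::vector and are notified on every contended
    // acquisition, so the cost of walking the vector lands on the lock hot path.
    static void addListener(LockListener* listener) {
        listeners().push_back(listener);
    }

    void lock() {
        if (!_mutex.try_lock()) {
            for (auto* listener : listeners())
                listener->onContended(_name);
            _mutex.lock();
        }
    }

    void unlock() { _mutex.unlock(); }

private:
    static std::vector<LockListener*>& listeners() {
        static std::vector<LockListener*> registry;  // the vector-backed registry
        return registry;
    }

    std::string _name;
    std::mutex _mutex;
};

struct PrintingListener : LockListener {
    void onContended(const std::string& name) override {
        std::cout << "contended lock: " << name << "\n";
    }
};

int main() {
    PrintingListener printer;
    ListeningMutex::addListener(&printer);

    ListeningMutex mtx("example");
    mtx.lock();    // uncontended here, so no listener callbacks fire
    mtx.unlock();
    return 0;
}
{code}

Whether the listener container is a vector or some other structure changes only how that per-acquisition walk behaves, which is why a seemingly small data-structure change can move latency-sensitive numbers like these.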



 Comments   
Comment by David Daly [ 11/Feb/20 ]

Note that we also opened BF-16171, as it represents a regression that was still present after BF-15717 was fixed. It looks like a separate regression went in during that period and was hidden by BF-15717.

Comment by David Daly [ 11/Feb/20 ]

The major regression was almost surely BF-15717. SERVER-45691 was supposed to be the fix for it. That ticket made a lot of things slower. 

Comment by Bernard Gorman [ 11/Feb/20 ]

david.daly: I assumed that no additional BF had been generated because a regression in these tests had already been registered for SERVER-44356, and that ticket was still in progress. I'd say it would be better to open a new BF and link it to this ticket, since the cause has (broadly speaking) already been identified; we can use this ticket for further discussion.

ben.caimano: since you worked on SERVER-45691, do you have any initial insight into what might have caused this? For context, the graphs above show the average/maximum time (in seconds) between the moment a write is performed and the moment the corresponding event is retrieved by a change stream. The blue line shows the latency for 60 simultaneous clients performing a mix of inserts, updates, deletes and queries. Each test case runs a different number of streams alongside these workload clients, so 1_1c means one stream reading changes on one collection, 15_1c means 15 streams reading changes on one collection, and so on.
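As a concrete (if simplified) illustration of that metric, the sketch below uses the mongocxx driver to time the gap between an insert and the retrieval of its change stream event. It is not the sys-perf workload itself, and it assumes a replica set is reachable at the URI shown, since change streams require one:

{code:cpp}
// Simplified write-to-event latency probe -- not the sys-perf workload itself.
#include <chrono>
#include <iostream>

#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <bsoncxx/json.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

int main() {
    using namespace std::chrono;
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    mongocxx::instance inst{};  // one driver instance per process
    mongocxx::client client{
        mongocxx::uri{"mongodb://localhost:27017/?replicaSet=rs0"}};  // assumed URI
    auto coll = client["test"]["cs_latency"];

    // Open the change stream before writing so the insert's event is captured.
    auto stream = coll.watch();

    const auto writeTime = steady_clock::now();
    coll.insert_one(make_document(kvp("k", 1)));

    // Poll the stream until the event arrives, then report the elapsed time
    // between performing the write and retrieving the corresponding event.
    while (true) {
        for (const auto& event : stream) {
            const auto latency =
                duration_cast<milliseconds>(steady_clock::now() - writeTime);
            std::cout << "write-to-event latency: " << latency.count() << " ms\n"
                      << bsoncxx::to_json(event) << std::endl;
            return 0;
        }
    }
}
{code}

The sys-perf tests run this kind of measurement with 60 workload clients and varying numbers of concurrent streams, and report the average and maximum latency per test case.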

There is also a set of corresponding throughput tests; I've attached a screenshot of the current graph. These measure the number of operations that the workload threads were able to drive to the cluster given the number of active change streams. As you can see, there was a major regression at some point between December 12th and 17th; I'm currently bisecting to find the culprit. Your recent change in SERVER-45691 appears to have claimed back some of that regression. It would make sense for the resulting higher write throughput to cause the increased change stream latency we see in the former tests, except that throughput was even higher from August to December, during which time change stream latency was also much lower. So it seems that, at present, we have increased latency despite a reduction in throughput.
