david.daly: I assumed no additional BF was generated because a regression in these tests had already been registered under SERVER-44356, and that ticket was still in progress. I'd say it would be better to open a new BF and link it to this ticket, since the cause (broadly speaking) has already been identified; we can use this ticket for further discussion.
ben.caimano: Since you worked on SERVER-45691, do you have any initial insight into what might have caused this? For context, the graphs above show the average and max time (in seconds) between when a write is performed and when the corresponding event is retrieved by a change stream. The blue line shows the latency for 60 simultaneous clients performing a mix of inserts, updates, deletes and queries. Each test case runs a different number of change streams alongside these workload clients: 1_1c means one stream reading changes on one collection, 15_1c means 15 streams reading changes on one collection, and so on.
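For anyone unfamiliar with these tests, here is a minimal sketch of the measurement (this is not the actual test harness; the connection string, namespace, and write rate are placeholders): the writer embeds its wall-clock timestamp in each document, and the change stream reader subtracts that timestamp from the time it sees the corresponding event.

```python
# Minimal sketch of the latency measurement (hypothetical harness, not the
# real sys-perf workload). Requires a replica set, since change streams are
# not available on a standalone mongod.
import threading
import time

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client["test"]["cs_latency"]                # placeholder namespace

def writer(n=100):
    # Embed the wall-clock write time in each document.
    for _ in range(n):
        coll.insert_one({"written_at": time.time()})
        time.sleep(0.1)

threading.Thread(target=writer, daemon=True).start()

# Reader (the "1_1c" shape: one stream on one collection). Latency is the
# time the event is retrieved minus the time the write was issued.
with coll.watch() as stream:
    for change in stream:
        if change["operationType"] == "insert":
            latency = time.time() - change["fullDocument"]["written_at"]
            print(f"change stream latency: {latency:.3f}s")
```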
There is also a set of corresponding throughput tests; I've attached a screenshot of the current graph. These measure the number of operations the workload threads were able to drive to the cluster given the number of active change streams. As you can see, there was a major regression at some point between December 12th and 17th; I'm currently bisecting to find the culprit. Your recent change in SERVER-45691 appears to have clawed back some of that regression. It would make sense for that higher write throughput to produce the increased change stream latency we see in the former tests, except that throughput was even higher from August to December, during which time change stream latency was also much lower. So at present we appear to have increased latency despite reduced throughput.