[SERVER-34065] MongoS read throughput regression between 3.4 and 3.6 Created: 22/Mar/18 Updated: 07/Jan/21 Resolved: 07/Jan/21
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.6.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Kaloian Manassiev |
| Resolution: | Won't Do | Votes: | 1 |
| Labels: | sharding-causes-bfs-hard |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | read_only_perf_test.js |
| Operating System: | ALL |
| Linked BF Score: | 0 |
| Description |
|
Using a single shard and a single mongos shows a ~17% read throughput decrease and a ~20% read latency increase between 3.4.13 and 3.6.3. The attached read_only_perf_test.js demonstrates the regression.
If only the mongos is downgraded to 3.4.13, with the shard still at 3.6.3, performance returns to normal, so the problem must be in mongos. The same problem is visible between 3.4.13 and master (3.7.3 at the time of writing).
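The attached script is not reproduced in this export. As a rough illustration only, a minimal mongo shell sketch of this kind of read-only workload (database name, document count, and duration are assumptions, not taken from the actual test) might look like:

```javascript
// Hypothetical sketch in the spirit of read_only_perf_test.js; run against a
// mongos fronting a single shard. Not the actual attached script.
var coll = db.getSiblingDB("perftest").coll;
coll.drop();
for (var i = 0; i < 1000; i++) {
    coll.insert({_id: i, payload: "x"});
}

// Issue point reads for 10 seconds and report throughput.
var start = Date.now();
var ops = 0;
while (Date.now() - start < 10 * 1000) {
    coll.findOne({_id: Math.floor(Math.random() * 1000)});
    ops++;
}
print("ops/sec: " + (ops * 1000 / (Date.now() - start)));
```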
|
| Comments |
| Comment by Kaloian Manassiev [ 07/Jan/21 ] |
|
Given the time that has passed since 3.4/3.6, this ticket is now more a matter of improving the performance of MongoS than of fixing a specific bug which caused a regression. Since we are working on laying out a set of projects to improve the performance of MongoS, I am closing this ticket as Won't Do.
| Comment by Sheeri Cabral (Inactive) [ 30/Jan/20 ] |
|
ratika.gandhi, we are adding this to quick wins; we can prioritize at the next sync.
| Comment by Charlie Swanson [ 25/Jan/19 ] |
|
My above theory about compression protocols did not pan out. This is actually fascinating. I spoke with esha.maharishi a bit to generate some hypotheses, and we found that 3.5.2 performed much worse than the tip of 3.6. After discovering that, I was able to do a manual git bisect and pin the regression down to this commit. Running locally, I see about 70,000 ops/second before that commit; after it, throughput drops to about 17,000 ops/second. I don't really understand what's going on in that commit or how it could produce this regression, so I'm sending this over to Kal for a look. Here's a spreadsheet I generated during my git bisect that might help. As you can see there, we eventually recovered much of the performance after that commit, so I'm not sure whether this was something that was later reverted.
| Comment by Charlie Swanson [ 18/Jan/19 ] |
|
This is actually interesting. I don't see any indication that the cursor establishment logic changed between 3.4 and 3.6 - on both 3.4 and 3.6 the first command comes back with a non-empty "firstBatch" and a cursor id of 0. However, on 3.6 I do see an extra exchange when I set the networking log level to 3:
It looks like the negotiation of the compression protocol may be causing an extra round trip. As an experiment, I'll see if I can disable it and show a performance improvement.
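For reference, the log setting mentioned above can be applied from the mongo shell with the standard setLogLevel helper; wire compression itself is a startup option, not a runtime command:

```javascript
// Raise the networking component's log verbosity to 3 on the node we are
// connected to, matching the log level referenced in the comment above.
db.setLogLevel(3, "network");

// Equivalent form using the underlying setParameter command:
db.adminCommand({setParameter: 1, logComponentVerbosity: {network: {verbosity: 3}}});

// To test the compression-negotiation hypothesis, message compression can be
// disabled when starting the process (command-line option, shown as a comment):
//   mongos --networkMessageCompressors disabled
```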
| Comment by Kaloian Manassiev [ 24/Sep/18 ] |
|
Given that this is in the find path, and given Esha's explanation, I think it is most appropriate for this to be handled by the query team. david.storch, let me know if you disagree or if you need some other support from the sharding team.
| Comment by Esha Maharishi (Inactive) [ 21/Sep/18 ] |
|
It would need more investigation, but what I was thinking is that, as of 3.6, executing a find on a router has two phases: first establishing cursors on the targeted shards, and then retrieving the results through the AsyncResultsMerger (ARM).
So, returning results for a find might require two round trips to the shard. I don't remember:
1) whether the establish cursors phase uses a batchSize of 0
2) if not, whether the results from the establish cursors phase are put into the ClusterClientCursor's stash so that they are returned immediately, or alternatively whether the ARM returns results immediately if some remotes have already returned results
If the regression is only on a sharded cluster, it's probably something to do with the changes to the find path in 3.6. (See the sketch below for the two-phase shape.)
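As an illustration of the two phases described above: a find with batchSize 0 only establishes a cursor, and a separate getMore round trip is then needed to fetch any documents (the collection name here is an assumption):

```javascript
// Phase 1: establish a cursor without fetching documents. With batchSize: 0
// the reply contains an empty firstBatch and a live cursor id.
var res = db.runCommand({find: "coll", batchSize: 0});
printjson(res.cursor.firstBatch);  // []

// Phase 2: an additional getMore round trip retrieves the actual results.
var more = db.runCommand({getMore: res.cursor.id, collection: "coll"});
printjson(more.cursor.nextBatch);
```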
| Comment by Gregory McKeon (Inactive) [ 21/Sep/18 ] |
|
esha.maharishi, kaloian.manassiev, can you clarify your comment here? Do we have a server ticket for the actual fix for the regression?
| Comment by Kaloian Manassiev [ 13/Apr/18 ] |
|
According to Esha, this might be caused by findOne doing an extra round trip.
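For context, the findOne() shell helper is shorthand for a single-batch find with limit 1. A sketch of how one might compare the two forms while watching the mongos command log (the collection name is an assumption):

```javascript
// findOne() helper: issues a find with limit 1 and single-batch semantics.
db.coll.findOne({_id: 1});

// Roughly equivalent explicit command. Comparing the mongos logs for these
// two forms (with command logging verbosity raised) is one way to check for
// an extra round trip.
db.runCommand({find: "coll", filter: {_id: 1}, limit: 1, singleBatch: true});
```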