[SERVER-60680] mongos CPU 100% even with no traffic Created: 13/Oct/21 Updated: 22/Jun/22 Resolved: 27/Jan/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Daniele Tessaro | Assignee: | Edwin Zhou |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | mongos | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
We have a big cluster (>7TB of data) hosting a sharded collection. The collection was created as sharded, but the cluster initially had only 1 shard (call it shard A) and 3 mongos. Today we added shard B: we disabled the balancer, added the shard to the cluster, defined 2 zones that fully covered the key space, then re-enabled the balancer. The zones were defined so that only a few chunks should have been migrated to shard B (as an initial test).

From the moment the balancer was enabled, CPU on the 3 mongos spiked to 200% (100% on both CPUs), causing major disruption to the incoming traffic. The balancer slowly moved chunks from shard A to shard B, at a rate of a few seconds per chunk. We thought the balancing operation was the issue, so we disabled the balancer, but nothing changed. We then redefined the zones so that all chunks would be migrated back to shard A and re-enabled the balancer, because we thought the presence of data on multiple shards was the culprit. This made no difference: even when all chunks had been migrated back to shard A, CPU on the mongos was still at 200%.

At this point we restarted the mongos, and CPU went back to 0. After that we could resume serving traffic normally from the mongos, with only shard A. This procedure was repeated twice, with the same results; the second time it was performed on an upgraded cluster where the machines had 4 CPUs.

I'm attaching some logs taken on a mongos while CPU was at 400% and no traffic was incoming: a sample of the current ops on the machine, a mongostat report, and a few lines of mongos.log.
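For reference, here is a minimal pymongo sketch of the kind of zone setup described above. The shard names, zone names, namespace, shard-key field, and cutoff value are assumptions, not details taken from the cluster in question.

```python
# A minimal sketch of defining two zones that fully cover the key space,
# assigning only the range below a cutoff to the new shard B.
from pymongo import MongoClient
from bson.min_key import MinKey
from bson.max_key import MaxKey

client = MongoClient("mongodb://mongos-host:27017")  # placeholder mongos address
admin = client.admin

admin.command("balancerStop")  # disable the balancer before changing zones

admin.command("addShardToZone", "shardA", zone="zoneA")
admin.command("addShardToZone", "shardB", zone="zoneB")

# Two zones that together cover the whole key space; only the small range
# below CUTOFF is assigned to the new shard B as an initial test.
CUTOFF = 1000  # hypothetical shard-key cutoff
admin.command("updateZoneKeyRange", "mydb.mycoll",
              min={"shard_key": MinKey()}, max={"shard_key": CUTOFF}, zone="zoneB")
admin.command("updateZoneKeyRange", "mydb.mycoll",
              min={"shard_key": CUTOFF}, max={"shard_key": MaxKey()}, zone="zoneA")

admin.command("balancerStart")  # re-enable the balancer
```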
|
| Comments |
| Comment by Edwin Zhou [ 27/Jan/22 ] |
|
After some consideration, I believe we don't have enough information to conclude that there is an improvement to be made to getShardIdsForRange. As we're no longer in a position to collect this data, I'm going to close this ticket as Incomplete. I greatly appreciate your cooperation in this investigation, and we're always interested in collecting stack traces during these regressions. Best,
| Comment by Daniele Tessaro [ 24/Nov/21 ] |
|
Hi Edwin. You also have to consider that in our scenario the cluster was receiving a heavy load (in the hundreds per second, I think) of bulk_write batches (each with possibly a thousand update_one operations) like the one described, and at the same time an even heavier load of find queries, all of them scanning through 600k chunks (since we basically had only 1 chunk in zone B). I don't know the internals of a range query vs. an equality query, but it is possible that the equality query has some inherent overhead that nonetheless scales better than the cost of getShardIdsForRange() when many operations are performed at the same time. I would also argue that since https://jira.mongodb.org/browse/SERVER-46487 was considered a performance issue worth fixing, the problem I reported should be considered in the same way, because in my scenario the number of chunks to scan is only 1 less than in the scenario reported in that Jira. And the way we were trying to use MongoDB sharding is nothing too fancy, but basically the one described here: https://docs.mongodb.com/manual/tutorial/sharding-tiered-hardware-for-varying-slas/
Regards, Daniele
| Comment by Edwin Zhou [ 24/Nov/21 ] |
|
Hi daniele@spiketrap.io,
In my repro, I construct the following zones:
and I query using:
This query intends to target zone 2, and following your code analysis, I would expect getShardIdsForRange() to scan 89k chunks in zone 2 before returning zone 2 for the 90010 updates. This would suggest that each update in the range query carries some overhead that shouldn't appear in the equality query. However, the data appears to contradict this, since the equality query workload takes longer than the range query. It's possible that the repro doesn't have enough chunks to reveal this regression, so I'll continue investigating on this front.
I would consider that ballpark to be large, and it could be a contributing factor to the regression we saw. Best,
| Comment by Daniele Tessaro [ 24/Nov/21 ] |
|
Hi Edwin, first of all thanks for looking more into it. Unfortunately I won't be able to provide more logs/support around the issue, because we moved away from the sharding solution and I'm not able to replicate the problem. We could work around this particular issue the way I described (and it worked pretty well for our case, although not perfectly), but we had other problems with the extreme slowness of moving chunks, so we went back to a monolith approach for now. That being said, I still think the performance issue is there, and the analysis of the code gives a pretty clear explanation of why it is happening. One reason why you might not be able to reproduce it is that your chunk distribution might not be the one that produces the performance issue. The issue happens when getShardIdsForRange() has to scan through a long list of chunks. If, while scanning chunks, all shards are found immediately, the algorithm exits the loop right away. This means that, even if you have 1M chunks to scan, if chunks from shard B and chunks from shard A are interleaved, the algorithm is very likely to have to look at only a small portion of them. So, basically, if you want to reproduce the issue, this is what you should do:
Now the getShardIdsForRange() function will have to scan 999,999 chunks before returning "ShardA" as the single shard to look at. And if a single bulk update is performing, say, 1000 of these updates, we're talking about 1000 × 999,999 chunks to scan for a single bulk batch.
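To illustrate the point, here is a toy Python model of the early-exit behaviour described above. It is not the actual ChunkManager/getShardIdsForRange() code, just a sketch of why the chunk-ownership layout matters for the cost of a range scan.

```python
# Toy model: scan the chunks overlapping a query range, stopping early once
# every shard in the cluster has already been seen.
def shard_ids_for_range(chunk_owners, all_shards):
    """chunk_owners: owning shard name for each chunk overlapping the query
    range, in shard-key order. Returns (shards_found, chunks_scanned)."""
    found = set()
    scanned = 0
    for owner in chunk_owners:
        found.add(owner)
        scanned += 1
        if len(found) == len(all_shards):  # early exit: every shard already found
            break
    return found, scanned

all_shards = {"shardA", "shardB"}

# Interleaved ownership: both shards show up almost immediately, so the scan
# stops after a couple of chunks even though the range covers a million.
interleaved = ["shardA", "shardB"] * 500_000
print(shard_ids_for_range(interleaved, all_shards))  # ({'shardA', 'shardB'}, 2)

# Contiguous ownership (the scenario above): 999,999 chunks all on shardA, so
# the loop can never stop early and scans every chunk in the range.
contiguous = ["shardA"] * 999_999
print(shard_ids_for_range(contiguous, all_shards))   # ({'shardA'}, 999999)
```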
To answer your question about the size of the `$in` query, I don't have an exact answer, but that can easily be in the 1000 ballpark.
| Comment by Edwin Zhou [ 23/Nov/21 ] |
|
Thank you for the correction. I adjusted the workload and changed the chunk size from 64MB to 1MB to simulate approx. 1 million chunks.
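For context, a hedged pymongo sketch of how the chunk size can be lowered to 1MB on a test cluster; this mirrors the documented approach of updating the chunksize document in config.settings (the mongos address is a placeholder).

```python
# Lower the cluster-wide chunk size to 1MB; run against a mongos.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # placeholder mongos address
client.config.settings.update_one(
    {"_id": "chunksize"},
    {"$set": {"value": 1}},  # chunk size in MB (default is 64MB on 4.4)
    upsert=True,
)
```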
Each document is 533KB and has the following structure:
I perform two update jobs:
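The exact job definitions aren't included here; purely as an illustration, a hypothetical pymongo version of a range-match versus an equality-match update job might look like the following (the namespace, shard-key field, secondary field, cutoff, and update count are assumptions, not the actual repro workload).

```python
# Hypothetical sketch of the two update jobs being compared.
from pymongo import MongoClient

coll = MongoClient("mongodb://mongos-host:27017").mydb.mycoll  # placeholder mongos
CUTOFF = 10_000
N_UPDATES = 90_010

# Range-match job: every filter constrains the shard key with a range, so the
# mongos has to work out which shards own chunks overlapping [CUTOFF, MaxKey)
# for each individual update.
for i in range(N_UPDATES):
    coll.update_one(
        {"shard_key": {"$gte": CUTOFF}, "doc_id": i},
        {"$set": {"touched": True}},
    )

# Equality-match job: every filter pins the shard key to a single value, which
# lets the mongos route each update to exactly one chunk (and shard).
for i in range(N_UPDATES):
    coll.update_one(
        {"shard_key": CUTOFF + i},
        {"$set": {"touched": True}},
    )
```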
The range match update job took 17 minutes and 12 seconds, and the equality match job took 33 minutes and 44 seconds, about twice as long. The range query looked at both shards and the equality query looked at one. At (A), the range match update begins; at (B), the range match update ends and the equality match update begins; at (C), the equality match update ends. One thing that stands out about your previous query is the use of a possibly large $in, which may be affected by SERVER-1007.
Can you confirm the number of terms that was used in the $in? Best,
| Comment by Daniele Tessaro [ 19/Nov/21 ] |
|
Hi Edwin. You mentioned you inserted 500k documents into the collection, but how many chunks did that generate? The complexity of the algorithm is linear in the number of chunks; the number of documents should be irrelevant. In my case we had 600k chunks and >9 billion documents.
| Comment by Edwin Zhou [ 19/Nov/21 ] |
|
After some testing, I haven't been able to reproduce a similar regression when querying and updating a range of documents in a zoned sharded collection with 500k documents; in fact, I observed the opposite. The details of this reproduction are as follows:
We'd need more information to further diagnose this issue. You mentioned that there were periods of time when you paused traffic and CPU usage was still high. If you're able to reproduce this on a separate cluster, can you provide timestamps for when traffic was paused? If you're able to attempt a repeat scenario, can you also provide diagnostic data for all nodes as well as mongod and mongos logs for all nodes? One thing that sticks out in reviewing your diagnostic data is that during the period of high CPU usage starting at (B), both queries and updates are being performed. When reads are high, writes are low from (B) to (C), and vice versa from (C) to (D). This activity is consistent with the CPU usage. Best,
| Comment by Daniele Tessaro [ 02/Nov/21 ] |
|
Hi Edwin. The queries that were breaking the mongos were bulk update operations. I don't have samples of those queries, but I can tell you that each batch contained many, potentially thousands of, operations of this form (sorry if the syntax is broken):
Some more info:
The way we fixed the operation was to change the bulk update to something like:
By looking at the ChunkManager code, it seems pretty straightforward that, in our case, it has to do a linear scan through the list of 600k chunks for each single bulk update operation, and that can definitely kill the process. Note:
But now we're doing
The first time we had the issue, both writes and reads were in that problematic form, and we fixed them both.
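To make the before/after shape concrete, here is a hedged pymongo reconstruction of the change described above; the field names, the cutoff, and the batch contents are illustrative, not the actual queries that were used.

```python
# Sketch of the problematic bulk update (range on the shard key) versus the
# workaround (equality on the shard key).
from pymongo import MongoClient, UpdateOne

coll = MongoClient("mongodb://mongos-host:27017").mydb.mycoll  # placeholder mongos
CUTOFF = 1000
docs = [{"_id": i, "shard_key": CUTOFF + i} for i in range(3)]  # stand-in for ~1000 ops per batch

# Before: every UpdateOne carries a broad shard-key range, so the mongos has to
# scan the chunk map (600k chunks in our case) to resolve the target shards for
# each operation in the batch.
before = [
    UpdateOne({"shard_key": {"$gte": CUTOFF}, "_id": d["_id"]},
              {"$set": {"processed": True}})
    for d in docs
]
coll.bulk_write(before, ordered=False)

# After: every UpdateOne matches the shard key by equality, so the mongos can
# route each operation to a single chunk/shard without the linear scan.
after = [
    UpdateOne({"shard_key": d["shard_key"], "_id": d["_id"]},
              {"$set": {"processed": True}})
    for d in docs
]
coll.bulk_write(after, ordered=False)
```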
| Comment by Edwin Zhou [ 01/Nov/21 ] |
|
After looking through the diagnostic data files, there isn't anything glaring that supports your hypothesis that the range queries are the culprit for the mongos being at 100% CPU utilization. However, I acknowledge that you were able to remediate this issue by optimizing your queries; I'm interested in understanding those changes and would like to attempt to reproduce this issue.
Best,
| Comment by Daniele Tessaro [ 01/Nov/21 ] |
|
Uploaded, thanks!
| Comment by Edwin Zhou [ 01/Nov/21 ] |
|
Thank you for uploading the additional data. I apologize for the back and forth. Can you also upload the diagnostic.data directory for your CSRS? Best,
| Comment by Daniele Tessaro [ 29/Oct/21 ] |
|
I attached the diagnostic.data from the other 2 mongos and the 6 mongod instances (3 replicas x 2 shards).
| Comment by Edwin Zhou [ 29/Oct/21 ] |
|
Thank you for uploading the diagnostic.data directory for one mongos. I neglected to ask for the mongos.log files covering the October 28 incident; can you attach those as well? Can you also attach the diagnostic.data directory and logs for each shard replica set and the remaining 2 mongos? Best,
| Comment by Daniele Tessaro [ 29/Oct/21 ] |
|
I have uploaded the diagnostic.data file.
| Comment by Daniele Tessaro [ 29/Oct/21 ] |
|
Thanks for your reply. A few more datapoints:

2. I believe the fact that CPU stayed at 100% even with no traffic was possibly due to old operations being enqueued in the mongos and stuck on that sharding resolution algorithm, if that makes sense. Basically, we've noticed that the CPU would go down either by restarting the mongos or by waiting an adequate amount of time (even 1 hr).

3. Coincidentally, a release with bad code yesterday caused the problem again. The code introduced a new surge of bulk updates similar to those described in point 1. CPU immediately spiked to 100%. We rolled back within 15 minutes. 1 mongos was restarted and was operative right away; for the 2 other mongos we just waited, and it took around 15 minutes for things to get back to normal. I think the longer you stay in the state where you're flooding the mongos with these bad range queries, the longer it takes for the mongos to go back to a normal state without a restart.

4. I can upload what I currently have in diagnostic.data, but I'm not sure whether it will contain info on the event we had yesterday (not sure how long it retains data). We cannot reproduce the issue because that would cause a disruption to our systems.

Honestly, the way I see this problem, and looking at the code for the mongos, it should not be too complicated to introduce an in-memory cache for the {chunk => shard} mapping that can be invalidated every time a chunk is moved to a different shard. Otherwise, as I said before, you could change the data structure that keeps the mapping so that contiguous chunks are merged into one. That would reduce the number of chunks enormously in many cases like ours.
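As a rough illustration of the contiguous-chunk-merging idea, here is a toy Python sketch; this is not how mongos actually stores its routing table, and the keys and structures are invented for the example.

```python
# Collapse adjacent chunk ranges owned by the same shard into one routing entry.
def merge_contiguous(chunks):
    """chunks: list of (min_key, max_key, shard) tuples sorted by min_key."""
    merged = []
    for lo, hi, shard in chunks:
        if merged and merged[-1][2] == shard and merged[-1][1] == lo:
            merged[-1] = (merged[-1][0], hi, shard)  # extend the previous range
        else:
            merged.append((lo, hi, shard))
    return merged

# 600k chunks split between two shards at a single cutoff collapse to 2 entries.
cutoff = 300_000
chunks = [(i, i + 1, "shardB" if i < cutoff else "shardA") for i in range(600_000)]
print(merge_contiguous(chunks))  # [(0, 300000, 'shardB'), (300000, 600000, 'shardA')]
```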
| Comment by Edwin Zhou [ 29/Oct/21 ] |
|
I appreciate your patience while I investigate this issue. Are you able to reliably reproduce this problem? From what I understand, you're seeing CPU usage at 100% on all mongos even while balancing is disabled and traffic is no longer being accepted. As you stated in your last comment, the possible culprit would be a range query which should target only one shard, i.e., a query with a range of (Cutoff, MaxKey] ends up scanning through multiple shards. We solve a similar scenario in
To help us understand the reason for the mongos' CPU usage and investigate your hypothesis, we will need to take a look at the $dbpath/diagnostic.data directory (the contents are described here). Would you please archive (tar or zip) the $dbpath/diagnostic.data directory and upload it to this upload portal? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Best,
| Comment by Daniele Tessaro [ 14/Oct/21 ] |
|
OK, I believe this Jira might be the culprit: https://jira.mongodb.org/browse/SERVER-46487. We're doing a lot of range queries with very large ranges. That is probably causing the mongos to scan through the list of chunks to collect all shards, for each query we perform. The weird thing is that, the way our sharding is configured, we have all chunks before a given cutoff in shardB and all chunks after it in shardA. We use a range query of the form `where shard_key >= CUTOFF` to basically force the query to hit a given shard (to avoid broadcasting). But, since CUTOFF is a very low value, this forces all queries to scan through a ton of chunks. A few ideas:
Pretty sure all these ideas come with a lot of complications, but I think the current implementation suffers from a bottleneck that makes it unusable for a pretty simple use case. As a separate issue, it is worth noting that, once the mongos get to that 100% CPU state due to the problem described above, even stopping all traffic to the cluster does not bring the CPU down. The only solution seemed to be either restarting the mongos or waiting more than 1 hour.
| Comment by Daniele Tessaro [ 13/Oct/21 ] |
|
It's also worth noting that the process of moving chunks back from shard B to shard A, while CPUs are at 100%, is extremely slow. We did another test where we set up the zones so that shard B contained only 1 chunk. While moving that chunk from shard A to shard B took a few seconds, moving it back from shard B to shard A took 10 minutes. This is the sharding status while that 1 chunk is being moved from shard B to shard A:
| Comment by Daniele Tessaro [ 13/Oct/21 ] |
|
This is the sharding status right before we enable the balancer, after having set up the zones:
|