[SERVER-60680] mongos CPU 100% even with no traffic Created: 13/Oct/21  Updated: 22/Jun/22  Resolved: 27/Jan/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Daniele Tessaro Assignee: Edwin Zhou
Resolution: Incomplete Votes: 0
Labels: mongos
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-11-01 at 6.12.32 PM.png     PNG File Screen Shot 2021-11-19 at 3.04.08 PM.png     PNG File Screen Shot 2021-11-23 at 6.36.19 PM.png     HTML File currentOp     Text File mongo.log     HTML File mongostat     File repro.js    
Operating System: ALL
Participants:

 Description   

We have a big cluster (>7 TB of data) hosting a sharded collection. The collection was created as sharded, but the cluster initially had only 1 shard (call it shard A) and 3 mongos.

Today we added shard B. We disabled the balancer, added the shard to the cluster, defined 2 zones that fully covered the key space, then enabled the balancer.

The zones were defined in such a way that only a few chunks should have been migrated to shard B (as an initial test).

From the moment the balancer was enabled, CPU on the 3 mongos spiked to 200% (100% on both CPUs), causing major disruption to the incoming traffic. The balancer slowly moved chunks from shard A to shard B, at a rate of a few seconds per chunk.

We thought the balancing operation itself was the issue, so we disabled the balancer, but nothing changed.
Then we also completely stopped the workers that were issuing queries to the cluster, but that had no effect: the mongos were still at 200% CPU.

We redefined the zones so that all chunks should have been migrated back to shard A and re-enabled the balancer, because we thought the presence of data on multiple shards was the culprit. This made no difference. Even when all chunks were migrated back to shard A, CPU on mongos was still at 200%.

At this point we restarted the mongos, and CPU went back to 0. After that we could resume serving traffic normally from the mongos, with only shard A.

This procedure was repeated twice, with the same results. The second time it was performed on an upgraded cluster where the machines had 4 CPUs.

I'm attaching some logs taken on a mongos when CPU was at 400% and no traffic was incoming. I took a sample of current ops on the machine, a mongostat report, and a few lines of mongos.log.


 Comments   
Comment by Edwin Zhou [ 27/Jan/22 ]

Hi daniele@spiketrap.io,

After some consideration, I believe we don't have enough information to conclude that there is an improvement to be made to getShardIdsForRange(). SERVER-46487 was created after observing a performance impact, and we were able to collect perf data during that incident. This gave us insight into exactly where CPU time was being spent.

As we're no longer in a position to collect this data, I'm going to close this ticket as Incomplete. I appreciate your cooperation in this investigation, and we're always interested in collecting stack traces during regressions like this.

Best,
Edwin

Comment by Daniele Tessaro [ 24/Nov/21 ]

Hi Edwin. 

You also have to consider that in our scenario the cluster was receiving a heavy load of bulk_write batches (in the hundreds per second, I think), each with possibly a thousand update_one operations like the one described, and at the same time an even heavier load of find queries. All of them were scanning through 600k chunks (since we basically had only 1 chunk in zone B).

I don't know the internals of a range query vs. an equality query, but it is possible that the equality query has some inherent overhead, one that nonetheless scales better than the cost of getShardIdsForRange() when many operations are performed at the same time.

Also, I would argue that since https://jira.mongodb.org/browse/SERVER-46487 was considered a performance issue worth fixing, the problem I reported should be treated the same way, because in my scenario the number of chunks to scan is only one less than in the scenario reported in that ticket. And the way we were trying to use MongoDB sharding is nothing fancy; it's basically what is described here: https://docs.mongodb.com/manual/tutorial/sharding-tiered-hardware-for-varying-slas/


Regards

Daniele

Comment by Edwin Zhou [ 24/Nov/21 ]

Hi daniele@spiketrap.io,
Thank you for your input and cooperation so far.

One reason why you might not be able to reproduce it is that your chunk distribution might not be the one that produces the performance issue. The issue happens when getShardIdsForRange() has to scan through a long list of chunks. If all shards are found early while scanning, the algorithm exits the loop right away. This means that, even if you have 1M chunks to scan, if chunks from shard B and chunks from shard A are interleaved, the algorithm is very likely to only have to look at a small portion of them.

In my repro, I construct the following zones:

tag: zone1  { "a" : 0 } -->> { "a" : 900000 }
tag: zone2  { "a" : 900000 } -->> { "a" : 1000000 }

and I query using:

{ b: i, a: { $gte: 900000 } }

This query is intended to target zone 2, and based on your code analysis, I would expect getShardIdsForRange() to scan the 89k chunks in zone 2 before returning zone 2, for each of the 90010 updates. This would suggest that each update in the range workload carries some overhead that shouldn't appear in the equality workload. However, the data appears to contradict this, since the equality workload takes longer than the range workload.

It's possible that the repro doesn't have enough chunks to reveal this regression, so I'll continue investigating on this front.

To answer your question about the size of the `$in` query, I don't have an exact answer, but that can easily be in the 1000 ballpark.

I would consider that ballpark to be large, and it can be a contributing factor to the regression we saw.

Best,
Edwin

Comment by Daniele Tessaro [ 24/Nov/21 ]

Hi Edwin,

first of all, thanks for looking more into it. Unfortunately I won't be able to provide more logs/support around the issue, because we moved away from the sharding solution and I'm not able to replicate the problem. We could work around this particular issue the way I described (and it worked pretty well for our case, although it's not perfect), but we had other problems with the extreme slowness of moving chunks, so we went back to a monolith approach for now.

That being said, I still think the performance issue is there and the analysis of the code gives a pretty clear explanation of why it is happening.

One reason why you might not be able to reproduce it is that your chunk distribution might not be the one that produces the performance issue. The issue happens when getShardIdsForRange() has to scan through a long list of chunks. If all shards are found early while scanning, the algorithm exits the loop right away. This means that, even if you have 1M chunks to scan, if chunks from shard B and chunks from shard A are interleaved, the algorithm is very likely to only have to look at a small portion of them.
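In pseudo-JavaScript, the scan I'm describing behaves roughly like this (just a conceptual model of the loop, not the actual mongos C++):

// Conceptual model only: walk the chunks overlapping the query range and stop
// early once every shard in the cluster has been seen at least once.
function getShardIdsForRangeModel(chunks, allShardIds, min, max) {
  const shards = new Set();
  for (const chunk of chunks) {                          // chunks sorted by min bound
    if (chunk.max <= min || chunk.min >= max) continue;  // chunk outside the range
    shards.add(chunk.shard);
    if (shards.size === allShardIds.length) break;       // early exit: all shards found
  }
  return shards;
}

With interleaved chunks the early exit triggers almost immediately; with one shard owning a single chunk at the very beginning of the key space, a range that covers the rest of the key space never triggers it.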

So, basically, if you want to reproduce the issue, this is what you should do:

  • Let's say the shard key is an integer value.
  • Shard B should have 1 chunk, covering the shard key range [MinKey, 0).
  • Shard A should have 999,999 chunks, all with shard key >= 0, like [0, n0), [n0, n1), ... (you can easily set up a similar scenario with zones).
  • Now perform an update query using `shard_key >= 0` as the filter.

Now the getShardIdsForRange() function will have to scan 999,999 chunks before returning "ShardA" as the single shard to target. And if a single bulk update performs, say, 1,000 of these updates, we're talking about 1,000 * 999,999 chunk comparisons for a single bulk batch.
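If it helps, something along these lines should set that scenario up (a rough, untested sketch; the database, collection, shard, and zone names are made up):

// Rough, untested sketch of the repro setup described above.
sh.enableSharding("test");
sh.shardCollection("test.coll", { shard_key: 1 });

sh.addShardToZone("shardB", "zoneB");
sh.addShardToZone("shardA", "zoneA");

// Shard B owns only [MinKey, 0); shard A owns everything >= 0.
sh.updateZoneKeyRange("test.coll", { shard_key: MinKey }, { shard_key: 0 }, "zoneB");
sh.updateZoneKeyRange("test.coll", { shard_key: 0 }, { shard_key: MaxKey }, "zoneA");

// Once shard A's range has been split into ~1M chunks (not shown here), a bulk
// of updates filtering on `shard_key >= 0` forces getShardIdsForRange() to walk
// the whole chunk list for every single update in the batch.
const ops = [];
for (let i = 0; i < 1000; i++) {
  ops.push({
    updateOne: {
      filter: { _id: i, shard_key: { $gte: 0 } },
      update: { $set: { field1: "value" } }
    }
  });
}
db.getSiblingDB("test").coll.bulkWrite(ops);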

 To answer your question about the size of the `$in` query, I don't have an exact answer, but that can easily be in the 1000 ballpark.


Comment by Edwin Zhou [ 23/Nov/21 ]

Hi daniele@spiketrap.io,

Thank you for the correction. I adjusted the workload and changed the chunk size from 64MB to 1MB to simulate approx. 1 million chunks.

db.getSiblingDB("config").getCollection("settings").insertOne({_id: "chunksize", value:1})
 
// sh.status() chunks
shard01	900697
shard02	89140

Each document is 533KB and has the following structure:

{
  a: number
  b: number
  x: string
  field_0: string
  .
  .
  field_511: string
}

I perform two update jobs:

// range match
coll.updateOne({ b: i, a: { $gte: 900000 } }, [{$set: {field_0: string}}, ..., {$set: {field_1: string}}]);
// equality match
coll.updateOne({ b: i, a: i}, [{$set: {field_0: string}}, ..., {$set: {field_1: string}}]);

The range-match update job took 17 minutes and 12 seconds, while the equality-match job took 33 minutes and 44 seconds, about twice as long. The range query targeted both shards and the equality query targeted one.

At (A), the range match update begins; at (B), the range match update ends and the equality match update begins; at (C), the equality match update ends.

Based on this test, there doesn't appear to be any evidence of a performance degradation caused by getShardIdsForRange(). However, we will not have conclusive evidence without stack traces, MongoDB logs, and timestamps for the actions that were taken.

One thing that stands out about your previous query is the use of a possibly large $in, which may be affected by SERVER-1007:

my_collection.find({ '$and': [ { time: { '$gte': '2016-01-01' } }, { _id: { '$in': [document_id1, document_id2, ...] } } ] })

Can you confirm the number of terms that was used in the $in?

Best,
Edwin

Comment by Daniele Tessaro [ 19/Nov/21 ]

Hi Edwin.

You mentioned you inserted 500k documents in the collection, but how many chunks were you generating?

Because the complexity of the algorithm is linear in the number of chunks. The number of documents should be irrelevant. 

In my case we had 600k chunks and >9 billion documents.

Comment by Edwin Zhou [ 19/Nov/21 ]

Hi daniele@spiketrap.io,

After some testing, I haven't been able to reproduce a similar regression when querying and updating a range of documents in a zoned sharded collection with 500k documents; in fact, I observed the opposite. The details of this reproduction are as follows:

  1. Create a zoned sharded collection
  2. Insert 500k documents
  3. Perform updates querying on a range of documents; this takes about 2 minutes
  4. Perform updates querying on exact matches; this takes about 3 minutes (see the attached repro.js)

I've also modified the zone and attempted the same update job. Both update jobs took 2 minutes in this case.

We'd need more information to further diagnose this issue. You mentioned that there were periods of time when you paused traffic and CPU usage was still high. If you're able to reproduce this on a separate cluster, can you provide timestamps for when traffic was paused? If you're able to attempt a repeat scenario, can you also provide diagnostic data on all nodes, as well as mongod and mongos logs for all nodes?

One thing that sticks out when reviewing your diagnostic data is that during the period of CPU usage starting at (B), we are performing both queries and updates. When reads are high, writes are low from (B) to (C), and vice versa from (C) to (D). This activity is consistent with the CPU usage.

Best,
Edwin

Comment by Daniele Tessaro [ 02/Nov/21 ]

Hi Edwin.

The queries that were breaking the mongos were bulk update operations. I don't have samples of those queries, but I can tell you that each batch contained many operations, potentially thousands, of this form (apologies if the syntax is slightly off):

{ update_one: {
    filter: { _id: document_id, time: { '$gte': Date('2016-01-01') } },
    update: { '$set': { field1: value1, ..., fieldn: valuen } }
}}

Some more info:

  • None of those field updates changed the time field
  • As I mentioned, there were 2 shards in the cluster. Almost all the data (around 7 billion docs) was in shard A, since we had just started balancing with zones, where shard A is assigned the zone >= '2016-01-01' and shard B is assigned < '2016-01-01'.
  • ShardA has around 600k chunks

The way we fixed the operation was to change the bulk update to something like:

{ update_one: {
    filter: { _id: document_id, time: document_time },
    update: { '$set': { field1: value1, ..., fieldn: valuen } }
}}

Looking at the ChunkManager code, it seems pretty clear that, in our case, it has to do a linear scan through the list of 600k chunks for every single update in a bulk operation. And that can definitely kill the process.
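In mongo shell syntax, the before/after looks roughly like this (placeholder values only, not our actual code):

// Placeholder values, just to show the shape of the two filters.
const documentId = ObjectId();
const documentTime = ISODate("2021-06-01");

// Problematic form: the range predicate on the shard key forces mongos to walk
// the chunk list to figure out which shards own [2016-01-01, MaxKey).
db.my_collection.bulkWrite([{
  updateOne: {
    filter: { _id: documentId, time: { $gte: ISODate("2016-01-01") } },
    update: { $set: { field1: "value1" } }
  }
}]);

// Fixed form: an equality on the full shard key lets mongos resolve the update
// to a single chunk, and therefore a single shard, without the linear scan.
db.my_collection.bulkWrite([{
  updateOne: {
    filter: { _id: documentId, time: documentTime },
    update: { $set: { field1: "value1" } }
  }
}]);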

Note:
We didn't only change the way we perform updates; reads were changed too. Initially, to get documents by id, we were doing something like this (syntax might be broken):

my_collection.find({ '$and': [ { time: { '$gte': '2016-01-01' } }, { _id: { '$in': [document_id1, document_id2, ...] } } ] })

But now we're doing

my_collection.find( {'$or': [ {_id: document_id1, time: document_time1}, {_id: document_id2, time: document_time2}]})

The first time we had the issue, both writes and reads were in that problematic form, and we fixed them both.
But then a bad code change caused the writes to go back to the problematic form, and the writes alone were able to bring the mongos down. So I think you can focus on writes.

Comment by Edwin Zhou [ 01/Nov/21 ]

Hi daniele@spiketrap.io,

After looking through the diagnostic data files, there isn't anything glaring that supports your hypothesis that range queries are the culprit for the mongos being at 100% CPU utilization. However, I acknowledge that you were able to remediate the issue by optimizing your queries; I'm interested in understanding those changes and would like to attempt to reproduce the issue.

  • Can you provide an example of the query that caused this issue, and a query that remediated this issue?
  • If this behavior occurs again, can you collect stack traces with gdb on the stalled mongos?

    # collect stack trace samples with gdb on the affected mongos
    gdb -p $(pidof mongos) -batch -ex 'thread apply all bt' > gdb.txt
    

Best,
Edwin

Comment by Daniele Tessaro [ 01/Nov/21 ]

Uploaded, thanks!

Comment by Edwin Zhou [ 01/Nov/21 ]

Hi daniele@spiketrap.io,

Thank you for uploading the additional data, and I apologize for the back and forth. Can you also upload the diagnostic.data directory for your CSRS?

Best,
Edwin

Comment by Daniele Tessaro [ 29/Oct/21 ]

I attached the diagnostic.data from the other 2 mongos and the 6 mongod instances (3 replicas x 2 shards).
Unfortunately we don't have enough retention for the mongos.log and mongod.log files to cover the October 28 incident.

Comment by Edwin Zhou [ 29/Oct/21 ]

Hi daniele@spiketrap.io,

Thank you for uploading the diagnostic.data directory for one mongos. I also neglected to ask for the mongos.log files covering the October 28 incident. Can you attach those as well?

Can you also attach the diagnostic.data directory and logs for each sharded replica set and the remaining 2 mongos?

Best,
Edwin

Comment by Daniele Tessaro [ 29/Oct/21 ]

I have uploaded the diagnostic.data file.

Comment by Daniele Tessaro [ 29/Oct/21 ]

Thanks for your reply. A few more datapoints:
1. Since then, we have fixed the issue by replacing range queries on the shard key with exact queries where possible. This completely fixed the problem (although it's not great from the application side). Note that we were performing a ton of bulk updates with very large batches, with filters looking like: {_id: ..., shard_key > CUTOFF }.

2. I believe the persistent 100% CPU even with no traffic was possibly due to old operations still queued in the mongos and stuck in that shard-resolution algorithm, if that makes sense. Basically, we noticed that the CPU would only go down either by restarting the mongos or by waiting long enough (even 1 hr).

3. Coincidentally, a release with bad code yesterday caused the problem again. The code introduced a new surge of bulk updates similar to those described in point 1, and CPU immediately spiked to 100%. We rolled back within 15 minutes. One mongos was restarted and was operational right away. For the 2 other mongos we just waited, and it took around 15 minutes for things to get back to normal. I think the longer you stay in the state where you're flooding the mongos with these bad range queries, the longer it takes for the mongos to return to a normal state without a restart.

4. I can upload what I currently have in diagnostic.data, but I'm not sure whether it will contain info on the event we had yesterday (I'm not sure how long it retains data). We cannot reproduce the issue on purpose, because that would disrupt our systems.

Honestly, the way I see this problem, and looking at the mongos code, it should not be too complicated to introduce an in-memory cache for the

{chunk => shard}

mapping, one that can be invalidated every time a chunk is moved to a different shard.
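
Just to illustrate the idea, a naive sketch in plain JavaScript (nothing to do with the real ChunkManager code):

// Naive illustration of the caching idea: memoize the shards resolved for a
// query range and throw the whole cache away whenever the routing table
// changes, e.g. because a chunk was migrated to a different shard.
const rangeToShards = new Map();          // "[min,max]" -> array of shard ids
let cachedRoutingVersion = null;

function getShardIdsForRangeCached(routingVersion, min, max, scanChunks) {
  if (routingVersion !== cachedRoutingVersion) {
    rangeToShards.clear();                // invalidate after any chunk move
    cachedRoutingVersion = routingVersion;
  }
  const key = JSON.stringify([min, max]);
  if (!rangeToShards.has(key)) {
    rangeToShards.set(key, scanChunks(min, max));   // the expensive linear scan
  }
  return rangeToShards.get(key);
}

With our workload, where the same `shard_key >= CUTOFF` range is repeated for every operation, even this crude version would reduce the scans to one per routing-table change.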

Otherwise, as I said before, you could change the data structure that keeps the mapping so that contiguous chunks are merged into one. That would reduce the number of chunks enormously in many cases like ours.

Comment by Edwin Zhou [ 29/Oct/21 ]

Hi daniele@spiketrap.io,

I appreciate your patience while I investigate this issue. Are you able to reliably reproduce this problem? From what I understand, you're seeing CPU usage at 100% on all mongos even after balancing is disabled and traffic is no longer being accepted.

As you stated in your last comment, the possible culprit is a range query which should target only one shard, i.e., a query with a range of (Cutoff, MaxKey] that ends up scanning through multiple shards' chunks. We solved a similar scenario in SERVER-46487 for the case where we know a shard contains [MinKey, MaxKey], but it would make sense to apply similar behavior when we know a query should target a specific shard zone.

To help us understand the reason for the mongos' CPU usage and investigate your hypothesis, we will need to take a look at the $dbpath/diagnostic.data directory (the contents are described here). Would you please archive (tar or zip) the $dbpath/diagnostic.data directory and upload it to this upload portal? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Best,
Edwin

Comment by Daniele Tessaro [ 14/Oct/21 ]

OK, I believe this Jira ticket might be the culprit:

https://jira.mongodb.org/browse/SERVER-46487

We're doing a lot of range queries with very large ranges. That is probably causing the mongos to scan through the list of chunks to collect all shards for each query we perform.
The ticket is reported as closed, but I think the resolution proposed there only returns all shards when the range is MinKey => MaxKey, which is a trivial case.

The weird thing is that, the way our cluster is configured, all chunks before a given cutoff are on shard B and all chunks after it are on shard A. We use a range query of the form `shard_key >= CUTOFF` to basically force the query to hit a given shard (to avoid broadcasting). But since CUTOFF is a very low value, this forces every query to scan through a ton of chunks.
Isn't there a way to work around this?

A few ideas:

  1. Explicitly force mongos to use a shard through some directive, rather than through the range query (although this could create inconsistencies for chunks that are being migrated)
  2. Improve getShardIdsForRange() to use a more compact representation of the chunks. In theory, in our specific case, I would expect the chunks => shard map to have just 2 entries: { [MinKey, Cutoff) => ShardB, [Cutoff, MaxKey) => ShardA }; that is to say, contiguous chunks could be collapsed to make the lookup much faster (see the sketch below).

  3. Improve getShardIdsForRange() with information coming from the zone definitions. If a range is completely included in a zone's range, then it should resolve only to the shards attached to that zone (modulo the presence of chunks that are not compliant with the zone, or are in flight, which should be handled separately).

Pretty sure all these ideas come with a lot of complications, but I think the current implementation suffers from a bottleneck that makes it unusable for a pretty simple use case.
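
To make idea 2 concrete, here is a naive sketch (plain JavaScript, purely illustrative, not the real data structure):

// Merge adjacent chunks that belong to the same shard into a single range.
// `chunks` is assumed sorted by its min bound, each entry { min, max, shard }.
function collapseChunks(chunks) {
  const merged = [];
  for (const chunk of chunks) {
    const last = merged[merged.length - 1];
    if (last && last.shard === chunk.shard && last.max === chunk.min) {
      last.max = chunk.max;               // extend the previous merged range
    } else {
      merged.push({ min: chunk.min, max: chunk.max, shard: chunk.shard });
    }
  }
  return merged;
}

In our case the 600k chunks spread over 2 zones would collapse into roughly 2 entries, so a range lookup would have almost nothing left to scan.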

As a separate issue, it is worth noting that once the mongos gets to that 100% CPU state due to the problem described above, even stopping every source of traffic to the cluster does not bring the CPU down. The only solution seemed to be either restarting the mongos or waiting more than 1 hour.

Comment by Daniele Tessaro [ 13/Oct/21 ]

It's also worth noting that the process of moving chunks back from shard B to shard A, while CPUs are at 100%, is extremely slow. We did another test where we set up the zones so that shard B contained only 1 chunk. While moving that chunk from shard A to shard B took a few seconds, moving it back from shard B to shard A took 10 minutes.

 This is the sharding status while the 1 chunk is being moved from shard B to shard A.

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
  	"_id" : 1,
  	"minCompatibleVersion" : 5,
  	"currentVersion" : 6,
  	"clusterId" : ObjectId("60a06705eeeffab125d9e6df")
  }
  shards:
        {  "_id" : "raw_shard_arsn0",  "host" : "raw_shard_arsn0/<hostnames>",  "state" : 1,  "tags" : [ "recent" ] }
        {  "_id" : "raw_shard_brsn0",  "host" : "raw_shard_brsn0/<hostnames>",  "state" : 1,  "tags" : [ "historical" ] }
  active mongoses:
        "4.4.9" : 3
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  yes
        Collections with active migrations:
                <database_name>.<collection_name> started at Wed Oct 13 2021 15:46:37 GMT-0700 (PDT)
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                627 : Success
  databases:
        {  "_id" : "<database>",  "primary" : "raw_shard_arsn0",  "partitioned" : true,  "version" : {  "uuid" : UUID("a7bc344d-fe5c-4926-a13c-df12b83ac439"),  "lastMod" : 1 } }
        {  "_id" : "config",  "primary" : "config",  "partitioned" : true }
                config.system.sessions
                        shard key: { "_id" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                raw_shard_arsn0	512
                                raw_shard_brsn0	512
                        too many chunks to print, use verbose if you want to force print
        {  "_id" : "<database_name>",  "primary" : "raw_shard_arsn0",  "partitioned" : true,  "version" : {  "uuid" : UUID("d04e1dc0-4c86-4909-8c0a-c9a97ed0b7a3"),  "lastMod" : 1 } }
                <database_name>.<collection_name>
                        shard key: { "time" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                raw_shard_arsn0	593007
                                raw_shard_brsn0	1
                        too many chunks to print, use verbose if you want to force print
                         tag: recent  { "time" : { "$minKey" : 1 } } -->> { "time" : { "$maxKey" : 1 } }


Comment by Daniele Tessaro [ 13/Oct/21 ]

This is the sharding status right before we enable the balancer, after having setup the zones:


mongos> sh.status()
--- Sharding Status ---
  sharding version: {
  	"_id" : 1,
  	"minCompatibleVersion" : 5,
  	"currentVersion" : 6,
  	"clusterId" : ObjectId("60a06705eeeffab125d9e6df")
  }
  shards:
        {  "_id" : "raw_shard_arsn0",  "host" : "raw_shard_arsn0/<hostnames>",  "state" : 1,  "tags" : [ "recent" ] }
        {  "_id" : "raw_shard_brsn0",  "host" : "raw_shard_brsn0/<hostnames>",  "state" : 1,  "tags" : [ "historical" ] }
  active mongoses:
        "4.4.9" : 3
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  yes
        Collections with active migrations:
                <database_name>.<collection_name> started at Wed Oct 13 2021 15:39:53 GMT-0700 (PDT)
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                626 : Success
  databases:
        {  "_id" : "<database>",  "primary" : "raw_shard_arsn0",  "partitioned" : true,  "version" : {  "uuid" : UUID("a7bc344d-fe5c-4926-a13c-df12b83ac439"),  "lastMod" : 1 } }
        {  "_id" : "config",  "primary" : "config",  "partitioned" : true }
                config.system.sessions
                        shard key: { "_id" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                raw_shard_arsn0	512
                                raw_shard_brsn0	512
                        too many chunks to print, use verbose if you want to force print
        {  "_id" : "<database_name>",  "primary" : "raw_shard_arsn0",  "partitioned" : true,  "version" : {  "uuid" : UUID("d04e1dc0-4c86-4909-8c0a-c9a97ed0b7a3"),  "lastMod" : 1 } }
                 <database_name>.<collection_name>
                        shard key: { "time" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                raw_shard_arsn0	593002
                        too many chunks to print, use verbose if you want to force print
                         tag: historical  { "time" : { "$minKey" : 1 } } -->> { "time" : ISODate("2003-01-01T00:00:00Z") }
                         tag: recent  { "time" : ISODate("2003-01-01T00:00:00Z") } -->> { "time" : { "$maxKey" : 1 } }
