[SERVER-28232] Performance degradation after upgrade from 3.0.11 to 3.2.12 Created: 07/Mar/17  Updated: 08/Jan/24  Resolved: 27/Nov/17

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 3.2.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jason Terpko Assignee: Kelsey Schubert
Resolution: Done Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Related
Operating System: ALL
Participants:

 Description   

Environment: CentOS 6 (kernel 2.6.32), MongoDB Community Edition 3.0.11 and 3.2.12, storage engine MMAPv1, config servers on SCCC

We are experiencing performance degradation when moving from 3.0.11 to 3.2.12: application throughput drops by 5-10x on 3.2.12 compared to 3.0.11. In the past we attempted to upgrade from 3.0.11 to 3.2.8, but because of https://jira.mongodb.org/browse/SERVER-26159 we rolled back to 3.0.11. On 3.2.8 application throughput was fine, but since the mongos processes were randomly crashing due to SERVER-26159 we rolled back to 3.0.11. SERVER-26159 was fixed in 3.2.10, so we attempted the upgrade again, but throughput dropped and we rolled back to 3.0.11 once more. We opened SERVER-26654 about this issue (several other people reported almost the same symptoms), and according to Jira the issue was resolved in 3.2.12. We then attempted to upgrade to 3.2.12, but we saw the same performance degradation as on the 3.2.10 upgrade.

The issue we are seeing in the logs after increasing the verbosity from 1 to 2 is the following:

I ASIO    [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to (node) - ExceededTimeLimit: Operation timed out
D ASIO    [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to execute command: RemoteCommand 23628777 -- target:(node) db:admin cmd:{ isMaster: 1 } reason: ExceededTimeLimit: Operation timed out

The isMaster command keeps timing out on different TaskExecutorPools.

Note: I am not changing "protocolVersion" to 1 after the 3.0.11 to 3.2.12 upgrade, as it makes rollback harder.

We managed to reproduce the issue with sysbench-mongodb on a 10-node sharded cluster running 3.2.12, although not at the scale we are seeing on our production system.

To remedy the issue in testing, we changed the taskExecutorPoolSize value:

Our mongos has 6 CPUs, so I assume it creates 6 connection pools with the defaults. Using a smaller value such as "taskExecutorPoolSize"=2 reduces the timeouts, so it seems the more connection pools I use, the more timeouts I get during the benchmark. When I set "taskExecutorPoolSize"=1, which I believe results in a single connection pool, I do not get the timeouts above.

We also modified ShardingTaskExecutorPoolRefreshTimeoutMS from the default of 20 seconds to 60 seconds, which also eliminated the timeouts.

We combined both settings in production, but unfortunately the timeouts didn't go away and we still saw the same performance degradation.

setParameter:
 ShardingTaskExecutorPoolRefreshTimeoutMS: 60000
 taskExecutorPoolSize: 1
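
A quick way to confirm what a running mongos actually picked up is the following (a minimal sketch, assuming these parameters are readable through getParameter on this 3.2 build):

// Run against the mongos; assumes both parameters are exposed via getParameter on this build.
db.adminCommand({
    getParameter: 1,
    taskExecutorPoolSize: 1,
    ShardingTaskExecutorPoolRefreshTimeoutMS: 1
})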

I want to believe it is not our workload that triggers the performance degradation, as it runs fine on 3.0.11.

The purpose of this ticket is to understand what changed between 3.2.8 and 3.2.12 that might cause the isMaster request between mongos and mongod to fail.

It would be much appreciated if anyone has internals on the change, or is facing the same problem and has found a workaround.

Thanks in advance,

Jason



 Comments   
Comment by Kelsey Schubert [ 27/Nov/17 ]

Since 3.2.15 has been out for some time, and there hasn't been any additional activity on this ticket, I'm going to resolve it. If this is still an issue after upgrading, please let us know and we'll continue to investigate.

Kind regards,
Kelsey

Comment by Kelsey Schubert [ 27/Nov/17 ]

Hi dharshanr@scalegrid.net,

Please feel free to review our documentation on these parameters. If you have suggestions to improve our documentation, would you please open a DOCS ticket?

Thank you,
Kelsey

Comment by Dharshan Rangegowda [ 13/Nov/17 ]

Hi folks,

Is it possible to get the details of connPoolMaxShardedConnsPerHost documented? A few more questions on Sam's comments above:
1) Is this limit per target host (connected to from the mongos), or is it for the mongos server in total?
2) Is the connPoolMaxConnsPerHost parameter required alongside connPoolMaxShardedConnsPerHost? If the latter is the only one that applies to mongos, can we skip connPoolMaxConnsPerHost?
3) Is there only one pool per server, or does it depend on the number of cores?

Comment by Mira Carey [ 23/Jun/17 ]

antogiann,

That sounds good, I'm happy to leave this open until 3.2.15 comes out and revisit then.

Comment by Antonis Giannopoulos [ 22/Jun/17 ]

A brief update:

We are currently deploying 3.2.14 to production clusters with the following settings:

 ShardingTaskExecutorPoolRefreshTimeoutMS: 600000
 ShardingTaskExecutorPoolHostTimeoutMS: 300000000
 ShardingTaskExecutorPoolMinSize: 100

We have started deploying in ascending order of cluster busyness, and so far the workaround is solid.

We would like to keep the ticket open until 3.2.15 is out. Currently there is a cost during a mongos restart, as it has to initialize the connections in advance (fixed cost: ShardingTaskExecutorPoolMinSize * shards * connection pools). It's not a huge issue, but we would like to investigate the impact of SERVER-29237 on this behavior.
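
As a rough illustration of that fixed cost (the figures below are placeholders, not our actual production numbers):

// Hypothetical figures: min pool size 100, 10 shards, 6 connection pools per mongos.
var minSize = 100, shards = 10, poolsPerMongos = 6;
print("connections one mongos opens up-front: " + (minSize * shards * poolsPerMongos)); // 6000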

Thanks,
Ant

Comment by Samantha Ritter (Inactive) [ 17/May/17 ]

Hi Ant,

Yes, you are seeing log messages from two different families of pools here. For historical reasons, we use two separate kinds of connection pools on the write path and on the read path.

The read path uses the NetworkInterfaceASIO pools. All of the ShardingTaskExecutorPool* parameters that you've been working with affect NetworkInterfaceASIO pools only.

On the write path, we use connection pools called DBClient pools. These have a default maximum capacity of 200. The fact that you're seeing so many log messages about connections in this pool being dropped means you are churning connections, because you consistently need more than 200 at any given time. So we open more connections as needed to meet the demands on the pool, but we only ever keep 200 of them around for re-use. I would suggest that you set this pool size to something higher and see if that helps both with your performance issues and with reducing the churn and log messages. You can do this with the following:

--setParameter connPoolMaxConnsPerHost=300

and

--setParameter connPoolMaxShardedConnsPerHost=300

When examining logging, the component part of the message should tell you which part of the system it's coming from. The component is the part of the log line in square brackets. The NetworkInterfaceASIO pools (the read path) log with component NetworkInterfaceASIO, and the DBClient-based (write) pools log with various other components, based on what the pool is being used for.
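
If it helps, both pool families can also be inspected from the shell. A minimal sketch (assuming these diagnostic commands behave this way on your 3.2 build):

// Run against the mongos.
db.adminCommand({connPoolStats: 1})       // on recent 3.2 builds this also lists the NetworkInterfaceASIO-TaskExecutorPool-* pools
db.adminCommand({shardConnPoolStats: 1})  // stats for the DBClient-based sharded pool used on the write path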

Let me know if this makes sense, I know there are a lot of moving parts here.

Thanks,
Sam

Comment by Antonis Giannopoulos [ 17/May/17 ]

I was testing with version 3.2.13 and the following settings:

 ShardingTaskExecutorPoolRefreshTimeoutMS: 600000
 ShardingTaskExecutorPoolHostTimeoutMS: 300000000
 ShardingTaskExecutorPoolMinSize: 100

and I received tons of Informational messages (27436 in 1 hour) of the type:

Ending idle connection to host <hostname>:<port>(with timeout of 0 seconds) because the pool meets constraints; 246 connections to that host remain open

The number of connections in the message varies, but it's constantly over 200.

Is there a maxPoolSize=200 limit enforced on each pool, or is this a different pool?

Thanks,
Ant

Comment by Jon Hyman [ 14/May/17 ]

Hi Jason, Wayne has tried setting various configuration values in our case, to no avail. The last configuration we tried was this:

setParameter:
ShardingTaskExecutorPoolRefreshTimeoutMS: 60000
ShardingTaskExecutorPoolHostTimeoutMS: 300000000
ShardingTaskExecutorPoolMinSize: 125

We did not set a max pool size because I am fearful of limiting connections. We have thousands of processes which connect to our mongoS; in total we run tens of thousands of application server threads which may talk to the database, so I am wary of capping connections at something as small as 200*num_mongos*num_cores_per_mongos.

In thinking about this ticket and our problems: are you continuing to investigate, based on these settings and what works, in order to change how the router behaves, or is your current approach that we should continue to tweak these settings as needed? The problem I see with the latter is that when the ExceededTimeLimit issue occurs, we have downtime of some kind due to read/write failures against the mongoS. It's a really bad situation when that happens.

The instances we're running 3.2.12 on right now have more than an order of magnitude less load than our busiest mongo instances. So even if we get these settings right on one instance, as Antonios has, I am extremely concerned that we'll be relying on fragile configuration that may not support further application server growth or throughput: the 3.2.12 instance we haven't seen the problem on is about 4-10x smaller/less busy than the one Wayne is reporting problems on, which in turn is at least 10x smaller/less busy than our busiest instances.

Comment by Antonis Giannopoulos [ 13/May/17 ]

Hi Jason,

At the moment we run only one 3.2.12 sharded cluster in production with:

ShardingTaskExecutorPoolHostTimeoutMS: 300000000
ShardingTaskExecutorPoolMinSize: 100

We made the change on May 08 and so far we haven't seen any "ExceededTimeLimit: Operation timed out" exceptions. The change seems to be in the right direction, but compared to our other clusters this one is quiet from a traffic perspective, so we can't call it an acceptable workaround before we test it against a busier cluster, which is an action item for the near future.

Beyond the production deployment, I tested various configurations in our test environment, where we are able to reproduce the exception using mongobench.

Before each test I restart the mongos and application servers; the duration of each test is 1 hour.

In the first round of tests I increased the traffic on the cluster gradually by adding more application servers to the test:

Configuration01: We didn't hit the exception

ShardingTaskExecutorPoolHostTimeoutMS: 300000000
ShardingTaskExecutorPoolMinSize: 100

Configuration02: We hit the exception on random times

 ShardingTaskExecutorPoolHostTimeoutMS: 1200000
ShardingTaskExecutorPoolMinSize: 100 

Configuration03: We hit the exception on random times

 ShardingTaskExecutorPoolMinSize: 100 

Configuration04: We hit the exception on random times

 ShardingTaskExecutorPoolHostTimeoutMS: 1200000 

Configuration05: We hit the exception on random times

 ShardingTaskExecutorPoolHostTimeoutMS: 300000000 

Configuration05: We hit the exception on random times

 ShardingTaskExecutorPoolHostTimeoutMS: 1200000
ShardingTaskExecutorPoolMinSize: 100
taskExecutorPoolSize: 1

In the second round of tests I started all application servers at once, in order to simulate a spike:

Configuration06: We hit the exception when the application servers started, for around 30 seconds, and then it didn't appear again.

 ShardingTaskExecutorPoolHostTimeoutMS: 3600000 
ShardingTaskExecutorPoolMinSize: 100 

Configuration07: We hit the exception when the application servers started, for around 1-2 minutes, and then it didn't appear again.

 ShardingTaskExecutorPoolHostTimeoutMS: 300000000 
ShardingTaskExecutorPoolMinSize: 200 

Configuration08: We hit the exception when the application servers started, for around 10 seconds, and then it didn't appear again.

 ShardingTaskExecutorPoolHostTimeoutMS: 300000000 
ShardingTaskExecutorPoolMinSize: 100 

Configuration09: We didn't hit the exception

 ShardingTaskExecutorPoolHostTimeoutMS: '300000000' 
ShardingTaskExecutorPoolMinSize: 100
 ShardingTaskExecutorPoolRefreshTimeoutMS: 600000 

For Configurations 06-08 I assume we got the exception at the beginning because the connection pools weren't fully initialized yet. For Configuration07 (which was the ugliest) I also received exceptions of the type "ExceededTimeLimit: Couldn't get a connection within the time limit", which also leads me to the conclusion that mongos needs time to prepare the pools/connections. For Configuration09 I believe ShardingTaskExecutorPoolRefreshTimeoutMS absorbed the errors but (as expected) it didn't help the mongos prepare the pools faster (the application started reporting traffic at almost the same time as in Configuration08). The only observation I can't explain is why Configuration06 produced more errors than Configuration08; my assumption was that they would produce the same.

The next step was to test Configurations 06-08 without restarting the mongos (not a cold restart). During these tests we didn't hit the exception, I assume because the pools/connections were already there. Additionally, during these tests the application reported traffic faster than in the runs where I restarted the mongos (the application servers were still restarted).

My next set of tests (ongoing) adds various values of "taskExecutorPoolSize" to Configuration09. When I finalize the optimal configuration, which I think will be Configuration09, I am also going to test it against global/partial step-downs.

My conclusion so far is that connection spikes seem to be causing the abnormal behavior, so maintaining open connections is the right path.

I am not sure if in this phase of testing I need to add ShardingTaskExecutorPoolMaxSize to the equation; I might give it a try at a later stage. Theoretically a value of 200 would limit concurrent operations to 200*num_mongos*num_connection_pools, correct? (A back-of-the-envelope check is sketched below.)
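
A back-of-the-envelope check of that formula (the mongos and pool counts below are illustrative figures for a 4-mongos, 6-pools-per-mongos deployment, not measured values):

// Hypothetical sizing check; adjust the figures to your own topology.
var maxPerPool = 200, poolsPerMongos = 6, numMongos = 4;
print("upper bound on concurrent connections to one shard host: " + (maxPerPool * poolsPerMongos * numMongos)); // 4800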

Thanks,
Ant

Comment by Mira Carey [ 12/May/17 ]

jonhyman and wegerer,

Responding here, to your comments on SERVER-28629. I believe this ticket includes the logs related to your problem, so I'd like to keep the conversation here if possible.

I've made an attempt to correlate some events in the logs you uploaded to our support site and I have a story for how 3.2.8 might have helped you, as well as why it may not be a full solution.

I filtered my view down to one of the uploaded app servers, and then to egress connections to one particular shard. I'm omitting the names here as they aren't otherwise available on this ticket and I'm not sure if you want those names to be public. As for what counts as an egress connection, it's anything logged by the task executor containing "Connecting to", "Successfully connected to", or "Failed to connect to".
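
To reproduce this kind of tally from your own mongos logs, here is a rough sketch (assuming the log file is readable from a host with the mongo shell; the path below is just a placeholder, not the file I used):

// Count egress connection events and track the peak depth of the "connecting" queue.
var lines = cat("/path/to/mongos.log").split("\n");  // hypothetical path
var connecting = 0, peak = 0, opened = 0;
lines.forEach(function (l) {
    if (l.indexOf("Connecting to") !== -1) { connecting++; opened++; peak = Math.max(peak, connecting); }
    if (l.indexOf("Successfully connected to") !== -1 || l.indexOf("Failed to connect to") !== -1) { connecting--; }
});
print("connections opened: " + opened + ", peak in connecting state: " + peak);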

Given that view, I see the queue of "connecting" connections rise from 0 to a peak of 46 from 7:09:06 - 7:09:30. 59 total connections are opened during that time.

Then, from 7:12:41 - 7:14:04, 206 total connections are opened, peaking with 159 in the "connecting" state.

Finally, from 7:19:58 - 7:21:12, 160 connections are opened, before the first "ExceededTimeLimit" appears, where previously 261 connections were in the "connecting" state.

That's a behavior pattern that might indicate bursts of traffic, with connection dumping in between (in which case setting min pool size would help by preventing that dumping).

It might also indicate bursts of new traffic, as those log lines would present in the same way if we simply added more load. In that latter case, you might do better to set max pool size, as that would throttle the opening of new connections and prevent the connection spikes that you're clearly experiencing.

If it's a mixture of both, and you don't have too many long running operations, you might be best served by a combination of both approaches:

setParameter=ShardingTaskExecutorPoolHostTimeoutMS=3600000
setParameter=ShardingTaskExecutorPoolMinSize=100
setParameter=ShardingTaskExecutorPoolMaxSize=200

Such a configuration would only drop connections from a pool after an hour without traffic, and would steadily maintain somewhere in the range of 100-200 connections to any given host. It seems that your shards are capable of absorbing around 100 new connections in a burst, but become overwhelmed and enter timeout cycles at around 200.
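
If these parameters are runtime-settable on your build, the same values can be applied without a restart (a sketch under that assumption; otherwise restart the mongos with the flags above):

// Run against the mongos; assumes these parameters accept runtime changes on this build.
db.adminCommand({setParameter: 1, ShardingTaskExecutorPoolHostTimeoutMS: 3600000})
db.adminCommand({setParameter: 1, ShardingTaskExecutorPoolMinSize: 100})
db.adminCommand({setParameter: 1, ShardingTaskExecutorPoolMaxSize: 200})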

In the same vein as my comment to antogiann, I'd be interested in seeing what your logs look like with the various mongos options enabled.

Comment by Mira Carey [ 12/May/17 ]

Hi antogiann,

Now that a few days have passed, how has the workaround described in SERVER-28629 worked out for you?

I'd be interested in knowing if things have stayed stable for you, or what sort of behavior you've been seeing if not.

Regards,
Jason

Comment by Antonis Giannopoulos [ 08/May/17 ]

Hi,

We deployed to production (version 3.2.12) one of the workarounds described in SERVER-28629:

ShardingTaskExecutorPoolHostTimeoutMS: 300000000
ShardingTaskExecutorPoolMinSize: 100

The ShardingTaskExecutorPoolMinSize value came from the fact that the shards were maintaining around 2000 connections prior to the change. We are running 4 mongos and each mongos has 6 CPUs (so 6 connection pools), so I am expecting at least 4*6*100=2400 connections, which is close to what the cluster needs.

There is no particular reason we picked ShardingTaskExecutorPoolHostTimeoutMS=300000000; it was the only value that didn't produce the exception during the benchmarks in our test environment. As testing continues we might be able to determine a lower value for it.

So far I haven't noticed any "Failed to connect to - ExceededTimeLimit: Operation timed out" messages in the logs, but it's only been 4 hours since we made the change. I will have some feedback after 24-48 hours, and once the ShardingTaskExecutorPoolHostTimeoutMS window gets exhausted.
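
A simple way to watch whether the pools stay warm over that window is to track the connection counters on each shard member (a sketch; serverStatus().connections is available on both 3.0 and 3.2):

// Run against each shard member.
db.serverStatus().connections  // { current: ..., available: ..., totalCreated: ... }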

Ant

Comment by Antonis Giannopoulos [ 02/May/17 ]

Hi, I noticed that 3.2.13 was released last week. I am going to try to reproduce the issue against 3.2.13 this week, but I was wondering if there are any logging improvements that might help us reveal the source of the error?

Thanks,
Ant

Comment by Andrew Morrow (Inactive) [ 12/Apr/17 ]

Hi antogiann - If you are running 2.12 then I believe we can rule out the glibc bug as a cause. Thanks for letting us know: it is helpful to us to eliminate that as a potential cause for the performance issue you are experiencing.

Comment by Antonis Giannopoulos [ 11/Apr/17 ]

Hi Thomas

It's 2.12. I am aware of https://bugzilla.kernel.org/show_bug.cgi?id=99671, so I had strace logging the 3.2.12 mongos calls during the benchmark, but I didn't catch any abnormal behavior.
Do you believe this is related to the glibc version? (If I recall from previous tickets, one of the users who reported the issue was using Windows.) Is there any diagnostic that will help us understand whether glibc is the issue or not?

Thanks,
Ant

Comment by Kelsey Schubert [ 11/Apr/17 ]

Hi jonhyman,

We're continuing to investigate this issue. To help us rule out some theories, would you please provide the version of glibc that you are using by executing the following command?

ldd --version

Thank you,
Thomas

Comment by Jon Hyman [ 11/Apr/17 ]

Hi @Thomas Schubert, do you have an update on this ticket? I also wanted to point out SERVER-28629 which may be related to this.

Comment by Kelsey Schubert [ 17/Mar/17 ]

Hi jason_or,

We're continuing to examine the data you've provided. The heavy load I identified previously (a spike in commands/s) does not appear to be correlated to timeouts recorded in the logs. We'll update this ticket as our investigation continues.

Thank you for your patience,
Thomas

Comment by Jason Terpko [ 14/Mar/17 ]

Hi Thomas,

Were you able to review the additional information and provided logs?

Thank you,

Jason

Comment by Jason Terpko [ 09/Mar/17 ]

Hi Thomas,

During the upgrade process the application was paused at times to allow for actions like mongos restarts and step-downs, or if the application queue backlogged. The application was paused at the start of that window to perform a rolling mongos restart; at this time we increased taskExecutorPoolSize from 1 to 2 as a test prior to rolling back.

The application was un-paused at various times during that ~1 hour log sample. Pausing and un-pausing the application during this process is normal, and the clusters handled it just fine pre-3.2.10. A very similar workload (i.e. throughput) ran fine on a different cluster running 3.2.8, but we encountered SERVER-26159 and had to roll back. The application was again un-paused after rolling back to 3.0.11, with no issues reported.

Because the diagnostic files are binary we can't read them; which metric in the file showed significant load? This should be due to new connections from the rolling mongos restart and the un-pausing of the application. I am collecting the mongos log files now to be uploaded to the portal, minus authentication and query log lines.

Note, my team member Antonios may also be commenting on this Jira issue based on availability.

Thank you,

Jason

Comment by Kelsey Schubert [ 08/Mar/17 ]

Hi jason_or,

I see that between 2017-03-02T15:44:35.027 UTC and 2017-03-02T15:46:21.476 UTC, your servers have significantly more load. Can you attempt to correlate this time to the timeouts in the mongos? Are you able to upload the mongos logs to the secure portal as well?

Thank you,
Thomas

Comment by Jason Terpko [ 07/Mar/17 ]

Thomas,

Thank you for confirming the contents of diagnostic.data. I have uploaded metrics from the primaries while running version 3.2.12. According to the logs these were the primaries after the first step-down, but they were secondaries at times during the upgrade process. Looking at the log on one of the mongos servers, these members logged the most occurrences of "ExceededTimeLimit: Operation timed out".

Thank you,

Jason

Comment by Kelsey Schubert [ 07/Mar/17 ]

Yes, that's correct: diagnostic.data does not contain any user data. Please feel free to review the source code to confirm.

I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted.

Kind regards,
Thomas

Comment by Jason Terpko [ 07/Mar/17 ]

Hi Thomas,

Thank you for assisting. Can you confirm the contents of diagnostic.data only include the following data?

serverStatus: db.serverStatus({tcmalloc: true})
replSetGetStatus: rs.status()
collStats for local.oplog.rs: db.getSiblingDB('local').oplog.rs.stats()
getCmdLineOpts: db.adminCommand({getCmdLineOpts: true})
buildInfo: db.adminCommand({buildInfo: true})
hostInfo: db.adminCommand({hostInfo: true})

Can you also please provide a secure method for uploading these files? One that is not accessible by the public.

Jason

Comment by Kelsey Schubert [ 07/Mar/17 ]

Hi jason_or,

Would you please upload an archive of the diagnostic.data directory from a primary node in the sharded cluster and identify when this issue was affecting it?

Thank you,
Thomas
