[SERVER-28232] Performance degradation after upgrade from 3.0.11 to 3.2.12 Created: 07/Mar/17 Updated: 08/Jan/24 Resolved: 27/Nov/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance |
| Affects Version/s: | 3.2.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jason Terpko | Assignee: | Kelsey Schubert |
| Resolution: | Done | Votes: | 3 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Environment: CentOS 6 (kernel 2.6.32), MongoDB Community edition 3.0.11 and 3.2.12, MMAPv1 storage engine, config servers deployed as SCCC.

We are experiencing performance degradation when moving from 3.0.11 to 3.2.12: application throughput is reduced by 5-10 times on 3.2.12 compared to 3.0.11. In the past we attempted to upgrade from 3.0.11 to 3.2.8, but due to https://jira.mongodb.org/browse/SERVER-26159 we rolled back to 3.0.11. On 3.2.8 the application throughput was fine, but the mongos were randomly crashing because of that bug. The issue we are seeing in the logs after increasing the verbosity from 1 to 2 is the following:
Note: I am not changing the "protocolVersion" to 1 after the 3.0.11 to 3.2.12 upgrade, as it makes a rollback harder. We managed to reproduce the issue with sysbench-mongodb using 3.2.12 on a 10-node sharded cluster, although not at the scale we are seeing on our production system. To remedy the issue in testing we changed the taskExecutorPoolSize value. Our mongos has 6 CPUs, so I assume it creates 6 connection pools with the defaults. Using a smaller value like "taskExecutorPoolSize"=2 reduces the timeouts, so it seems the more connection pools I use, the more timeouts I get during the benchmark. When I set "taskExecutorPoolSize"=1, which I believe sets up a single connection pool, I do not get the above timeouts. We also modified ShardingTaskExecutorPoolRefreshTimeoutMS from the default 20 seconds to 60 seconds, which also eliminated the timeouts. We combined both changes in production, but unfortunately the timeouts didn't go away and we still noticed the same performance degradation.
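For reference, a minimal sketch of how these two knobs could be set in the mongos configuration file; the values simply mirror the experiments described above and are illustrative, not a recommendation:

```yaml
# mongos configuration file (YAML) -- values mirror the experiments above, illustrative only
setParameter:
  taskExecutorPoolSize: 1                          # single connection pool; default is one pool per CPU core
  ShardingTaskExecutorPoolRefreshTimeoutMS: 60000  # raise the refresh timeout from the default 20s to 60s
```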
I want to believe that it is not our workload that is triggering the performance degradation, as it operates fine on 3.0.11. The purpose of this thread is to understand what has changed between 3.2.8 and 3.2.12 that might cause the isMaster request to fail between mongod and mongos. It would be much appreciated if anyone has internals on the change, or is facing the same problem and has found a workaround. Thanks in advance, Jason |
| Comments |
| Comment by Kelsey Schubert [ 27/Nov/17 ] | |||||||||||||||||||||
|
Since 3.2.15 has been out for some time, and there hasn't been any additional activity on this ticket, I'm going to resolve it. If this is still an issue after upgrading, please let us know and we'll continue to investigate. Kind regards, | |||||||||||||||||||||
| Comment by Kelsey Schubert [ 27/Nov/17 ] | |||||||||||||||||||||
|
Please feel free to review our documentation on these parameters. If you have suggestions to improve our documentation, would you please open a DOCS ticket? Thank you, | |||||||||||||||||||||
| Comment by Dharshan Rangegowda [ 13/Nov/17 ] | |||||||||||||||||||||
|
Hi folks, Is it possible to get the details of the connPoolMaxShardedConnsPerHost documented? A few more questions on Sam's comments above | |||||||||||||||||||||
| Comment by Mira Carey [ 23/Jun/17 ] | |||||||||||||||||||||
|
That sounds good, I'm happy to leave this open until 3.2.15 comes out and revisit then. | |||||||||||||||||||||
| Comment by Antonis Giannopoulos [ 22/Jun/17 ] | |||||||||||||||||||||
|
A brief update: We are currently deploying 3.2.14 with the following settings to production clusters
We have started the deployment in ascending order of how busy each cluster is, and so far the workaround is solid. We would like to keep the ticket open until 3.2.15 is out. Currently there is a cost during a mongos restart, as it has to initialize the connections in advance (fixed cost: ShardingTaskExecutorPoolMinSize * shards * connection pools). It's not a huge issue, but we would like to investigate the impact of

Thanks, |
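As a rough illustration of that fixed restart cost, here is a hypothetical calculation and configuration; the shard count and both values below are assumptions for illustration, not figures confirmed in this comment:

```yaml
# Illustrative only: per mongos, the pools are pre-warmed to roughly
#   ShardingTaskExecutorPoolMinSize x (number of shards) x (number of connection pools),
# e.g. 100 x 10 shards x 6 pools = 6000 connections opened up front.
setParameter:
  ShardingTaskExecutorPoolMinSize: 100   # assumed value, matching the workaround discussed elsewhere in this ticket
  taskExecutorPoolSize: 6                # default is one pool per CPU core (6 cores in this deployment)
```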
| Comment by Samantha Ritter (Inactive) [ 17/May/17 ] | |||||||||||||||||||||
|
Hi Ant, Yes, you are seeing log messages from two different families of pools here. For historical reasons, we use two separate kinds of connection pools on the write path and on the read path. The read path uses the NetworkInterfaceASIO pools; all of the ShardingTaskExecutorPool* parameters that you've been working with affect NetworkInterfaceASIO pools only. On the write path, we use connection pools called DBClient pools. These have a default maximum capacity of 200. The fact that you're seeing so many log messages about connections in this pool being dropped means you are churning connections, because you consistently need more than 200 at any given time. So we open more connections as needed to meet the demands on the pool, but we only ever keep 200 of them around for re-use. I would suggest that you set this pool size to something higher and see if that helps with your performance issues and decreases the churn and log messages. You can do this with the following:
and
When examining logging, the component part of the message should tell you which part of the system it's coming from. The component is the part of the log line in square brackets. The NetworkInterfaceASIO pools (the read path) log with component NetworkInterfaceASIO, and the DBClient-based (write) pools log with various other components, based on what the pool is being used for. Let me know if this makes sense, I know there are a lot of moving parts here. Thanks, | |||||||||||||||||||||
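A hedged sketch of raising that write-path (DBClient) pool ceiling, using the mongos parameters that govern those pools (connPoolMaxShardedConnsPerHost and connPoolMaxConnsPerHost, both defaulting to 200); the value 500 below is purely illustrative:

```yaml
# Illustrative values only -- raise the DBClient (write-path) per-host pool ceilings
setParameter:
  connPoolMaxShardedConnsPerHost: 500   # sharded connection pool cap per host (default 200)
  connPoolMaxConnsPerHost: 500          # plain connection pool cap per host (default 200)
```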
| Comment by Antonis Giannopoulos [ 17/May/17 ] | |||||||||||||||||||||
|
I was testing with version 3.2.13 and the following settings:
and I received tons of Informational messages (27436 in 1 hour) of the type:
The number of connections in the message varies, but it is constantly over 200. Is there a maxPoolSize=200 limit enforced on each pool, or is this a different pool? Thanks, |
| Comment by Jon Hyman [ 14/May/17 ] | |||||||||||||||||||||
|
Hi Jason, Wayne has tried setting various configuration values in our case to no avail. The last configuration we tried was this:

setParameter:

We did not set a max pool size; I am fearful of limiting connections. We have thousands of processes which connect to our mongoS; in total we run tens of thousands of application server threads which may talk to the database, and I am fearful of limiting connections to something as small as 200*num_mongos*num_cores_per_mongos.

In thinking about this ticket and our problems, are you continuing to do more investigation based on these settings and what works, in order to make changes to how the router works, or is your current approach that we should continue to tweak these settings as needed? The problem I see with the latter is that when the ExceededTimeLimit issue occurs, we have downtime of some kind due to read/write failures to the mongoS. It's a really bad situation when that happens.

The instances we're running 3.2.12 on right now have lower load, by more than an order of magnitude, than our busiest mongo instances. So I am extremely concerned that even if we get these settings right on one instance, as Antonios has, we're going to be relying on fragile configuration that may not support further application server growth or throughput: we now have a 3.2.12 instance that we haven't seen the problem on, but it's about 4-10x smaller/less busy than the one Wayne is reporting problems on, which is in turn at least 10x smaller/less busy than our busiest instances. |
| Comment by Antonis Giannopoulos [ 13/May/17 ] | |||||||||||||||||||||
|
Hi Jason, At the moment we run only one 3.2.12 sharded cluster in production with:
We made the change on May 08 and so far we haven't seen any "ExceededTimeLimit: Operation timed out" exception. The change seems to be in the right direction, but compared to our other clusters this one is considered quiet from a traffic perspective, so we can't assume it is an acceptable workaround before we test it against a busier cluster, which is an action item for the near future.

Beyond the production deployment, I tested various configurations in our test environment, where we are able to reproduce the exception using mongobench. Before each test I restart the mongos and the application servers, and the duration of each test is 1 hour. On the first round of tests I increased the traffic on the cluster gradually by adding more application servers to the test:

Configuration01: We didn't hit the exception
Configuration02: We hit the exception at random times
Configuration03: We hit the exception at random times
Configuration04: We hit the exception at random times
Configuration05: We hit the exception at random times
Configuration05: We hit the exception at random times
On the second round of tests I started all application servers at once, in order to simulate a spike:

Configuration06: We hit the exception for around 30 seconds when the application servers started, and then it didn't appear again.
Configuration07: We hit the exception for around 1-2 minutes when the application servers started, and then it didn't appear again.
Configuration08: We hit the exception for around 10 seconds when the application servers started, and then it didn't appear again.
Configuration09: We didn't hit the exception
For configurations 06-08 I assume we got the exception in the beginning because the connection pool wasn't fully initialized yet. For Configuration07 (which was the ugliest) I also received exceptions of the type "ExceededTimeLimit: Couldn't get a connection within the time limit", which also led me to the conclusion that mongos needs time to prepare the pools/connections. For Configuration09 I believe the ShardingTaskExecutorPoolRefreshTimeoutMS absorbed the errors, but (as expected) it didn't help the mongos prepare the pools faster (the application started reporting traffic at almost the same time as Configuration08). The only observation I can't explain is why Configuration06 produced more errors than Configuration08; my assumption was that they would produce the same.

The next step was to test configurations 06-08 without restarting the mongos (not a cold restart). During these tests we didn't hit the exception, I assume because the pools/connections were already there. Additionally, during these tests the application reported traffic sooner than in the runs where I restarted the mongos (the application servers were still restarted). My next set of tests (ongoing) is trying various values of "taskExecutorPoolSize" with Configuration09. When I finalize the optimal configuration, which I think will be Configuration09, I am also going to test it against global/partial step-downs.

My conclusion so far is that connection spikes seem to be causing the abnormal behavior, so maintaining open connections is the right path. I am not sure whether at this phase of testing I need to add ShardingTaskExecutorPoolMaxSize to the equation; I might give it a try in later stages. Theoretically, a value of 200 will limit the concurrent operations to 200*num_mongos*num_connection_pools, correct? Thanks, |
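If that reading is correct, then with the figures quoted elsewhere in this ticket (4 mongos, 6 pools per mongos) a cap of 200 would work out to roughly 200 x 4 x 6 = 4800 concurrent connections per shard host. A hypothetical sketch of such a cap, not a recommendation from this ticket:

```yaml
# Hypothetical cap on the read-path (NetworkInterfaceASIO) pools.
# Approximate ceiling per shard host:
#   ShardingTaskExecutorPoolMaxSize x (number of mongos) x (pools per mongos)
#   = 200 x 4 x 6 = 4800 connections
setParameter:
  ShardingTaskExecutorPoolMaxSize: 200
```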
| Comment by Mira Carey [ 12/May/17 ] | |||||||||||||||||||||
|
Responding here, to your comments on

I've made an attempt to correlate some events in the logs you uploaded to our support site, and I have a story for how 3.2.8 might have helped you, as well as why it may not be a full solution. I filtered my view down to one of the uploaded app servers, and then to egress connections to one particular shard. I'm omitting the names here as they aren't otherwise available on this ticket and I'm not sure if you want those names to be public. As for what counts as an egress connection, it's anything logged by the task executor with "Connecting to", "Successfully connected to" or "Failed to connect to".

Given that view, I see the queue of "connecting" connections rise from 0 to a peak of 46 from 7:09:06 - 7:09:30; 59 total connections are opened during that time. Then, from 7:12:41 - 7:14:04, 206 total connections are opened, peaking with 159 in the "connecting" state. Finally, from 7:19:58 - 7:21:12, 160 connections are opened before the first "ExceededTimeLimit" appears, where previously 261 connections were in the "connecting" state.

That's a behavior pattern that might indicate bursts of traffic, with connection dumping in between (in which case setting min pool size would help by preventing that dumping). It might also indicate bursts of new traffic, as those log lines would present in the same way if we simply added more load. In that latter case, you might do better to set max pool size, as that would throttle the opening of new connections and prevent the connection spikes that you're clearly experiencing. If it's a mixture of both, and you don't have too many long-running operations, you might be best served by a combination of both approaches:
Such a configuration would only drop connections from a pool after an hour without traffic and would steadily maintain in the range of 100-200 connections to any given host. It seems that your shards are capable of absorbing around 100 new connections in a burst, but drown out and enter timeout cycles at around 200. In the same vein as my comment to antogiann, I'd be interested in seeing what your logs look like with the various mongos options enabled. | |||||||||||||||||||||
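A minimal sketch of that combined approach, using the figures implied by the description above (a floor of 100 connections, a cap of 200, and a one-hour host timeout); these values are inferred from the text and are illustrative rather than a confirmed recommendation:

```yaml
# Illustrative combination: keep a warm floor of connections, cap bursts,
# and only drop an idle pool after an hour without traffic.
setParameter:
  ShardingTaskExecutorPoolMinSize: 100
  ShardingTaskExecutorPoolMaxSize: 200
  ShardingTaskExecutorPoolHostTimeoutMS: 3600000   # 1 hour
```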
| Comment by Mira Carey [ 12/May/17 ] | |||||||||||||||||||||
|
Hi antogiann, Now that a few days have passed, how has the workaround described in I'd be interested in knowing if things have stayed stable for you, or what sort of behavior you've been seeing if not. Regards, | |||||||||||||||||||||
| Comment by Antonis Giannopoulos [ 08/May/17 ] | |||||||||||||||||||||
|
Hi, We deployed in production (version 3.2.12) one of the workarounds described on
The ShardingTaskExecutorPoolMinSize came from the fact that the shards were maintaining around 2000 connections prior to the change. We are running with 4 mongos and each mongos has 6 CPUs (so 6 connection pools), so I am expecting at least 4*6*100=2400 connections, which is close to what the cluster needs. There is no good reason why we picked ShardingTaskExecutorPoolHostTimeoutMS=300000000; it was the only value that didn't produce the exception during the benchmarks in our test environment, but as the tests continue we might be able to determine a lower value for it. So far I haven't noticed any "Failed to connect to - ExceededTimeLimit: Operation timed out" in the logs, but it's only 4 hours since we made the change. I will have some feedback after 24-48 hours and when the ShardingTaskExecutorPoolHostTimeoutMS gets exhausted. Ant |
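For reference, a hedged reconstruction of the deployed settings implied by the numbers in this comment (a minimum pool size of 100 and the 300000000 ms host timeout); the configuration-file layout here is an assumption:

```yaml
# Reconstructed from the values quoted above -- illustrative only
setParameter:
  ShardingTaskExecutorPoolMinSize: 100              # 4 mongos x 6 pools x 100 ~= 2400 connections
  ShardingTaskExecutorPoolHostTimeoutMS: 300000000  # roughly 3.5 days; chosen empirically during testing
```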
| Comment by Antonis Giannopoulos [ 02/May/17 ] | |||||||||||||||||||||
|
Hi, I noticed that 3.2.13 was released last week. I am going to try to reproduce the issue against 3.2.13 this week, but I was wondering if there are any improvements in logging that might help us reveal the source of the error? Thanks, |
| Comment by Andrew Morrow (Inactive) [ 12/Apr/17 ] | |||||||||||||||||||||
|
Hi antogiann - If you are running 2.12 then I believe we can rule out the glibc bug as a cause. Thanks for letting us know: it is helpful to us to eliminate that as a potential cause for the performance issue you are experiencing. | |||||||||||||||||||||
| Comment by Antonis Giannopoulos [ 11/Apr/17 ] | |||||||||||||||||||||
|
Hi Thomas, It's 2.12. I am aware of https://bugzilla.kernel.org/show_bug.cgi?id=99671, so I had strace logging the 3.2.12 mongos calls during the benchmark, but I didn't catch any abnormal behavior. Thanks, |
| Comment by Kelsey Schubert [ 11/Apr/17 ] | |||||||||||||||||||||
|
Hi jonhyman, We're continuing to investigate this issue. To help us rule out some theories, would you please provide the version of glibc that you are using by executing the following command?
Thank you, | |||||||||||||||||||||
| Comment by Jon Hyman [ 11/Apr/17 ] | |||||||||||||||||||||
|
Hi @Thomas Schubert, do you have an update on this ticket? I also wanted to point out | |||||||||||||||||||||
| Comment by Kelsey Schubert [ 17/Mar/17 ] | |||||||||||||||||||||
|
Hi jason_or, We're continuing to examine the data you've provided. The heavy load I identified previously (a spike in commands/s) does not appear to be correlated to timeouts recorded in the logs. We'll update this ticket as our investigation continues. Thank you for your patience, | |||||||||||||||||||||
| Comment by Jason Terpko [ 14/Mar/17 ] | |||||||||||||||||||||
|
Hi Thomas, Were you able to review the additional information and provided logs? Thank you, Jason | |||||||||||||||||||||
| Comment by Jason Terpko [ 09/Mar/17 ] | |||||||||||||||||||||
|
Hi Thomas, During the upgrade process the application was paused at times to allow for actions like mongos restarts and step-downs, or when the application queue backlogged. The application was paused at the start of that window to perform a rolling mongos restart; at this time we increased taskExecutorPoolSize from 1 to 2 as a test prior to rolling back. The application was un-paused at various times during that ~1 hour log sample. Pausing and un-pausing the application during this process is normal, and the clusters handle this just fine pre-3.2.10. A very similar workload (i.e. throughput) ran fine on a different cluster running 3.2.8, but we encountered

Because the diagnostic files are binary we can't read them; what metric in the file showed significant load? This should be due to new connections from the rolling mongos restart and the un-pausing of the application. I am collecting the mongos log files now to be uploaded to the portal, minus authentication and query log lines. Note, my team member Antonios may also be commenting on this Jira issue based on availability. Thank you, Jason |
| Comment by Kelsey Schubert [ 08/Mar/17 ] | |||||||||||||||||||||
|
Hi jason_or, I see that between 2017-03-02T15:44:35.027 UTC and 2017-03-02T15:46:21.476 UTC, your servers have significantly more load. If not, can you attempt to correlate this time to the timeouts in the mongos? Are you able to upload the mongos logs to the secure portal as well? Thank you, |
| Comment by Jason Terpko [ 07/Mar/17 ] | |||||||||||||||||||||
|
Thomas, Thank you for confirming the contents of diagnostic.data. I have uploaded metrics from the primaries while running version 3.2.12. According to the log these were the primaries after the first step-down, but they were secondaries at times during the upgrade process. Looking at the log on one of the mongos servers, these members logged the most occurrences of "ExceededTimeLimit: Operation timed out". Thank you, Jason |
| Comment by Kelsey Schubert [ 07/Mar/17 ] | |||||||||||||||||||||
|
Yes, that's correct: diagnostic.data does not contain any user data. Please feel free to review the source code to confirm. I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted. Kind regards, |
| Comment by Jason Terpko [ 07/Mar/17 ] | |||||||||||||||||||||
|
Hi Thomas, Thank you for assisting. Can you confirm that the contents of diagnostic.data only include the following data?
Can you also please provide a secure method for uploading these files? One that is not accessible by the public. Jason | |||||||||||||||||||||
| Comment by Kelsey Schubert [ 07/Mar/17 ] | |||||||||||||||||||||
|
Hi jason_or, Would you please upload an archive of the diagnostic.data directory from a primary node in the sharded cluster and identify when this issue was affecting it? Thank you, |