Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.2.12
Component/s: Performance
Labels: None

Operating System: ALL
Confidence Status: None
Work Order: 3
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Environment: CentOS 6 (kernel 2.6.32), MongoDB community edition 3.0.11 and 3.2.12, storage engine MMAP, config servers on SCCC

We are experiencing performance degradation when moving from 3.0.11 to 3.2.12: application throughput drops by a factor of 5-10 on 3.2.12 compared to 3.0.11. We had previously attempted to upgrade from 3.0.11 to 3.2.8. Application throughput was fine on 3.2.8, but the mongos processes were randomly crashing due to https://jira.mongodb.org/browse/SERVER-26159, so we rolled back to 3.0.11. SERVER-26159 was fixed in 3.2.10, so we attempted that upgrade, but our throughput dropped and we rolled back to 3.0.11 again. We opened SERVER-26654 about this issue (several other people reported almost the same problem), and according to that ticket it was fixed in 3.2.12. We then attempted the upgrade to 3.2.12 but saw the same performance degradation as with the 3.2.10 upgrade.

The issue we are seeing in the logs after increasing the verbosity from 1 to 2 is the following:

I ASIO    [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to (node) - ExceededTimeLimit: Operation timed out
D ASIO  [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to execute command: RemoteCommand 23628777 -- target:(node) db:admin cmd:{ isMaster: 1 } reason: ExceededTimeLimit: Operation timed out

The isMaster command times out on different "TaskExecutorPool" pools all the time.
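
To see how these timeouts are spread across the pools, something like the following could tally the "Failed to connect" lines per TaskExecutorPool. This is only a rough sketch: the function name `count_timeouts_by_pool` and the regular expression are my own, and it assumes every relevant log line has the same shape as the sample lines above.

```python
import re
from collections import Counter

# Thread name as it appears in the sample log lines, e.g.
# [NetworkInterfaceASIO-TaskExecutorPool-2-0]; the first number is taken
# to be the pool index (an assumption based on the sample lines only).
POOL_RE = re.compile(r"\[NetworkInterfaceASIO-TaskExecutorPool-(\d+)-\d+\]")


def count_timeouts_by_pool(lines):
    """Count 'Failed to connect ... ExceededTimeLimit' lines per pool index."""
    counts = Counter()
    for line in lines:
        # Only count connection failures caused by a timeout.
        if "Failed to connect to" not in line or "ExceededTimeLimit" not in line:
            continue
        match = POOL_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

With a real mongos log this could be called as `count_timeouts_by_pool(open(path))`, where `path` is wherever the mongos log lives; a skew towards some pool indexes, or a count that grows with the number of pools, would match what we see in the benchmark.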

Note: I am not changing the "protocolVersion" to 1 after the 3.0.11 to 3.2.12 upgrade, as it makes the rollback harder.

We managed to reproduce the issue with sysbench-mongodb using 3.2.12 on a 10-node sharded cluster, though not at the scale we are seeing on our production system.

To remedy the issue in testing, we changed the taskExecutorPoolSize value:

Our mongos has 6 CPUs, so I assume it creates 6 connection pools by default. Using a smaller value such as "taskExecutorPoolSize"=2 reduces the timeouts, so it seems that the more connection pools I use, the more timeouts I get during the benchmark. When I set "taskExecutorPoolSize"=1, which I believe creates a single connection pool, I do not get the above timeouts.

We also raised ShardingTaskExecutorPoolRefreshTimeoutMS from the default 20 seconds to 60 seconds, which also eliminated the timeouts in testing.

We combined both settings in production, but unfortunately the timeouts did not go away and we still saw the same performance degradation.

setParameter:
 ShardingTaskExecutorPoolRefreshTimeoutMS: 60000
 taskExecutorPoolSize: 1

I want to believe that it is not our workload that is triggering the performance degradation, as it operates fine on 3.0.11.

The purpose of the thread is to understand what has changed between 3.2.8 and 3.2.12 that might trigger the isMaster request to fail between mongod and mongos.

It would be much appreciated if anyone has insight into the internals of the change, or is facing the same problem and has found a workaround.

Thanks in advance,

Jason

Assignee: Kelsey Schubert
Reporter: Jason Terpko
Participants: Andrew Morrow, Antonis Giannopoulos, Dharshan Rangegowda, Jason Terpko, Jon Hyman, Kelsey Schubert, Mira Carey, Samantha Ritter
Votes: 3
Watchers: 19

Created: Mar 07 2017 04:58:56 PM UTC
Updated: Jan 08 2024 03:23:05 PM UTC
Resolved: Nov 27 2017 09:25:48 PM UTC
