[SERVER-26386] Application Response Times Skyrocket After Switching To A Freshly Upgraded 3.2.9 Primary Node - Cursor Exhaustion Spotted Created: 29/Sep/16  Updated: 20/Nov/16  Resolved: 20/Nov/16

Status: Closed
Project: Core Server
Component/s: Concurrency, WiredTiger
Affects Version/s: 3.2.9
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Avraham Kalvo Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

We are running on Debian 7


Attachments: File 3012_member.log.tar.gz     File 3012_member_server_status.out     File 3210_primary_cursor_exahustion_log.tar.gz     File 329_member.log.tar.gz     File 329_member_diagnostic_data.tar.gz     File 329_member_server_status.out     File diagnostic.data.tar.gz     PNG File s26386.png    
Issue Links:
Related
is related to SERVER-27132 Mongo response time increased in 15-2... Closed
Operating System: ALL
Steps To Reproduce:

Upgrade primary to 3.2.9
Configure WT cache to a predefined value

Participants:

 Description   

We have upgraded a production replica set from 3.0.12 to 3.2.9, secondary members first, leaving the last two members (the primary and its active datacenter fallback) for the end of the process.

The other day we changed priorities between the aforementioned two remaining 3.0.12 members, upgraded one to 3.2.9, and then set it back as the primary node, leaving us with one secondary member still on 3.0.12.
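
A priority swap like this can be scripted roughly as follows (a minimal pymongo sketch only; the connection host, port, and priority values are placeholders, not our exact configuration):

    # Sketch only: swap member priorities via replSetGetConfig / replSetReconfig.
    # The connection host, port, and priority values are hypothetical placeholders.
    from pymongo import MongoClient

    client = MongoClient("mongodb://current-primary:27017")     # run against the current primary
    cfg = client.admin.command("replSetGetConfig")["config"]

    for member in cfg["members"]:
        if member["host"] == "172.16.144.3:27017":              # node that should become primary
            member["priority"] = 2
        elif member["host"] == "172.16.23.3:27017":             # node that should step down
            member["priority"] = 1

    cfg["version"] += 1                                         # a reconfig requires a version bump
    client.admin.command("replSetReconfig", cfg)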

Once that newly upgraded 3.2.9 host (172.16.144.3) became primary again (after a DB restart, to have the new binaries apply), the applications using this replica set saw their performance and response times deteriorate rapidly.

The node's log file was showing "cursorExhausted:1" for almost every query logged, which was not occurring on any of the other members.
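
To quantify how widespread this was, something along these lines can tally "cursorExhausted:1" log lines per minute (a sketch only; the log path is a placeholder, not one of the attached files):

    # Sketch: count "cursorExhausted:1" occurrences per minute in a mongod log.
    # The log path below is a placeholder.
    from collections import Counter

    counts = Counter()
    with open("/var/log/mongodb/mongod.log") as log:
        for line in log:
            if "cursorExhausted:1" in line:
                counts[line[:16]] += 1      # 3.2 log lines start with an ISO-8601 timestamp;
                                            # the first 16 characters cover year through minute
    for minute, total in sorted(counts.items()):
        print(minute, total)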

When we examined serverStatus for cursor metrics, the number of timed-out cursors was rising gradually, while the number of pinned cursors was around 100 and the number of noTimeout cursors was around 70.
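
For reference, the metrics we were watching can be pulled with something like the following (a minimal pymongo sketch; the connection string is a placeholder):

    # Sketch: read cursor metrics from serverStatus (field names as in MongoDB 3.0/3.2).
    from pymongo import MongoClient

    client = MongoClient("mongodb://172.16.144.3:27017")   # placeholder connection string
    cursors = client.admin.command("serverStatus")["metrics"]["cursor"]

    print("timed out: ", cursors["timedOut"])
    print("pinned:    ", cursors["open"]["pinned"])
    print("noTimeout: ", cursors["open"]["noTimeout"])
    print("open total:", cursors["open"]["total"])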

One thing that was changed, apart from upgrading this primary node to 3.2.9, was to explicitly set its maximum configured WT cache size to 36 GB (based on the new 3.2 WiredTiger rule of thumb of 60% of physical RAM minus 1 GB).
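
For example, on a host with 64 GB of physical RAM (a hypothetical figure - our actual memory size isn't listed here), that rule of thumb works out to roughly 0.6 × 64 − 1 ≈ 37.4 GB, which is in the same ballpark as the 36 GB we configured via cacheSizeGB.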

We suspected that the root cause of the cursor exhaustion and performance drop was the fact that the recently upgraded 3.2.9 node had just been restarted, and so had a completely "cold" cache when heavy application traffic started performing reads and writes. Shortly afterwards we fell back to the remaining 3.0.12 secondary (172.16.23.3) as primary, which resulted in an immediate, significant performance improvement and put a stop to the cursor exhaustion.

Please note that on the 3.0.12 node we had no explicit cacheSizeGB configuration and left it to the default WiredTiger behaviour.

After the primary role was moved back to the last remaining 3.0.12 node, we decided to leave the 3.2.9 node that had failed to take the load as primary (172.16.144.3) to "pre-heat" its cache on query traffic (about 600 read statements per second), and gave it another go as primary the day after, without restarting it since it was already on 3.2.9. Yet the same cursor exhaustion and massive application performance drop occurred again, forcing yet another fallback to the 3.0.12 node as primary, which once more brought things back to how they were before the change.
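
The "pre-heating" above simply meant letting live read traffic hit the node as a secondary; a deliberate warm-up could also be scripted along these lines (a sketch only - the database/collection split of "lists.items-Posts" is an assumption, and the connection string is a placeholder):

    # Sketch: warm a freshly restarted member's WT cache by streaming hot reads
    # through it before it is promoted to primary.
    # The database/collection split of "lists.items-Posts" is assumed.
    from pymongo import MongoClient

    client = MongoClient("mongodb://172.16.144.3:27017",
                         readPreference="secondaryPreferred")   # read directly from this member
    coll = client["lists"]["items-Posts"]

    touched = 0
    for _ in coll.find({}).batch_size(1000):    # pull documents/pages into the cache
        touched += 1
    print("documents touched:", touched)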

Enclosed are log files from both the 3.2.9 and 3.0.12 nodes. Notice how the same queries generate different outcomes in terms of cursor exhaustion even though they share the same execution plans; one extremely popular query is the find on lists.items-Posts in its different permutations:

329_member.log.tar.gz: for host 172.16.144.3. I've omitted the biggest log (1.5 GB, covering Sep 29 09:30 to 12:43), as the other logs also contain the symptoms reported here.

3012_member.log.tar.gz: for host 172.16.23.3

329_member_server_status.out
3012_member_server_status.out

We have other applications based on other replica sets which are fully upgraded to 3.2.9 (including the primary, of course) and which do not display this behaviour - so this could very much have to do with the way the application driver is set up, or simply with how the application is written (or a mix of both...).
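
For context, these are the kinds of driver-side knobs (shown here with pymongo) that influence whether a query exhausts its cursor in the first batch or keeps it open across batches; they are purely illustrative and not confirmed settings of our application:

    # Illustrative pymongo cursor options; not confirmed settings of our application.
    from pymongo import MongoClient

    coll = MongoClient("mongodb://app-host:27017")["lists"]["items-Posts"]   # placeholder URI

    # A result set small enough to be returned in a single batch is closed in the
    # same operation - such queries are logged with cursorExhausted:1.
    docs = list(coll.find({"ownerId": 42}).limit(100))          # hypothetical filter

    # A cursor opened with no_cursor_timeout=True is exempt from the 10-minute idle
    # timeout and counts towards "noTimeout" in serverStatus while it stays open.
    cursor = coll.find({}, no_cursor_timeout=True).batch_size(500)
    for doc in cursor:
        pass                                                    # process each document
    cursor.close()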

Kindly assist in analysing why this behaviour occurs and recommend methods to overcome it.
This is quite urgent for us, as we wish to have it resolved by the end of next week.

Many thanks in advance,
Avi K, DBA
WiX.COM



 Comments   
Comment by Avraham Kalvo [ 20/Nov/16 ]

You may close this service request now,
I've opened a new ticket regarding performance:
SERVER-27132
Mongo response time increased in 15-20% after upgrade to 3.2.10

Thanks,
Avi K

Comment by Avraham Kalvo [ 15/Nov/16 ]

Many thanks,
Will do and update here - so you can close this one.

Comment by Kelsey Schubert [ 15/Nov/16 ]

Hi avrahamk,

Would you please open a new ticket with the diagnostic.data and more details about what you are observing?

Thank you,
Thomas

Comment by Avraham Kalvo [ 15/Nov/16 ]

Thanks Thomas,

Actually we have.
I'm currently investigating a specific case that exhibits performance degradation after upgrading to 3.2.10.

May I share it here, or should I log a new ticket?

Thanks,
Avi K.

Comment by Kelsey Schubert [ 14/Nov/16 ]

Hi avrahamk,

I wanted to check in regarding your recent performance. Have you encountered any performance issues since your upgrade to MongoDB 3.2.10?

Thanks,
Thomas

Comment by Avraham Kalvo [ 05/Oct/16 ]

diagnostic data directory for newly upgraded primary instance

Comment by Avraham Kalvo [ 05/Oct/16 ]

log file for newly upgraded instance

Comment by Avraham Kalvo [ 05/Oct/16 ]

Thanks Thomas,

We've upgraded the replica set to 3.2.10, leaving just one member on 3.0.12 (as a fallback),
and switched the primary to the newly upgraded node.

The previous symptoms in terms of performance and user-experience drops have not recurred so far,
yet signs of widespread cursor exhaustion are still visible when examining the instance log file (enclosed).

Since such cursor exhaustion was not apparent on this instance before it was upgraded from 3.0.12 to 3.2.9 and then 3.2.10, we are concerned that this might gradually build up to a denial-of-service state like the one we experienced, more rapidly, last week - which is what eventually led to the opening of this support ticket.

Kindly review the new log file enclosed here, as well as the new diagnostic directory for this instance.

Bear in mind this member was again restarted upon upgrade.

Many thanks,
Avi Kalvo
WiX.COM

Comment by Kelsey Schubert [ 01/Oct/16 ]

Hi avrahamk,

1. Would you recommend removing the WT cache limit configuration for a host where a single instance is running? Might this help eliminate cursor exhaustion?

No, we often see that increasing the WT cache limit has a negative performance impact. There is less space for the filesystem cache, and if the WT cache fills up, the same eviction issues are likely to manifest as before.
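
If it helps, cache utilisation relative to the configured maximum can be watched with something like this (a sketch; the connection string is a placeholder):

    # Sketch: compare WT cache usage against its configured maximum via serverStatus.
    from pymongo import MongoClient

    client = MongoClient("mongodb://172.16.144.3:27017")   # placeholder connection string
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

    used = cache["bytes currently in the cache"]
    limit = cache["maximum bytes configured"]
    print("WT cache fill: %.1f%% (%d of %d bytes)" % (100.0 * used / limit, used, limit))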

2. What specific patterns in the workload have caused the unoptimised WT handling? Are there things in the app workload that we can change/fix/improve in order to improve the situation to begin with? Can you pinpoint specific statements/configurations that we can try to avoid/bypass?

You can read about the improvements included in MongoDB 3.2.10 in this comment. If this issue is still present after upgrading to MongoDB 3.2.10, we'll continue to investigate the root cause.

Please note that a general question about possible optimizations may be best suited for the mongodb-users group, which provides a space for MongoDB-related support.

3. When is 3.2.10 supposed to be released?

It's available now: see our downloads page. Please upgrade to MongoDB 3.2.10, and let us know if it has resolved the issue.

Thank you,
Thomas

Comment by Avraham Kalvo [ 01/Oct/16 ]

Thanks for your prompt response Thomas,
Three things I have to say in light of your latest feedback; please address each:

1. Would you recommend removing the WT cache limit configuration for a host where a single instance is running? Might this help eliminate cursor exhaustion?

2. What specific patterns in the workload have caused the unoptimised WT handling? Are there things in the app workload that we can change/fix/improve in order to improve the situation to begin with? Can you pinpoint specific statements/configurations that we can try to avoid/bypass?

3. When is 3.2.10 supposed to be released?

Thanks,
Avi K
WiX

Comment by Kelsey Schubert [ 30/Sep/16 ]

Hi avrahamk,

We have analyzed the log files you uploaded, and it appears as though the cause is that WiredTiger cache management is not able to keep up with the workload. We have been working hard on improving the efficiency of WiredTiger cache management since the 3.2.9 release. The particular characteristics of the data you uploaded don't match workloads we have been optimizing, but there is still a good chance that the upcoming 3.2.10 release will fix the performance degradation you are seeing. Would you please try upgrading again once the new release is available?

Thank you,
Thomas

Comment by Avraham Kalvo [ 29/Sep/16 ]

Enclosed please find, as per your request.

Comment by Kelsey Schubert [ 29/Sep/16 ]

Hi avrahamk,

Thanks for opening this ticket. We are investigating the issue. To help, would you please provide an archive of the diagnostic.data directory for the 3.2.9 node?

Kind regards,
Thomas
