[SERVER-26386] Application Response Times Skyrocket After Switching To A Freshly Upgraded 3.2.9 Primary Node - Cursor Exhaustion Spotted Created: 29/Sep/16 Updated: 20/Nov/16 Resolved: 20/Nov/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency, WiredTiger |
| Affects Version/s: | 3.2.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Avraham Kalvo | Assignee: | Kelsey Schubert |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
We are running on Debian 7 |
||
| Attachments: | |
||
| Issue Links: | |
||
| Operating System: | ALL |
| Steps To Reproduce: | Upgrade primary to 3.2.9 |
||
| Participants: | |
| Description |
|
We upgraded a production replica set from 3.0.12 to 3.2.9, secondary members first, leaving the last two members (the primary and its in-datacenter active fallback) for the end of the process. The other day we changed priorities between those two remaining 3.0.12 members, upgraded one to 3.2.9, and then set it back as the primary node, leaving one secondary member on 3.0.12.

Once the newly upgraded 3.2.9 host (172.16.144.3) became primary again (after a DB restart to apply the new binaries), its log file showed "cursorExhausted:1" for almost every query logged. When we examined serverStatus cursor metrics, the number of timed-out cursors was rising gradually, while the number of pinned cursors hovered around 100 and the number of no-timeout cursors around 70.

Apart from upgrading this primary node to 3.2.9, the one other change was to explicitly set its maximum configured WT cache size to 36 GB (based on the new 3.2 WiredTiger rule of thumb of 60% of free physical RAM minus 1 GB).

We suspected that the root cause of the cursor exhaustion and performance drop was that the recently upgraded 3.2.9 node had just been restarted, and so its cache was empty and completely "cold" when heavy application read and write traffic started. Shortly afterwards we fell back to the remaining 3.0.12 secondary (172.16.23.3) as primary, which resulted in an immediate, significant performance improvement and stopped the cursor exhaustion. Please note that on 3.0.12 we had no explicit cacheSizeGB configuration and left it to WiredTiger's default behaviour.

After the primary role was moved to the last remaining 3.0.12 member, we decided to leave the 3.2.9 node that had failed to take the load as primary (172.16.144.3) to "pre-heat" its cache on query traffic (about 600 read statements per second), and gave it another go as primary the day after (without restarting it, as it was already on 3.2.9). The same cursor exhaustion and massive application performance drop occurred again, forcing yet another fallback to the 3.0.12 node as primary, which again brought things back to how they were before the change.

Enclosed please find log files from both the 3.2.9 and 3.0.12 nodes. Notice how the same queries produce different outcomes in terms of cursor exhaustion even though they share the same execution plans; one extremely popular query is the find on lists.items-Posts in its different permutations.
329_member.log.tar.gz: for host 172.16.144.3. I've omitted the biggest log (1.5 GB, covering Sep 29 09:30 to 12:43) as the other logs also contain the symptoms reported here.
3012_member.log.tar.gz: for host 172.16.23.3
329_member_server_status.out

We have other applications based on other replica sets that are fully upgraded to 3.2.9 (including the primary, of course) and do not display this behaviour, so this could very well have to do with how the application driver is set up, or simply with how the application is written (or a mix of both).

Kindly assist in analysing why this behaviour occurs and recommend methods to overcome it. Many thanks in advance, |
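For reference, a minimal sketch (assuming pymongo is installed and the node listens on the default port 27017; host and credentials would need adjusting) of reading the cursor metrics described above from serverStatus:

```python
# Minimal sketch: snapshot the serverStatus cursor metrics mentioned in the
# description (timed-out, pinned, and no-timeout cursor counts).
# Assumes the default port 27017 and no authentication.
from pymongo import MongoClient

client = MongoClient("mongodb://172.16.144.3:27017", serverSelectionTimeoutMS=5000)
status = client.admin.command("serverStatus")

cursors = status["metrics"]["cursor"]
print("timed out:", cursors["timedOut"])
print("open total:", cursors["open"]["total"])
print("open pinned:", cursors["open"]["pinned"])
print("open noTimeout:", cursors["open"]["noTimeout"])
```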
| Comments |
| Comment by Avraham Kalvo [ 20/Nov/16 ] |
|
You may close this service request now, Thanks, |
| Comment by Avraham Kalvo [ 15/Nov/16 ] |
|
Many thanks, |
| Comment by Kelsey Schubert [ 15/Nov/16 ] |
|
Hi avrahamk, Would you please open a new ticket with the diagnostic.data and more details about what you are observing? Thank you, |
| Comment by Avraham Kalvo [ 15/Nov/16 ] |
|
Thanks Thomas. Actually we have, may I share it here or log it in a new ticket? Thanks, |
| Comment by Kelsey Schubert [ 14/Nov/16 ] |
|
Hi avrahamk, I wanted to check in regarding your recent performance. Have you encountered any performance issues since your upgrade to MongoDB 3.2.10? Thanks, |
| Comment by Avraham Kalvo [ 05/Oct/16 ] |
|
diagnostic data directory for newly upgraded primary instance |
| Comment by Avraham Kalvo [ 05/Oct/16 ] |
|
log file for newly upgraded instance |
| Comment by Avraham Kalvo [ 05/Oct/16 ] |
|
Thanks Thomas, We've upgraded the replica set to 3.2.10, leaving just one member on 3.0.12 (for fallback). The previous drops in performance and user experience haven't recurred so far. That said, such cursor exhaustion was not apparent for this instance before it was upgraded from 3.0.12 to 3.2.9 and then 3.2.10, and we are concerned that this might gradually build up to the state of denial of service we experienced more rapidly last week, which eventually led to the opening of this support ticket. Kindly review the new log file enclosed here, as well as the new diagnostic directory for this instance. Bear in mind this member was again restarted upon upgrade. Many thanks, |
| Comment by Kelsey Schubert [ 01/Oct/16 ] |
|
Hi avrahamk,
No, we often see that increasing the WT cache limit has a negative performance impact. There is less space for the filesystem cache, and if the WT cache fills, the same eviction issues are likely to manifest as before.
You can read about the improvements included in MongoDB 3.2.10 in this comment. If this issue is still present after upgrading to MongoDB 3.2.10, we'll continue to investigate the root cause. Please note that a general question about possible optimizations may be better suited to the mongodb-users group, which provides a space for MongoDB-related support.
It's available now: see our downloads page. Please upgrade to MongoDB 3.2.10, and let us know if it has resolved the issue. Thank you, |
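To illustrate the sizing trade-off mentioned above, a small sketch comparing the 3.2 rule-of-thumb default cache size with the explicit 36 GB setting from the description; the 64 GB of host RAM is an assumption for illustration only, not a figure from this ticket:

```python
# Illustrative sketch of the cache-sizing trade-off: the 3.2 rule of thumb is
# roughly the larger of (60% of RAM minus 1 GB) or 1 GB, and whatever is not
# given to the WT cache remains available to the filesystem cache.
def default_wt_cache_gb(total_ram_gb: float) -> float:
    return max(0.6 * total_ram_gb - 1.0, 1.0)

total_ram_gb = 64.0        # hypothetical host RAM, for illustration
explicit_cache_gb = 36.0   # the explicit cacheSizeGB set on the 3.2.9 node

print("rule-of-thumb WT cache:", default_wt_cache_gb(total_ram_gb), "GB")
print("explicit WT cache:", explicit_cache_gb, "GB")
print("left for filesystem cache:", total_ram_gb - explicit_cache_gb, "GB")
```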
| Comment by Avraham Kalvo [ 01/Oct/16 ] |
|
Thanks for your prompt response, Thomas.
1. Would you recommend removing the WT cache limit configuration for a host where a single instance is running? Might this help eliminate the cursor exhaustion?
2. What specific patterns in the workload caused the unoptimised WT handling? Are there things in the application workload that we can change/fix/improve in order to improve the situation to begin with? Can you pinpoint specific statements/configurations that we can try to avoid or bypass?
3. When is 3.2.10 supposed to be released?
Thanks, |
| Comment by Kelsey Schubert [ 30/Sep/16 ] |
|
Hi avrahamk, We have analyzed the log files you uploaded, and it appears as though the cause is that WiredTiger cache management is not able to keep up with the workload. We have been working hard on improving the efficiency of WiredTiger cache management since the 3.2.9 release. The particular characteristics of the data you uploaded don't match workloads we have been optimizing, but there is still a good chance that the upcoming 3.2.10 release will fix the performance degradation you are seeing. Would you please try upgrading again once the new release is available? Thank you, |
| Comment by Avraham Kalvo [ 29/Sep/16 ] |
|
Enclosed please find, as per your request. |
| Comment by Kelsey Schubert [ 29/Sep/16 ] |
|
Hi avrahamk, Thanks for opening this ticket. We are investigating the issue. To help, would you please provide an archive of the diagnostic.data directory for the 3.2.9 node? Kind regards, |
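For reference, a minimal sketch of producing the requested archive of the diagnostic.data directory with Python's tarfile; the dbPath shown is an assumption (diagnostic.data lives under the node's configured dbPath):

```python
# Minimal sketch: archive the diagnostic.data directory for upload.
# The dbPath below is hypothetical; use the dbPath from your mongod.conf.
import tarfile

dbpath = "/var/lib/mongodb"  # hypothetical dbPath
with tarfile.open("diagnostic.data.tar.gz", "w:gz") as archive:
    archive.add(f"{dbpath}/diagnostic.data", arcname="diagnostic.data")
```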