Priority: Critical - P2
Resolution: Gone away
Affects Version/s: 3.2.9
Fix Version/s: None
Environment: We are running on Debian 7
We upgraded a production replica set from 3.0.12 to 3.2.9, secondary members first, leaving the last two members (the primary and its active datacenter fallback) to the end of the process.
Recently we swapped priorities between those two remaining 3.0.12 members, upgraded one of them to 3.2.9 and then set it back as the primary node, leaving one remaining 3.0.12 secondary.
Once the newly upgraded 3.2.9 host (172.16.144.3) became primary again (after a DB restart so the new binaries would apply),
applications using this replica set saw their performance and response times deteriorate rapidly.
The node's log file showed "cursorExhausted:1" for almost every logged query,
which was not occurring on any of the other members.
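For reference, this is roughly how we quantified the symptom's prevalence in a log file; the script below synthesises a three-line sample log (the file name and log lines are illustrative placeholders, not taken from the attached logs), and in practice you would point it at the real mongod log:

```shell
#!/bin/sh
# Illustrative only: count slow-query log lines flagged cursorExhausted:1.
# sample_mongod.log is a synthetic placeholder; use the real mongod log path.
LOG=sample_mongod.log
cat > "$LOG" <<'EOF'
2016-09-29T10:00:01 I COMMAND [conn1] query test.items planSummary: IXSCAN cursorExhausted:1 1ms
2016-09-29T10:00:02 I COMMAND [conn2] query test.items planSummary: IXSCAN 1ms
2016-09-29T10:00:03 I COMMAND [conn3] query test.items planSummary: IXSCAN cursorExhausted:1 2ms
EOF
grep -c 'cursorExhausted:1' "$LOG"   # counts 2 of the 3 sample lines
```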
When we examined serverStatus cursor metrics, the number of timed-out cursors was rising steadily, while the number of pinned cursors hovered around 100 and the number of no-timeout cursors around 70.
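For reference, these are the counters we were watching, readable from the mongo shell against the affected member (a sketch; the printed grouping is ours, the field names are from serverStatus):

```javascript
// Run in the mongo shell connected to the affected member.
// metrics.cursor counters are cumulative since mongod startup.
var c = db.serverStatus().metrics.cursor;
printjson({
    timedOut: c.timedOut,            // was rising steadily
    openPinned: c.open.pinned,       // was around 100
    openNoTimeout: c.open.noTimeout, // was around 70
    openTotal: c.open.total
});
```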
The only other change apart from upgrading this primary node to 3.2.9 was explicitly setting its maximum configured WiredTiger cache size to 36 GB (based on the new 3.2 WiredTiger rule of thumb of 60% of physical RAM minus 1 GB).
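For clarity, that configuration change amounts to the following mongod.conf excerpt (YAML format; the exact surrounding settings on the node may differ):

```yaml
# mongod.conf excerpt on 172.16.144.3 after the 3.2.9 upgrade.
# 36 GB ~= 60% of the host's physical RAM minus 1 GB (the 3.2 default formula).
storage:
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 36
```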
We suspected the root cause of the cursor exhaustion and performance drop was that the recently upgraded 3.2.9 node had just been restarted, and therefore had an empty, completely "cold" cache when heavy application read and write traffic started. So shortly afterwards we fell back to the remaining 3.0.12 secondary (172.16.23.3) as primary, which produced an immediate, significant performance improvement and an end to the cursor exhaustion.
Please note that on the 3.0.12 node we had no explicit cacheSizeGB configuration but left the cache size to WiredTiger's default behaviour.
After the primary role was moved back to the last remaining 3.0.12 node, we decided to leave the 3.2.9 node that had failed to take the load as primary (172.16.144.3) to "pre-heat" its cache on query traffic (about 600 read statements per second), and gave it another go as primary the day after (without restarting it, as it was already on version 3.2.9). Yet the same cursor exhaustion and massive application performance drop occurred again, forcing another fallback to the 3.0.12 node as primary, which once again brought things back to where they were before the change.
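The failovers described above were driven by adjusting member priorities and stepping the primary down, roughly as follows (a sketch from the mongo shell on the current primary; the member array indexes are illustrative assumptions, not our actual configuration):

```javascript
// On the current primary: favour the 3.0.12 node, then step down.
cfg = rs.conf();
cfg.members[0].priority = 2;   // 172.16.23.3, the 3.0.12 node (index illustrative)
cfg.members[1].priority = 1;   // 172.16.144.3, the 3.2.9 node (index illustrative)
rs.reconfig(cfg);
rs.stepDown(60);               // relinquish primary for 60 seconds to let the election settle
```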
Enclosed are log files from both the 3.2.9 and 3.0.12 nodes. Notice how the same queries produce different outcomes in terms of cursor exhaustion even though they share the same execution plans; one extremely popular query is the find on lists.items-Posts in its different permutations:
329_member.log.tar.gz: for host 172.16.144.3. I've omitted the biggest log (1.5 GB, covering Sep 29 09:30 to 12:43) since the other logs also contain the symptoms reported here.
3012_member.log.tar.gz: for host 172.16.23.3
We have other applications based on other replica sets which are fully upgraded to 3.2.9 (including the primary, of course) and which do not display this behaviour, so this could very well have to do with how the application driver is set up, or simply with how the application is written (or a mix of both...).
Please assist in analysing why this behaviour occurs and recommend methods to overcome it.
This is quite urgent for us as we wish to have it completed by the end of next week.
Many thanks in advance,
Avi K, DBA