Type: Improvement
Resolution: Done
Priority: Critical - P2
Fix Version/s: 3.2.8, 3.3.10
Affects Version/s: None
Component/s: WiredTiger
Labels: code-only
Backwards Compatibility: Fully Compatible
Backport Completed: 3.2.8

When cache utilization hits 95%, performance falls off a cliff, severely impacting production.

If the solution to this isn't to (gently) keep utilization from hitting 95%, then do we need to look at why getting application threads involved in evictions at 95% is so impactful? Note that in the incident on the primary that bruce.lucas analyzed, it appeared to me that the shortfall between the evictions required to keep the cache steady and the actual evictions was only 0.5%, yet the impact on operation rates and latencies of getting application threads involved in evictions seemed far out of proportion to the shortfall that they had to make up.

If, on the other hand, evictions are really so fundamentally difficult that increasing the eviction rate by 0.5% is hard, does it make sense to look at it from the other end: throttling application threads by the 0.5% required (in this example) to make up the shortfall, by very slightly reducing the rate at which pages are read into cache? A similar analysis of the lag incident on the secondary showed that the shortfall was about 9%, yet making up that shortfall when the cache hit 95% utilization nearly brought replication to a halt for extended periods.
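
To make the arithmetic above concrete, here is a rough sketch of the shortfall calculation. It is not WiredTiger code: the function names and the page rates are hypothetical, chosen only to reproduce the 0.5% and 9% figures quoted above, and it assumes the evictions required to keep the cache steady equal the rate at which pages are read in.

```python
def shortfall_fraction(required_evictions: float, actual_evictions: float) -> float:
    """Fraction by which actual evictions fall short of what a steady cache needs."""
    if required_evictions <= 0:
        return 0.0
    return max(0.0, (required_evictions - actual_evictions) / required_evictions)


def throttled_read_rate(read_rate: float, shortfall: float) -> float:
    """Read-in rate after slowing application threads by the shortfall fraction."""
    return read_rate * (1.0 - shortfall)


# Hypothetical pages-per-second figures (assumptions, not measurements).
primary = shortfall_fraction(required_evictions=1000.0, actual_evictions=995.0)
secondary = shortfall_fraction(required_evictions=1000.0, actual_evictions=910.0)

# Throttling reads by the shortfall brings the read-in rate down to the
# eviction rate, so cache utilization stops growing.
primary_reads = throttled_read_rate(1000.0, primary)
secondary_reads = throttled_read_rate(1000.0, secondary)
```

Under these assumptions, throttling reads by the shortfall fraction is enough to hold the cache steady without asking application threads to evict at all, which is the alternative the description proposes.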

Attachments:
- 18-second-gap.png, 296 kB, Jun 16 2016 08:03:06 PM UTC
- cs31295.png, 212 kB, Jun 20 2016 06:19:06 PM UTC
- incident-06-12-comparison.png, 164 kB, Jun 15 2016 07:06:14 PM UTC
- incident-06-18.png, 116 kB, Jun 18 2016 08:16:45 PM UTC
- incident-06-18-server.png, 50 kB, Jun 18 2016 08:16:45 PM UTC
- incident-06-18-waiters.png, 239 kB, Jun 18 2016 08:16:45 PM UTC
- primary-transition.png, 306 kB, Jun 15 2016 08:14:55 PM UTC
- s1646-2.png, 234 kB, Jun 23 2016 07:58:53 PM UTC
- S1646patch.png, 339 kB, Jun 23 2016 03:04:32 PM UTC
- s1646-stacks.png, 141 kB, Jun 23 2016 07:58:53 PM UTC
- secondary-transition.png, 247 kB, Jun 15 2016 08:45:22 PM UTC
- server-24580-patched-recovery.png, 113 kB, Jun 24 2016 07:20:13 AM UTC
- stalls1-patched.png, 94 kB, Jun 21 2016 07:29:05 AM UTC
- stalls1-unpatched.png, 90 kB, Jun 21 2016 07:29:05 AM UTC
- stalls2-patched.png, 79 kB, Jun 21 2016 07:38:00 AM UTC
- stalls2-unpatched.png, 96 kB, Jun 21 2016 07:29:05 AM UTC

Issue Links:
depends on
- WT-2702 Under high thread load, WiredTiger exceeds cache size (Closed)
is duplicated by
- SERVER-23001 Occasional 100% cache uses cripples server (Closed)
- SERVER-24094 Server cache use can take up to an hour to recover from heavy load (Closed)
- SERVER-24139 Insert speed decrease rapidly (Closed)
- SERVER-24983 Remove method really slow (Closed)

Assignee: Michael Cahill (Inactive)
Reporter: Michael Cahill (Inactive)
Participants: Bruce Lucas, Chad Kreimendahl, George Heppner, Githook User, kimmy-github, Michael Cahill, Ramon Fernandez Marina
Votes: 5
Watchers: 57

Created: Jun 15 2016 12:00:39 AM UTC
Updated: Sep 20 2017 05:48:34 PM UTC
Resolved: Jul 05 2016 06:51:31 PM UTC
