[SERVER-36956] Replace the statistic that dynamically resizes the snapshot history window Created: 31/Aug/18  Updated: 29/Oct/23  Resolved: 17/May/19

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 4.1.12

Type: Improvement Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Dianna Hohensee (Inactive)
Resolution: Fixed Votes: 0
Labels: nyc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-40685 Mongos often fails transactions that ... Closed
Problem/Incident
Related
related to SERVER-41244 The WT inMemory storage engine needs ... Closed
is related to SERVER-41561 Undo SERVER-36956 but keep the cache ... Closed
is related to SERVER-41472 Create a workload to demonstrate the ... Closed
Backwards Compatibility: Fully Compatible
Sprint: Storage NYC 2019-05-20
Participants:
Linked BF Score: 45

 Description   

It was suggested to replace WT_STAT_CONN_CACHE_LOOKASIDE_SCORE with CacheStat('cache_lookaside_insert', 'lookaside table insert calls'),



 Comments   
Comment by Githook User [ 09/Jul/19 ]

Author:

{'name': 'Dianna Hohensee', 'username': 'DiannaHohensee', 'email': 'dianna.hohensee@10gen.com'}

Message: Revert "SERVER-36956 SnapshotTooOld errors will always increase the snapshot history window size"

This reverts commit 8899b34e1044b08aec7ad9f8546652456472702c.

(cherry picked from commit 8bb53a07a5c593d85b6229a2afe096b3e1efe21d)
Branch: v4.2
https://github.com/mongodb/mongo/commit/1cccd2d05271c018b702bcc3e30a7516457a192a

Comment by Githook User [ 01/Jul/19 ]

Author:

{'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@10gen.com', 'username': 'DiannaHohensee'}

Message: Revert "SERVER-36956 SnapshotTooOld errors will always increase the snapshot history window size"

This reverts commit 8899b34e1044b08aec7ad9f8546652456472702c.
Branch: master
https://github.com/mongodb/mongo/commit/8bb53a07a5c593d85b6229a2afe096b3e1efe21d

Comment by Dianna Hohensee (Inactive) [ 20/May/19 ]

alexander.gorrod regarding a workload to exercise the problem, I don't think we have any good ones. I encountered problems with a perf workload I wrote a little less than a year ago, and there's the sharding suite that was failing (the BF linked to this ticket). In both cases, the cache pressure calculation issues were unearthed by adding extra logging about what the score was when accessed. The score fluctuates too much, and outside our control, to get a more direct repro.

I believe Keith is familiar with how the lookaside score operates. First, the score can just sit at 60, say, and never reach 100 to trigger eviction. Second, we don't reset it after eviction, so again we get stuck even after cache pressure recedes.

MongoDB also has to build something specialized on top of whatever WT can provide us, so there isn't really a workload that fails now and would start working with a WT change alone. That would only be possible if we first built something on top of WT while knowing what WT was going to build.

Comment by Dianna Hohensee (Inactive) [ 17/May/19 ]

alexander.gorrod I think a statistic reporting the actual percent cache usage would be useful. Then looking at that, and maybe the eviction thresholds, we could more finely control the history window size between stable and oldest timestamps so as not to cause cache pressure – or at least signal via logging that the user needs more cache space.

I think we have something that will work for v4.2 – unless I hear otherwise from sharding or drivers.

I wouldn't want any WT work to be done, however, unless we had a plan for how to use it to better control the history window. Particularly with the cache changes WT is introducing in v4.4 for longer running transactions.

Comment by Dianna Hohensee (Inactive) [ 17/May/19 ]

bruce.lucas@mongodb.com, I removed two of the serverStatus.wiredtiger.snapshot-window-settings fields and added two new fields in this patch. The snapshot-window-settings section was introduced back in SERVER-31767 (see this comment).

Comment by Githook User [ 17/May/19 ]

Author:

{'name': 'Dianna', 'email': 'dianna.hohensee@10gen.com', 'username': 'DiannaHohensee'}

Message: SERVER-36956 SnapshotTooOld errors will always increase the snapshot history window size
Branch: master
https://github.com/mongodb/mongo/commit/8899b34e1044b08aec7ad9f8546652456472702c

Comment by Alexander Gorrod [ 15/May/19 ]

dianna.hohensee The intent of the cache lookaside score is that it's an indicator for cache pressure triggered by history requirements. If MongoDB is encountering cases where the score isn't effectively tracking that situation, I'd prefer to update WiredTiger to improve the lookaside score calculation than to search for a solution based on different heuristics.

Could you provide a workload or set of workloads where the lookaside score isn't currently behaving as desired so we can understand why and improve it on the WiredTiger side?

Comment by Dianna Hohensee (Inactive) [ 14/May/19 ]

louis.williams I imagine the score could also just stay at 60, say, and never reach 100. In that case, we would similarly be stuck. The metric is not reliable for our purposes.

Comment by Louis Williams [ 14/May/19 ]

alexander.gorrod the issue described by SERVER-40685 is that the WT_STAT_CONN_CACHE_LOOKASIDE_SCORE metric gets "stuck" at a high value and never drops because eviction threads stop running after the cache pressure dies down. Would it instead be simpler for WiredTiger to change the behavior of that statistic so that it can drop down even when the eviction threads stop running?

Comment by Alexander Gorrod [ 10/May/19 ]

The reason I would recommend using WT_STAT_CONN_CACHE_LOOKASIDE_SCORE is that it is a leading indicator - it should grow high as it becomes more likely for WiredTiger to begin using cache overflow, whereas WT_STAT_CONN_CACHE_LOOKASIDE_INSERT will only have a meaningful result once we have already started using lookaside. The insert statistic is also a counter - so you'll need to track change over time and calculate a running insert rate.
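
To illustrate the point about the insert statistic being a counter, here is a minimal sketch of turning periodic counter samples into a smoothed running rate. The class name `InsertRateTracker`, the smoothing weight `alpha`, and the idea that the caller supplies its own sample timestamps are all assumptions made for illustration; none of this is taken from the MongoDB or WiredTiger source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InsertRateTracker:
    """Hypothetical helper: converts a monotonically increasing counter
    (such as a lookaside insert count) into a smoothed per-second rate."""

    alpha: float = 0.5  # assumed weight given to the newest sample
    rate: float = 0.0   # current smoothed rate, in inserts per second
    _last_count: Optional[int] = None
    _last_time: Optional[float] = None

    def observe(self, count: int, now: float) -> float:
        """Record one (counter value, timestamp in seconds) sample and
        return the updated smoothed rate."""
        if self._last_count is not None and now > self._last_time:
            # A counter that went backwards (for example after a restart)
            # is treated as having made no progress in this interval.
            delta = max(count - self._last_count, 0)
            instant = delta / (now - self._last_time)
            # Exponentially weighted moving average of the per-interval rate.
            self.rate = self.alpha * instant + (1.0 - self.alpha) * self.rate
        self._last_count = count
        self._last_time = now
        return self.rate


tracker = InsertRateTracker(alpha=0.5)
tracker.observe(100, now=0.0)    # first sample only establishes a baseline
tracker.observe(200, now=10.0)   # 100 inserts over 10 seconds
tracker.observe(200, now=20.0)   # no new inserts, so the rate decays
```

Unlike a raw counter, the smoothed rate falls back toward zero once inserts stop, which is the property a window-sizing heuristic would need.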

I wouldn't recommend using WT_STAT_CONN_CACHE_LOOKASIDE_ENTRIES - that could be heavily skewed due to earlier activity, which I don't think would be ideal.

If the lookaside score isn't useful, I'd recommend implementing something that uses a combination of checking whether timestamps are pinned (there are WiredTiger statistics that can tell you that) along with dirty cache usage as a proportion of allowed dirty cache. It's possible to use the cache_bytes_dirty statistic to figure out what proportion of the cache is dirty by comparing it to the configured maximum cache size and to the proportion of that size which can be dirty, which is controlled by the eviction_dirty_trigger configuration setting.
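
The dirty cache arithmetic described above can be sketched as follows. This assumes eviction_dirty_trigger is expressed as a percentage of the maximum cache size. The function names, the `timestamps_pinned` flag, and the 0.8 threshold are hypothetical and only illustrate how the two signals might be combined; reading the actual values out of WiredTiger statistics is left out.

```python
def dirty_cache_pressure(cache_bytes_dirty: int,
                         cache_max_bytes: int,
                         eviction_dirty_trigger_pct: float) -> float:
    """Return dirty cache usage as a fraction of the allowed dirty cache.

    Assumes eviction_dirty_trigger_pct is a percentage (0, 100] of the
    configured maximum cache size. A result of 1.0 means the dirty
    content has reached the trigger.
    """
    if cache_max_bytes <= 0:
        raise ValueError("cache_max_bytes must be positive")
    if not 0 < eviction_dirty_trigger_pct <= 100:
        raise ValueError("eviction_dirty_trigger_pct must be in (0, 100]")
    allowed_dirty_bytes = cache_max_bytes * eviction_dirty_trigger_pct / 100.0
    return cache_bytes_dirty / allowed_dirty_bytes


def history_is_causing_pressure(timestamps_pinned: bool,
                                pressure: float,
                                threshold: float = 0.8) -> bool:
    """Hypothetical combined check: pinned timestamps plus dirty usage
    at or above an assumed fraction of the allowed dirty cache."""
    return timestamps_pinned and pressure >= threshold


# 100 dirty bytes in a 1000-byte cache with a 20% trigger:
# allowed dirty is 200 bytes, so usage is half of what is allowed.
pressure = dirty_cache_pressure(100, 1000, 20)
```

Because both inputs are instantaneous values rather than accumulated scores, this signal recedes as soon as dirty content is written out or the pinned timestamps advance.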

Generated at Thu Feb 08 04:44:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.