[SERVER-36956] Replace the statistic that dynamically resizes the snapshot history window Created: 31/Aug/18 Updated: 29/Oct/23 Resolved: 17/May/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.12 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Dianna Hohensee (Inactive) | Assignee: | Dianna Hohensee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | nyc |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Storage NYC 2019-05-20 |
| Participants: | |
| Linked BF Score: | 45 |
| Description |
It was suggested to replace WT_STAT_CONN_CACHE_LOOKASIDE_SCORE with CacheStat('cache_lookaside_insert', 'lookaside table insert calls'), |
| Comments |
| Comment by Githook User [ 09/Jul/19 ] |
Author: Dianna Hohensee (DiannaHohensee) <dianna.hohensee@10gen.com>
Message: Revert " This reverts commit 8899b34e1044b08aec7ad9f8546652456472702c. (cherry picked from commit 8bb53a07a5c593d85b6229a2afe096b3e1efe21d)
| Comment by Githook User [ 01/Jul/19 ] |
Author: Dianna Hohensee (DiannaHohensee) <dianna.hohensee@10gen.com>
Message: Revert " This reverts commit 8899b34e1044b08aec7ad9f8546652456472702c.
| Comment by Dianna Hohensee (Inactive) [ 20/May/19 ] |
alexander.gorrod, regarding a workload to exercise the problem: I don't think we have any good ones. I encountered problems with a perf workload I wrote a little less than a year ago, and there's the sharding suite that was failing (the BF linked to this ticket). In both cases, the cache pressure calculation issues were uncovered by adding extra logging of the score's value whenever it was read. The score fluctuates too much, for reasons outside our control, to produce a more direct repro. I believe Keith is familiar with how the lookaside score operates. There are two problems. First, the score can sit at 60, say, and never reach 100 to trigger eviction. Second, we don't reset it after eviction, so we get stuck even after cache pressure recedes. MongoDB also has to build something specialized on top of whatever WT provides, so there isn't a workload that fails today and would simply start working after a WT change, unless we first built something on top of WT while already knowing what WT was going to build.
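The two failure modes above can be illustrated with a toy model. This is a hypothetical sketch, not MongoDB's snapshot-window code: the functions `adjust_window` and `simulate`, the one-second step, and the window bounds are assumptions for illustration; only the trigger value of 100 comes from the comment.

```python
def adjust_window(window_secs, score, trigger=100, step=1, min_secs=0, max_secs=100):
    """Hypothetical score-only policy: shrink the history window by `step`
    while the lookaside score is at or above `trigger`, otherwise grow it."""
    if score >= trigger:
        return max(min_secs, window_secs - step)
    return min(max_secs, window_secs + step)


def simulate(window_secs, scores, **policy):
    """Apply the policy once per observed score and return the final window."""
    for score in scores:
        window_secs = adjust_window(window_secs, score, **policy)
    return window_secs


# Failure mode 1: the cache is under pressure, but the score plateaus at 60 and
# never reaches the trigger, so the window keeps growing instead of shrinking.
print(simulate(5, [60] * 10))   # 15

# Failure mode 2: the score reached 100 and is never reset after eviction
# relieves the pressure, so the window stays pinned at its minimum.
print(simulate(5, [100] * 10))  # 0
```

In both cases the policy is only as good as the score it reads, which is the point being made about relying on a single statistic whose behavior MongoDB does not control.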
| Comment by Dianna Hohensee (Inactive) [ 17/May/19 ] |
alexander.gorrod, I think a statistic reporting the actual percentage of cache in use would be useful. By watching that, and perhaps the eviction thresholds, we could more finely control the size of the history window between the stable and oldest timestamps so as not to cause cache pressure, or at least log a signal that the user needs more cache space. I think we have something that will work for v4.2, unless I hear otherwise from sharding or drivers. However, I wouldn't want any WT work to be done unless we had a plan for how to use it to better control the history window, particularly given the cache changes WT is introducing in v4.4 for longer-running transactions.
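A minimal sketch of the control loop suggested above, assuming a statistic that reports bytes in cache and the configured maximum, and assuming the comparison threshold is an eviction setting such as WiredTiger's `eviction_trigger` percentage. The additive-increase/multiplicative-decrease policy, the function names, and all default values are hypothetical choices, not a MongoDB or WiredTiger API.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("snapshot_window_sketch")


def cache_usage_percent(bytes_in_cache, max_cache_bytes):
    """Percentage of the configured cache size currently in use."""
    return 100.0 * bytes_in_cache / max_cache_bytes


def next_window(window_secs, usage_pct, eviction_trigger_pct,
                decrease_factor=0.5, increase_secs=1.0,
                min_secs=0.5, max_secs=5.0):
    """One hypothetical AIMD step for the target history window (seconds between
    the oldest and stable timestamps): halve it while cache usage is at or above
    the eviction threshold, otherwise grow it by a fixed step up to max_secs."""
    if usage_pct >= eviction_trigger_pct:
        if window_secs <= min_secs:
            # Nothing left to give back: tell the user the cache is too small.
            log.warning("cache usage %.1f%% >= %.1f%% with the minimum history "
                        "window; consider a larger cache", usage_pct, eviction_trigger_pct)
            return min_secs
        return max(min_secs, window_secs * decrease_factor)
    return min(max_secs, window_secs + increase_secs)


usage = cache_usage_percent(9_600, 10_000)   # 96.0
print(next_window(4.0, usage, 95.0))          # 2.0: back off under pressure
print(next_window(4.0, 50.0, 95.0))           # 5.0: grow back, capped at max_secs
print(next_window(0.5, usage, 95.0))          # 0.5, and a warning is logged
```

Because cache usage is a current gauge rather than an accumulated score, it drops as soon as eviction frees space, so the window can grow back once pressure recedes; the trade-off is that the thresholds and step sizes have to be tuned by the caller.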
| Comment by Dianna Hohensee (Inactive) [ 17/May/19 ] |
bruce.lucas@mongodb.com, I removed two of the serverStatus.wiredtiger.snapshot-window-settings fields and added two new fields in this patch. The snapshot-window-settings section was introduced back in |
| Comment by Githook User [ 17/May/19 ] |
Author: Dianna (DiannaHohensee) <dianna.hohensee@10gen.com>
Message:
| Comment by Alexander Gorrod [ 15/May/19 ] |
dianna.hohensee, the intent of the cache lookaside score is to indicate cache pressure caused by history requirements. If MongoDB is encountering cases where the score doesn't track that situation effectively, I'd prefer to improve the lookaside score calculation in WiredTiger rather than look for a solution based on different heuristics. Could you provide a workload, or set of workloads, where the lookaside score isn't behaving as desired, so we can understand why and improve it on the WiredTiger side?
| Comment by Dianna Hohensee (Inactive) [ 14/May/19 ] |
louis.williams, I imagine the score could also stay at 60, say, and never reach 100. In that case we would be stuck in the same way. The metric is not reliable for our purposes.
| Comment by Louis Williams [ 14/May/19 ] |
alexander.gorrod, the issue described by
| Comment by Alexander Gorrod [ 10/May/19 ] |
The reason I would recommend using WT_STAT_CONN_CACHE_LOOKASIDE_SCORE is that it is a leading indicator: it should grow as it becomes more likely that WiredTiger will begin using cache overflow. By contrast, WT_STAT_CONN_CACHE_LOOKASIDE_INSERT only has a meaningful value once we have already started using lookaside. The insert statistic is also a counter, so you'll need to track its change over time and calculate a running insert rate. I wouldn't recommend using WT_STAT_CONN_CACHE_LOOKASIDE_ENTRIES, because it could be heavily skewed by earlier activity. If the lookaside score isn't useful, I'd recommend implementing something that combines a check for whether timestamps are pinned (there are WiredTiger statistics that report that) with dirty cache usage as a proportion of the allowed dirty cache. You can use the cache_bytes_dirty statistic to work out what proportion of the cache is dirty by comparing it with the configured maximum cache size and with the proportion of that size which is allowed to be dirty, controlled by the eviction_dirty_trigger configuration setting.
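The alternative signals described above could be computed roughly as follows. This is a sketch under stated assumptions: statistic values are passed in as plain numbers rather than read from WiredTiger, the `timestamps_pinned` flag stands in for whichever WiredTiger statistic reports pinned timestamps, and the names `CounterRate`, `dirty_fraction_of_allowed`, `history_pressure`, and the 0.8 threshold are hypothetical.

```python
class CounterRate:
    """Turns a monotonically increasing counter (such as lookaside insert calls)
    into a per-second rate between consecutive samples."""

    def __init__(self):
        self._last = None  # (sample time in seconds, counter value)

    def update(self, now_secs, value):
        """Record a sample; return the rate since the previous one, or None if
        there is no usable previous sample (first call, clock did not advance,
        or the counter went backwards, e.g. after a restart)."""
        prev, self._last = self._last, (now_secs, value)
        if prev is None:
            return None
        prev_secs, prev_value = prev
        if now_secs <= prev_secs or value < prev_value:
            return None
        return (value - prev_value) / (now_secs - prev_secs)


def dirty_fraction_of_allowed(cache_bytes_dirty, max_cache_bytes, eviction_dirty_trigger_pct):
    """Dirty bytes as a fraction of the dirty bytes allowed before the dirty
    eviction trigger (a percentage of the maximum cache size) is reached."""
    allowed_dirty_bytes = max_cache_bytes * eviction_dirty_trigger_pct / 100.0
    return cache_bytes_dirty / allowed_dirty_bytes


def history_pressure(timestamps_pinned, dirty_fraction, threshold=0.8):
    """Hypothetical combined signal: history is pinned AND dirty cache is near its limit."""
    return timestamps_pinned and dirty_fraction >= threshold


rate = CounterRate()
print(rate.update(0.0, 100))          # None: first sample
print(rate.update(2.0, 300))          # 100.0 inserts per second
frac = dirty_fraction_of_allowed(1500, 10_000, 20)
print(frac)                           # 0.75
print(history_pressure(True, frac))   # False: below the 0.8 threshold
```

Requiring both conditions is meant to avoid reacting to dirty-cache growth that has nothing to do with retained history; whether 0.8 is a sensible threshold would need to be established against real workloads.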