[SERVER-47855] Change default value for minSnapshotHistoryWindowInSeconds to 5 minutes Created: 30/Apr/20  Updated: 12/Jan/24  Resolved: 19/May/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 5.0.0-rc0, 5.1.0-rc0

Type: Task Priority: Major - P3
Reporter: Lingzhi Deng Assignee: Monica Ng
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-56302 Catalog cache refresh can fail with S... Closed
Problem/Incident
Related
related to SERVER-47672 Add minSnapshotHistoryWindowInSeconds... Closed
related to SERVER-55023 Allow minSnapshotHistoryWindowInSecon... Closed
is related to SERVER-24949 Lower WiredTiger idle handle timeout ... Closed
Backwards Compatibility: Minor Change
Backport Requested:
v5.0
Sprint: Repl 2020-05-18, Storage - Ra 2021-03-08, Storage - Ra 2021-03-22, Storage - Ra 2021-04-05, Storage - Ra 2021-04-19, Storage - Ra 2021-05-17, Storage - Ra 2021-05-31
Participants:
Case:
Linked BF Score: 15
Story Points: 0

 Description   

Determine and recommend a good default for minSnapshotHistoryWindowInSeconds. We should also use this issue to better understand the behavioral impact on the system of increasing this window, and to consider a potential maximum window.
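For reference, a minimal mongo shell sketch of how this parameter can be configured. The value 300 (5 minutes) reflects the default this ticket eventually settled on; runtime adjustability without a restart is confirmed in the comments below (see the snapshot_history_window.js test).

    // Set at startup (command line):
    //   mongod --setParameter minSnapshotHistoryWindowInSeconds=300

    // Adjust at runtime on a running mongod; no restart required:
    db.adminCommand({ setParameter: 1, minSnapshotHistoryWindowInSeconds: 300 });

    // Inspect the current value:
    db.adminCommand({ getParameter: 1, minSnapshotHistoryWindowInSeconds: 1 });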



 Comments   
Comment by Githook User [ 20/May/21 ]

Author:

{'name': 'Monica Ng', 'email': 'monica.ng@mongodb.com', 'username': 'mm-ng'}

Message: SERVER-47855 Change default value for minSnapshotHistoryWindowInSeconds

(cherry picked from commit 6a1c644c1166906797f7728e16962cd90d5e14a7)
Branch: v5.0
https://github.com/mongodb/mongo/commit/75a9401252e609dc9f91f49d985d0686cb41b0ef

Comment by Jordi Serra Torrens [ 19/May/21 ]

alexander.gorrod, thanks for the heads up. A 5-minute window seems enough for us to close SERVER-56302.

Comment by Githook User [ 19/May/21 ]

Author:

{'name': 'Monica Ng', 'email': 'monica.ng@mongodb.com', 'username': 'mm-ng'}

Message: SERVER-47855 Change default value for minSnapshotHistoryWindowInSeconds
Branch: master
https://github.com/mongodb/mongo/commit/64e63f9f0a77a0e4618982663cc003a53f088997

Comment by Monica Ng [ 18/May/21 ]

PR to change the default value to 5 minutes: https://mongodbcr.appspot.com/769320002/

MongoDB Patch: https://spruce.mongodb.com/version/60a31123e3c331512efdea92/tasks

Sys Perf Patch: https://spruce.mongodb.com/version/60a311819ccd4e7800aa3cdd/tasks

Comment by A. Jesse Jiryu Davis [ 17/May/21 ]

I'm curious what the justification is for a 5-minute default. If there's little perf impact from increasing the window from 5 minutes to 30, why not make it 30? The advantage of a longer window is supporting longer-running snapshot reads, e.g. in big analytics jobs.

Comment by Alexander Gorrod [ 17/May/21 ]

We have buy-in across the product and Atlas organizations that a 5 minute window is the right value. This ticket can proceed to code review now.

kaloian.manassiev and jordi.serra-torrens, please let us know if you need anything more to unblock SERVER-56302.

Comment by Alexander Gorrod [ 13/May/21 ]

tl;dr

The sys-perf workloads show very little throughput or latency cost when extending the window of history from 5 seconds to 30 minutes. The results from this analysis don't materially influence which window of history is best for MongoDB users, especially among the currently considered default windows of between 5 and 10 minutes.

Detailed Analysis

Supporting evidence from analyzing the performance regressions in our sys-perf automated performance testing workloads follows:

There are very few throughput or latency differences between 60-second and 1800-second windows of history. There are two variants of a Genny workload, each of which creates a collection and starts 100 threads, with each thread operating on a single document in the collection, either alternately inserting and removing the document or repeatedly updating it. There is no obvious bottleneck or significant history storage requirement in those workloads. I suspect there is contention on database resources across the 100 threads, and durable history introduces a (relatively small) latency cost which is exacerbated by the contended workload; a rough sketch of the access pattern follows.
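A hedged mongo shell sketch of that access pattern (names and counts illustrative; the real workload is defined in Genny, which drives one such loop from each of 100 concurrent client threads, whereas the shell here is single-threaded):

    db.contended.drop();

    // Variant 1: alternately insert and remove the thread's document.
    function insertRemoveLoop(threadId, iterations) {
      for (let i = 0; i < iterations; i++) {
        db.contended.insertOne({ _id: threadId, v: i });
        db.contended.deleteOne({ _id: threadId });
      }
    }

    // Variant 2: repeatedly update the thread's document, accumulating
    // history that the storage engine must retain for the configured window.
    function updateLoop(threadId, iterations) {
      db.contended.insertOne({ _id: threadId, v: 0 });
      for (let i = 0; i < iterations; i++) {
        db.contended.updateOne({ _id: threadId }, { $inc: { v: 1 } });
      }
    }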

When running the Linkbench benchmark, a particular metric (add node) shows a ~20% performance regression for any window greater than 60 seconds. This cost can be explained by the additional work done in WiredTiger to store version information: the benchmark appears to be I/O bound, and the additional I/O reduces throughput. It's worth noting that this metric is one of many tracked by Linkbench; not all metrics experience regressions.

There is a particular benchmark that repeatedly does tiny updates to a very large document. That benchmark experiences a 20% throughput regression when extending the window of history. We may in the future do work in WiredTiger to mitigate that cost (it's related to writing content back to disk, and the cost tradeoff between the CPU overhead of reading/writing delta-encoded updates vs the I/O overhead of reading/writing full values). A sketch of that pattern follows.
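An illustrative mongo shell sketch of that pattern (document size and iteration count hypothetical):

    // One very large document receiving many tiny updates. Within the history
    // window the storage engine must keep version information for each update,
    // either as delta-encoded changes or as full values, hence the I/O cost.
    db.large.drop();
    db.large.insertOne({ _id: 1, payload: "x".repeat(10 * 1024 * 1024), counter: 0 });
    for (let i = 0; i < 10000; i++) {
      db.large.updateOne({ _id: 1 }, { $inc: { counter: 1 } });  // a few bytes changed
    }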

Wrapping up

Only the Linkbench regression appears to be intrinsic to the storage of history (since history adds I/O to an already I/O-bound workload). It should be possible to close the performance gap for the other measured workloads with a standard analysis and optimization process, if the access patterns tested are relevant to end users.

Comment by Alexander Gorrod [ 07/May/21 ]

It's time for an update here - I have been digging into the performance regressions that have been captured. Broadly, our performance testing has captured three regressions. In short, those are:

Simple workloads that insert/remove or repeatedly update a small number of documents and are sensitive to the latency of individual operations can experience an increase in operation latency. That performance penalty seems to be further exacerbated when enabling higher levels of durability (write concerns of 2 and 3) - though it's not clear how or why that would be tied to durable history.

Repeatedly updating one or a small number of large documents has a throughput regression of up to 20% with our recommended setting.

Several of our performance tests report longer inter-test quiesce periods with extended history windows. I have not dug deeply into that behavior - our performance suite isn't designed to measure inter-test quiesce periods as a metric, so the comparison is unlikely to be fair.

Comment by Alexander Gorrod [ 30/Apr/21 ]

We have been analysing the performance regressions experienced when configuring different default windows of history on our automated tests.

I will add a summary of that analysis to this ticket early next week.

We are in the process of choosing a default time; the results of the performance analysis will likely guide us to choose something between 5 and 10 minutes.

Comment by A. Jesse Jiryu Davis [ 22/Jan/21 ]
  • A cluster-wide command sounds useful.
  • No restart required; it's adjustable at runtime. See the snapshot_history_window.js test.
  • A snapshot query outside the window fails with "SnapshotTooOld".
  • Users can choose a snapshot timestamp, or let mongod/mongos choose one for them. Snapshot query replies include the chosen timestamp, so an application could start an analytics job with a query that has readConcern level snapshot but no timestamp, record the server's chosen timestamp, and explicitly use that timestamp in all subsequent queries in the job (see the sketch after this list).
    • Yes, mongos establishes the timestamp and passes it to the shards.
    • See "Technical Design: Snapshot Reads on Secondaries" for the mongod/mongos logic for choosing a timestamp. mongod chooses the majority-committed timestamp. For implementation reasons, mongos is different, it chooses the latest known timestamp. In either case, queries on secondaries work correctly. A query on a stale secondary will wait for the secondary to catch up to the chosen timestamp.
Comment by Daniel Pasette (Inactive) [ 22/Jan/21 ]

I think jesse's point here about setting a constant window size is sound, though I'm still not quite clear what the user experience will feel like. These questions may already be answered, but I didn't find them stated in the product description or initiative plan.

  • Users cannot adjust the window cluster wide with a single cluster-wide command. They'd have to set it on each mongod. Could we use the cluster-wide writeConcern machinery to make it a single cmd?
  • Would it require a restart to change the window or can they do it on a running mongod?
  • What's the failure mode when a snapshot query cannot be satisfied due to window size limitations?
  • Will users be allowed to specify a snapshot time, or will the snapshot time be set when the query is received by the server?
    • In sharded clusters, will the snapshot time be established by the mongos and then passed to all mongods?
Comment by Brian Lane [ 21/Sep/20 ]

As discussed - going to park this in our backlog while we work on PM-1844 to see what improvements we may be able to get there first.

Comment by Tess Avitabile (Inactive) [ 20/Aug/20 ]

Great, thanks, brian.lane!

Comment by Brian Lane [ 19/Aug/20 ]

Hi tess.avitabile,

Alex did ping me in the write-up. I will assign this issue to myself and will be chatting to evin.roesle about this.

Thanks!

Comment by Tess Avitabile (Inactive) [ 19/Aug/20 ]

alexander.gorrod, we discussed that we'd request brian.lane to lead the investigation on how to create user guidelines for setting the amount of history to store, as well as what the default should be. Does that still sound okay? Should I assign this ticket to Brian to track that work?

Comment by Alexander Gorrod [ 21/May/20 ]

Thanks for the further experimentation, lingzhi.deng, and for the write-up. I would like to follow up on this conversation in detail, but am busy right now with coordinating changes for the 4.4 release.

My feeling here is that the performance changes you are seeing due to increasing the default time window are in line with what I would have expected. Requiring the storage engine to store 60 seconds of history means it will need to write version information to data files (since it won't all fit in cache for as long as that version information remains relevant). On top of that, for update workloads, the storage engine will likely be saving multiple different versions of documents to data files as well.

The goal of the durable history work in the last release was to make that cost reasonable when compared to the earlier cache overflow mechanism. I believe your results show that has been successful - most benchmark results are showing less than a 20% regression when requiring the additional history to be kept. We will hopefully be able to reduce that overhead as we spend time tuning the durable history mechanism after the 4.4 release is finalized, but there will be a cost with a lower bound that can be calculated in terms of additional disk space, I/O and CPU associated with keeping a longer window of history.

Comment by Alexander Gorrod [ 06/May/20 ]

lingzhi.deng Thanks for putting the numbers into a digestible form. I took a look, and the numbers aren't surprising at first glance. They seem to vary between 80-90% of the prior performance when configuring a 60-second window, with an outlier at 60% and one at 100%.

I think it would require more digging into the particular regression and the particular workload before determining exactly what is expected behavior. It's also probably worth waiting for some of the performance tickets that are currently in flight in WiredTiger before making a call. There is still some low-hanging fruit in terms of getting better performance.

Comment by A. Jesse Jiryu Davis [ 05/May/20 ]

Let's try hard to make the window a configurable constant instead of dynamically adjusted. In performance-sensitive applications it's better to be slow than unpredictable. (Fast and predictable is best, of course.) If a MongoDB deployment is running close to 100% capacity and a snapshot read causes its window to dynamically grow and its performance to decrease, that could cause an outage. I think customers would prefer to control the window size.
