[SERVER-24949] Lower WiredTiger idle handle timeout to 10 minutes Created: 08/Jul/16 Updated: 21/Nov/23 Resolved: 27/Apr/21 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.0.0-rc0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Alexander Gorrod | Assignee: | Louis Williams |
| Resolution: | Done | Votes: | 0 |
| Labels: | 3.7BackgroundTask |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Execution Team 2021-05-03 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 0 |
| Description |
The MongoDB storage layer currently configures WiredTiger to keep idle collection and index handles open for 28 hours after the last use. We've seen cases where that leads to the WiredTiger handle list growing very large unnecessarily, which can introduce performance problems. We should consider the consequences of reducing that time closer to the default value of 30 seconds.
| Comments |
| Comment by Louis Williams [ 05/May/21 ] |

This makes sense to me. I opened
| Comment by Alexander Gorrod [ 03/May/21 ] |

I don't think that's the issue. Most of the performance issues we've encountered are due to the WiredTiger handle structures.

Yep - we could do that, or stage flushing content from the cache. Both of those changes would require work in WiredTiger, but neither is particularly daunting. In other words: I think we can solve this if it's still an issue in the field.

That's OK with me. I'm not sure what the right number of handles is. ~80 collections with an additional index each seems reasonable to me, but others here have more experience with what sort of distribution of collections is likely in workloads that could be sensitive to this.
| Comment by Bruce Lucas (Inactive) [ 01/May/21 ] |

Agree it seems to make sense to increase close_handle_minimum. Another thing that might lessen the impact is to preferentially close the handles with the smallest amount of data in cache. I'm not sure how expensive that would be, since it may require sorting the handle list, but perhaps there is some heuristic way to do it that would be good enough? However, if the issue is the number of file descriptors, would it be possible to just close the file descriptors and not evict the data from cache (i.e. keep the btree and/or handle)? I would think aggressively closing file descriptors would have less performance impact than aggressively removing btrees, since re-opening file descriptors should be quick.
| Comment by Eric Milkie [ 30/Apr/21 ] |

I think we should still investigate increasing close_handle_minimum in conjunction with this change, as it has the potential to lessen the undesirable effect of evicting tables with periodic workloads while still reducing high numbers of open file handles in general.
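For reference, the settings being discussed here live in WiredTiger's file_manager configuration group. The following is a minimal standalone sketch, using illustrative values rather than MongoDB's actual configuration, of how an embedded application would pass them to wiredtiger_open():

```cpp
#include <cstdio>
#include <wiredtiger.h>

// Illustrative values only, not MongoDB's configuration:
//  - close_idle_time:      seconds a handle may sit idle before the sweep closes it
//  - close_handle_minimum: number of open handles below which the sweep leaves them alone
//  - close_scan_interval:  seconds between sweeps of the handle list
int main() {
    WT_CONNECTION* conn = nullptr;
    const char* config =
        "create,"
        "file_manager=(close_idle_time=600,close_handle_minimum=250,close_scan_interval=10)";
    int ret = wiredtiger_open("WT_HOME", nullptr, config, &conn);
    if (ret != 0) {
        std::fprintf(stderr, "wiredtiger_open failed: %s\n", wiredtiger_strerror(ret));
        return 1;
    }
    return conn->close(conn, nullptr);
}
```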
| Comment by Louis Williams [ 30/Apr/21 ] |

It seems like there are two competing interests here:

- Workloads with many collections and indexes, where handles that stay open long after their last use accumulate and cause the problems described in this ticket.
- Workloads with periodic access patterns, where closing an idle handle flushes its cached content and forces the cache to be re-warmed when the collection becomes active again.

The problem here is that we don't really have any insight into which workload is more common. I've discussed this with Alex and we think that it's worth trying out this change to better support durable history by default. In the event that this causes problems for customers, we have a way out, either by reverting or by manually changing the parameter on a per-customer basis. Keep in mind that we already do this for customers where the default timeout is problematic.
| Comment by Daniel Pasette (Inactive) [ 28/Apr/21 ] |

If we're going to make this change, I think it would be wise to increase the close_handle_minimum (is that the correct setting?) significantly as well. I don't see the harm in keeping idle collections in cache if there is no other cache pressure. I'm talking about the case where you have lots of collections (or collections with many indexes) and your workload quiesces at night, but then you have to pay to page all the data back in the next day. I agree it's not the most common case, but I do recall support issues for this case, and it seems to me that this change will impact them. If they're on Atlas, I don't think there's any way they can tweak a knob to change this behavior.
| Comment by Alexander Gorrod [ 28/Apr/21 ] |

pasette we have not done any work to make it cheaper to close out handles that hold a lot of pages in cache. On the other hand, getting into that situation I think takes some careful construction.

In short, an application needs to have a significant number of active collections, but not be generating meaningful cache pressure due to operations on those collections. It then must have a collection that was being actively used (hence content in cache) but went idle for 10 minutes - and after exactly 10 minutes pass, the application wants to use the collection again and needs to wait.

Most of the reports associated with

It is possible for users to encounter this behavior, but it doesn't seem likely. If we notice it in the field we can review how the sweep works and ensure that the blocking period of closing out idle handles isn't too long.
| Comment by Louis Williams [ 27/Apr/21 ] |

The default WiredTiger idle handle timeout has been lowered to 10 minutes from 27 hours. This may result in performance changes in applications with many collections and workloads where collections are idle for longer than 10 minutes. This parameter is still configurable with the setParameter "wiredTigerFileHandleCloseIdleTime".
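Assuming the parameter takes a value in seconds and, like other startup server parameters, can be supplied on the command line, a hypothetical invocation such as `mongod --setParameter wiredTigerFileHandleCloseIdleTime=100000` would restore roughly the previous ~27-hour behavior, while the new default corresponds to a value of 600.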
| Comment by Githook User [ 27/Apr/21 ] |

Author: Louis Williams <louis.williams@mongodb.com> (louiswilliams)
Message:
| Comment by Daniel Pasette (Inactive) [ 22/Apr/21 ] |

Yes, I do remember, and you've captured the issue. I believe it tracks back to around this issue (and the linked issues from it):
| Comment by Bruce Lucas (Inactive) [ 22/Apr/21 ] |

I think any change like this is likely to produce some very surprising results for some applications. The original choice of just over 24 hours was partly motivated, IIRC, by customers who hit nasty performance surprises when a load with a strong daily cycle, largely idle overnight, came back online at 8 AM and suddenly had to rewarm the cache even though the working set fit in cache and was consistent from day to day. pasette I think you were involved - do you recall the cases I'm talking about?
| Comment by Eric Milkie [ 22/Apr/21 ] |

Because this will result in flushing all cached pages for files that are idle longer than 10 minutes, we should be prepared for some workloads to have performance changes due to this. In particular, very idle databases might see a negative impact on the latency of all read queries.
| Comment by Alexander Gorrod [ 15/Mar/21 ] |

We came across a case where the default chosen here is harmful to applications - in

The reason this happens is that MongoDB doesn't drop collections until they are no longer required for the snapshot history window, so extending that window means collections aren't dropped. That, in combination with keeping idle handles cached for at least 27 hours, means that a lot of active handles can now accumulate in such a workload. We should reduce the idle timeout for handles. My recommendation would be to reduce the idle timeout to 10 minutes; the change would look something like:
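A minimal sketch of the kind of change being recommended, assuming the timeout is applied through the file_manager group of the configuration string the storage engine builds for wiredtiger_open(); the helper name and surrounding structure are illustrative, not the actual MongoDB source:

```cpp
#include <sstream>
#include <string>

// buildFileManagerConfig is a hypothetical helper: it only illustrates where the
// idle-handle timeout (in seconds) would be lowered in the WiredTiger open-config
// string. 600 seconds corresponds to the proposed 10-minute timeout; the previous
// value was large enough to keep idle handles open for roughly a day.
std::string buildFileManagerConfig(int closeIdleTimeSecs = 600) {
    std::stringstream ss;
    ss << "file_manager=(close_idle_time=" << closeIdleTimeSecs << "),";
    return ss.str();
}
```

The resulting fragment would be appended to the rest of the connection configuration; the user-facing override is the wiredTigerFileHandleCloseIdleTime parameter mentioned elsewhere in this ticket.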
The value was originally set so high because there was a performance test which sat idle for an extended period of time (multiple hours) between phases. A consequence of closing idle handles is that their content is flushed from the cache, so keeping the handle open across idle periods meant less cache warming was required and better performance was observed. That's not a common pattern, and we have seen issues in MongoDB deployments with many live (though inactive) handles over a number of years now. Further information about the behavior can be seen in
| Comment by Alexander Gorrod [ 08/Jul/16 ] |

The current source code has the following comment:

The original change was made in