[SERVER-48395] Extended stalls during heavy insert workload Created: 25/May/20 Updated: 29/Oct/23 Resolved: 15/Jul/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 4.4.0-rc6, 4.4.0-rc7 |
| Fix Version/s: | 4.4.0-rc13 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Sulabh Mahajan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | KP44 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Storage - Ra 2020-06-15, Storage - Ra 2020-06-29, Storage - Ra 2020-07-13, Storage - Ra 2020-07-27 |
| Participants: | |
| Linked BF Score: | 0 |
| Story Points: | 5 |
| Description |
|
While working with the repro for
FTDC, logs, and repro script attached. The repro creates two collections of 5 GB each with a 5 GB cache, using 50 client threads on a machine with 24 CPUs. |
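The attached repro script is not reproduced in this export. The following is a rough sketch of the workload shape described here and in the later comments (50 client threads inserting batches of 10k empty documents into two collections), assuming a pymongo client against a standalone mongod started with a 5 GB cache (`--wiredTigerCacheSizeGB 5`); the URI, batch counts, and collection names are illustrative and the real script may differ.

```python
import threading
from pymongo import MongoClient

THREADS = 50
BATCH = 10_000
BATCHES_PER_THREAD = 200   # tune until each collection reaches roughly 5 GB

def worker(idx):
    # each thread appends batches of empty documents to one of two collections
    coll = MongoClient("mongodb://localhost:27017").test[f"c{idx % 2}"]
    for _ in range(BATCHES_PER_THREAD):
        coll.insert_many([{} for _ in range(BATCH)], ordered=False)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```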
| Comments |
| Comment by Bruce Lucas (Inactive) [ 15/Jul/20 ] |
|
Yes, I can confirm that the stalls are gone. Thanks! |
| Comment by Sulabh Mahajan [ 13/Jul/20 ] |
|
I re-ran the test today on RC13, and see no stalls. If Bruce confirms that this issue is fixed, we will close this ticket. |
| Comment by Sulabh Mahajan [ 12/Jul/20 ] |
|
bruce.lucas, the issues that we identified with this ticket and with |
| Comment by Alex Cameron (Inactive) [ 26/Jun/20 ] |
|
I wasn't able to find anything from our usage of the last_eviction_timestamp. I had a quick look at what we're doing with last_eviction_timestamp, since that's the check that is causing us trouble. We're using it the same way as in v4.2, so the issue isn't obvious. I had a suspicion that it was somehow related to the problem of not pinning IDs for history store cursors (even though this is saving the pinned timestamp). I tried running our repro script even with my patch on top of |
| Comment by Sulabh Mahajan [ 25/Jun/20 ] |
|
Copying the comment from
Update: The issue here seems to be a manifestation of the same cause as
I am still working on figuring out a few things here:
There are a lot of moving parts here because the stall generally happens at around 4% dirty, which is below the point where eviction starts (5%). As we stall and fail to evict (due to pinned transactions), the dirty content rises and crosses 5%, which in turn triggers a different eviction behaviour. Trying to catch the stall before the 5% eviction target is reached has been challenging. We are also working on a standalone WiredTiger workload to trigger the issue, one with fewer moving parts that can be controlled easily while debugging the problem. |
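As an aid to catching the stall before the 5% target is reached, one option is to poll serverStatus and compute the dirty-cache ratio from the WiredTiger cache statistics. This is a minimal sketch, not part of the ticket; the statistic names are the usual serverStatus strings but should be treated as assumptions and may differ between versions.

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

while True:
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    configured = cache["maximum bytes configured"]
    dirty = cache["tracked dirty bytes in the cache"]
    used = cache["bytes currently in the cache"]
    # the stall discussed above shows up while dirty is still below the 5% target
    print(f"dirty {100.0 * dirty / configured:5.2f}%   "
          f"used {100.0 * used / configured:5.2f}%")
    time.sleep(1)
```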
| Comment by Brian Lane [ 18/Jun/20 ] |
|
Thanks, Sulabh, for investigating. I agree we should fix this, but not block 4.4 GA on it. I am OK with seeing this one moved to 4.5 required. |
| Comment by Alexander Gorrod [ 17/Jun/20 ] |
|
To characterize this issue somewhat: the test program has 50 threads inserting batches of 10k empty documents in parallel. That is a worst-case scenario for testing how well WiredTiger can get exclusive access to hot pages in append workloads. We have code to efficiently split the front of a busy page and allow new updates to come into that page while we clean up the old page. If there is enough activity remaining on the old page (which can happen when there are lots of threads contending to append), we can end up in a situation where the threads still updating the previous page wait for it to be force-evicted, and then wait for their part of it to be read back in before being able to proceed. It's something we need to understand and address, but it's not something MongoDB users rely on at the scale used by this test, as indicated by the problem being very significantly mitigated when the oplog is enabled. I think we need to fix this, but after talking it through with sulabh.mahajan today, I'm not sure it should be a 4.4.0 release blocker. bruce.lucas, kelsey.schubert, pasette, and brian.lane: could you please let me know if you agree? |
| Comment by Sulabh Mahajan [ 17/Jun/20 ] |
|
I ran the same test with the oplog enabled, and ran 4.2 to compare as well. The problem goes away when the node is part of a replica set. We still have some dip in insert performance, but it happens during a checkpoint, not before it. The same dip is also observed in 4.2, so it is not a new problem in 4.4. The magnitude of the dip might be larger than in 4.2, but the test is not designed to be conclusive on that point. Why we have stalls without oplog:
|
| Comment by Sulabh Mahajan [ 16/Jun/20 ] |
In 4.2, we do get to a similarly huge page by the time we evict, but we do not see stalls or a correlation with forced eviction / in-memory page splits. I have tried disabling forced eviction in 4.4, but I think I have not disabled all the paths to force eviction, and I still cannot get rid of the stalls.
Will check that tomorrow. |
| Comment by Alexander Gorrod [ 16/Jun/20 ] |
|
I suspect what is happening here is that during checkpoints we know we can't force-evict the page, so we don't bother trying (that's the call to __wt_btree_can_evict_dirty in __wt_page_can_evict). So during a checkpoint we allow a lot of updates to come into the page, and the page grows to be large. Once the checkpoint finishes, we keep trying to force-evict the page, but aren't being successful. The attempts to force-evict the page cause throughput to slow down significantly. We should figure out whether something has changed here in terms of effectively getting exclusive access to the page when we start wanting it. |
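One way to check this theory against a running node is to correlate forced-eviction activity and the largest page seen at eviction with checkpoint state over time. The sketch below is illustrative only; the statistic names are the usual WiredTiger serverStatus strings but are assumptions and may vary by version, hence the defensive .get() lookups.

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

while True:
    wt = client.admin.command("serverStatus")["wiredTiger"]
    cache, txn = wt["cache"], wt["transaction"]
    print(
        "checkpoint running:", txn.get("transaction checkpoint currently running"),
        "| forced evictions:", cache.get("forced eviction - pages selected count"),
        "| max page size at eviction:", cache.get("maximum page size at eviction"),
    )
    time.sleep(1)
```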
| Comment by Alexander Gorrod [ 16/Jun/20 ] |
|
That's interesting - can you tell whether the maximum page size at eviction was growing as large with 4.2? |
| Comment by Sulabh Mahajan [ 16/Jun/20 ] |
|
Update:
The above theory is not conclusive yet; I am still working on the ticket and will post more updates. |
| Comment by Bruce Lucas (Inactive) [ 29/May/20 ] |
|
Thanks for looking at this.
Not so sure about that - the checkpoints are happening regularly, with exactly the expected 60 seconds between the end of one and the start of the next. Also, without the checkpoints the stalls last much longer. Naively, it's as if the start of the checkpoint is somehow interrupting the slow reconciliation process that you describe. I seem to recall seeing this behavior before, where a stall ended with an on-schedule checkpoint, e.g. |
| Comment by Alexander Gorrod [ 29/May/20 ] |
I think it's probably the other way around - checkpoints are waiting for a reconciliation to finish before they can get started.
It's just a theory, but I think what is going on here is that the workload has 50 threads all contending to insert things at the head of a collection. We have historically had some trouble maintaining throughput in this workload, and have a set of heuristics related to getting exclusive access to a hot page, splitting content off it, and allowing the appends to continue. We start trying to get exclusive access to do what we term an in-memory split when a page gets to 8MB (80% of memory_page_max) in cache. If we haven't succeeded in getting exclusive access by the time the page reaches 10MB, we get more and more assertive (disrupting application threads' usual control flow) to try to get exclusive access. Once it has passed that threshold we essentially give up on in-memory splits and just reconcile the page. In this workload, by the time we get exclusive access to the page it has grown to 70MB. It takes a long time for the thread reconciling that page to process all the content - even more so because we currently have some inefficiencies in how reconciliation checks for history that needs to be cleaned up (e.g.

We've also changed some of the restrictions about when a page can be evicted and what happens to it. It used to be that we didn't allow eviction of a page with updates in the range of active transaction IDs unless the cache overflow mechanism was in use. With durable history we've relaxed some of that constraint, and it's probably hurting here.

tl;dr: The stalls here have the hallmarks of multi-threaded append contention; the symptom is exacerbated by some inefficiencies in reconciliation and by eviction scheduling changes that are associated with durable history. We are working on the reconciliation inefficiencies in |
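For context, the 8MB/10MB thresholds above are derived from the per-table memory_page_max setting (the numbers quoted imply 10MB for these collections). As an illustration only, and assuming the option is exposed this way in the version under test, memory_page_max can be overridden per collection at creation time via a WiredTiger configString:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.test

# illustrative override; the in-memory split heuristics described above key off
# this value (split attempts begin at roughly 80% of memory_page_max)
db.create_collection(
    "append_heavy",
    storageEngine={"wiredTiger": {"configString": "memory_page_max=10MB"}},
)
```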
| Comment by Bruce Lucas (Inactive) [ 26/May/20 ] |
|
michael.cahill, no, I reran this (the insert portion of that test only) against rc6. Looks like I forgot to attach FTDC and logs; attached now. |
| Comment by Michael Cahill (Inactive) [ 25/May/20 ] |
|
bruce.lucas, was this against 4.4.0-rc3, as per your most recent comment in |