[SERVER-48395] Extended stalls during heavy insert workload Created: 25/May/20  Updated: 29/Oct/23  Resolved: 15/Jul/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 4.4.0-rc6, 4.4.0-rc7
Fix Version/s: 4.4.0-rc13

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Sulabh Mahajan
Resolution: Fixed Votes: 0
Labels: KP44
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 4.2-pic.png     File insert.tgz     File repro.sh     PNG File server-resolved.png     PNG File stall-in-mem-split.png     PNG File stalls.png     PNG File with-oplog-closeup.png     PNG File with-oplog.png    
Issue Links:
Depends
depends on WT-6444 Abort a transaction if it is force ev... Closed
depends on WT-6484 Don't retry eviction if last_eviction... Closed
depends on WT-6488 Fail update restore eviction if no ne... Closed
Related
related to WT-6175 tcmalloc fragmentation is worse in 4.... Closed
related to WT-6444 Abort a transaction if it is force ev... Closed
is related to SERVER-48396 SIGUSR2 stack collection closes conne... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Storage - Ra 2020-06-15, Storage - Ra 2020-06-29, Storage - Ra 2020-07-13, Storage - Ra 2020-07-27
Participants:
Linked BF Score: 0
Story Points: 5

 Description   

While working with the repro for WT-6175 I noticed that there were extended stalls during the insert phase.

  • The stalls seem to end with the start of the next checkpoint
  • With checkpoints disabled the stalls lasted as long as 10 minutes
  • During the stalls the log reports operations that took the entire duration of the stall to complete
  • They appear to have something to do with page splits.

FTDC, logs, and repro script attached. The repro creates two collections of 5 GB each with a 5 GB cache, using 50 client threads on a machine with 24 cpus.
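For illustration, the insert phase has roughly the shape sketched below. This is not the attached repro.sh; thread count, batch size, and database/collection names are placeholders, and the 5 GB cache would be configured on mongod (e.g. with --wiredTigerCacheSizeGB 5).

{code:c}
/*
 * Illustrative sketch only -- not the attached repro.sh. Many client threads
 * insert batches of empty documents into one of two collections.
 */
#include <mongoc/mongoc.h>
#include <pthread.h>
#include <stdio.h>

#define THREADS 50
#define BATCH 10000
#define BATCHES_PER_THREAD 100 /* the real repro inserts ~5 GB per collection */

static void *
insert_worker(void *arg)
{
    /* Each thread uses its own client; mongoc_client_t is not thread safe. */
    mongoc_client_t *client = mongoc_client_new("mongodb://localhost:27017");
    mongoc_collection_t *coll =
        mongoc_client_get_collection(client, "test", (const char *)arg);
    bson_t *docs[BATCH];
    bson_error_t error;
    int i, b;

    for (i = 0; i < BATCH; i++)
        docs[i] = bson_new(); /* empty documents, as in the original test */

    for (b = 0; b < BATCHES_PER_THREAD; b++)
        if (!mongoc_collection_insert_many(
                coll, (const bson_t **)docs, BATCH, NULL, NULL, &error))
            fprintf(stderr, "insert failed: %s\n", error.message);

    for (i = 0; i < BATCH; i++)
        bson_destroy(docs[i]);
    mongoc_collection_destroy(coll);
    mongoc_client_destroy(client);
    return NULL;
}

int
main(void)
{
    pthread_t tids[THREADS];
    int i;

    mongoc_init();
    for (i = 0; i < THREADS; i++)
        pthread_create(&tids[i], NULL, insert_worker,
            (void *)(i % 2 == 0 ? "coll_a" : "coll_b"));
    for (i = 0; i < THREADS; i++)
        pthread_join(tids[i], NULL);
    mongoc_cleanup();
    return 0;
}
{code}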



 Comments   
Comment by Bruce Lucas (Inactive) [ 15/Jul/20 ]

Yes, I can confirm that the stalls are gone. Thanks!

Comment by Sulabh Mahajan [ 13/Jul/20 ]

I re-ran the test today on RC13, and see no stalls. If Bruce confirms that this issue is fixed, we will close this ticket.

Comment by Sulabh Mahajan [ 12/Jul/20 ]

bruce.lucas, the issues that we identified with this ticket and with WT-6444 have all been merged into v4.4. Could you please verify that the workload attached to this ticket is no longer stalling (any more than it does on 4.2)?

Comment by Alex Cameron (Inactive) [ 26/Jun/20 ]

I had a quick look at what we're doing with last_eviction_timestamp, since that's the check that is causing us trouble, but wasn't able to find anything from our usage of it. We're using it the same way as in v4.2, so the issue isn't obvious.

I had a bit of a suspicion that it was somehow related to the problem of not pinning ids for history store cursors (even though this is saving the pinned timestamp). I tried running our repro script even with my patch on top of WT-6462 (https://github.com/wiredtiger/wiredtiger/pull/5858), but still saw the erratic performance.

Comment by Sulabh Mahajan [ 25/Jun/20 ]

Copying the comment from WT-6444:

Update:

The issue here seems to be a manifestation of the same cause as SERVER-48395. I have not root caused it completely yet, but there are some observations so far:

  • The insert workload populates pages in such a way that a small number of pages get all the appends and quickly grow in size.
  • Once a page reaches the size at which pages are generally split (80% of the maximum page size, which is 10 MB for collections and 5 MB for indexes), it is forcibly evicted by the application threads before they can do more inserts.
  • This is expected: we detect an append workload and try to split so that the last inserts on the page move to a new page, which the threads can continue inserting into.
  • I noticed that many of these split attempts fail, because the split needs access to the parent page and we fail to get that access for some reason.
  • While we fail to split, we continue applying inserts and let the pages grow, but only to a certain extent. Eventually we start to see very slow eviction of these new pages, which I believe is the same page repeatedly being read in and evicted, resulting in a severe slowdown of the workload. Because this happens as part of the insert transaction, the transaction gets pinned as well.
  • While we are in this degraded state we are unable to evict effectively, and we repeatedly do one-for-one replacement of pages in memory. I think we are spending a lot of time trying to evict without making any real progress.
  • We have code that can detect that eviction is not making progress and is likely holding transactions pinned, so we fail and roll back the transaction. The rollback resolves the situation temporarily, until we get into the same degraded state again.

I am still working on figuring out a few things here:

  • Why are we unable to get a lock on the parent pages to do more effective forced in-memory splits?
  • What leads to the degraded state where eviction ends up replacing pages one by one without making any real progress?

There are a lot of moving parts here because the stall generally happens around 4% dirty, which is less than the point at which eviction starts (5%). As we stall and fail to evict (pinned transactions), the dirty content rises and crosses 5%, which then triggers different eviction behaviour. Trying to catch the stall before the eviction target of 5% is reached has been challenging. We are also working on a standalone WiredTiger workload to simulate the issue, one that has fewer moving parts and can be controlled more easily to debug the problem.
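A standalone WiredTiger append workload of the kind mentioned above might look roughly like the sketch below. This is illustrative only (single-threaded, with a made-up table name and loop bound; a real reproducer would append from many sessions in parallel), but the memory_page_max and split_pct settings correspond to the 10 MB page size and 80% split point discussed above.

{code:c}
/* Minimal sketch of a standalone WiredTiger append workload (illustrative only). */
#include <stdio.h>
#include <stdlib.h>
#include <wiredtiger.h>

#define CHECK(ret)                                                    \
    do {                                                              \
        if ((ret) != 0) {                                             \
            fprintf(stderr, "error: %s\n", wiredtiger_strerror(ret)); \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int
main(void)
{
    WT_CONNECTION *conn;
    WT_SESSION *session;
    WT_CURSOR *cursor;
    int64_t key;

    /* 5 GB cache, as in the attached repro; "WT_HOME" must exist already. */
    CHECK(wiredtiger_open("WT_HOME", NULL, "create,cache_size=5GB", &conn));
    CHECK(conn->open_session(conn, NULL, NULL, &session));
    /* 10 MB max in-memory page, split at 80%, matching the numbers above. */
    CHECK(session->create(session, "table:stall",
        "key_format=q,value_format=S,memory_page_max=10MB,split_pct=80"));
    CHECK(session->open_cursor(session, "table:stall", NULL, NULL, &cursor));

    /* Monotonically increasing keys: every insert appends to the same hot page. */
    for (key = 1; key <= 10000000; key++) {
        cursor->set_key(cursor, key);
        cursor->set_value(cursor, "");
        CHECK(cursor->insert(cursor));
    }

    CHECK(conn->close(conn, NULL));
    return 0;
}
{code}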

Comment by Brian Lane [ 18/Jun/20 ]

Thanks Sulabh for investigating.

I agree we should fix this, but not block 4.4 GA on it. I am OK with this one being moved to 4.5 required.

Comment by Alexander Gorrod [ 17/Jun/20 ]

To characterize this issue somewhat. The test program has 50 threads inserting batches of 10k empty documents in parallel. That is a worst case scenario for testing how well WiredTiger can get exclusive access to hot pages in append workloads. We have code to efficiently split the front of a busy page and allow new updates to come into that page while we clean up the old page. If there is enough activity remaining on the old page (which can happen when there are lots of threads contending to append), we can end up in a situation where those threads that are still updating the previous page wait for it to be force evicted, and then wait for their part of it to be read back in before being able to proceed.

It's something we need to understand and address, but it's not something MongoDB users rely on at the scale used by this test, as indicated by the problem being very significantly mitigated when the oplog is enabled.

I think we need to fix this, but after talking it through with sulabh.mahajan today, I'm not sure it should be a 4.4.0 release blocker.

bruce.lucas or kelsey.schubert, pasette and brian.lane. Could you please let me know if you agree?

Comment by Sulabh Mahajan [ 17/Jun/20 ]

I ran the same test with the oplog enabled, and ran 4.2 as well to compare. The problem goes away when the node is part of a replica set. We still see some dip in insert performance, but that happens during a checkpoint, not before it. The same dip is also observed in 4.2, so it is not a new problem in 4.4. The magnitude of the dip might be larger than in 4.2, but the test is not designed to be conclusive on that point.

Why we have stalls without oplog:

  • From the statistics, the best guess is that we are force-evicting pages more than before. There are new conditions in 4.4 around when a page is considered evictable (some of its content has been moved to the history store?) that were not present in 4.2.
  • Since we evict pages that other threads are still populating, those threads stall waiting for the page to be reconciled, evicted, and then read back in again, bringing performance down.
  • Since we do not force evict while a checkpoint is in progress, the insert rate goes up when a checkpoint starts.

Comment by Sulabh Mahajan [ 16/Jun/20 ]

That's interesting - can you tell whether the maximum page size at eviction was growing as large with 4.2?

In 4.2, we do get to a similarly huge page by the time we evict, but we do not see stalls or any correlation with the forced eviction / in-memory page splits.

I have tried disabling force evicts in 4.4, but I think I have not disabled all the paths to force evict, and I still can't get rid of the stalls.

We should figure out if something has changed here in terms of effectively getting exclusive access to the page when we start wanting it.

Will check that tomorrow.

Comment by Alexander Gorrod [ 16/Jun/20 ]

I suspect what is happening here is that during a checkpoint we know we can't force evict the page, so we don't bother trying (that's the call to __wt_btree_can_evict_dirty in __wt_page_can_evict). So during the checkpoint we allow a lot of updates to come into the page, and the page grows to be large.

Once the checkpoint finishes, we keep trying to force evict the page, but aren't being successful. The attempts to force evict the page cause throughput to slow down significantly. We should figure out whether something has changed here in terms of effectively getting exclusive access to the page when we start wanting it.
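As a rough model of that interaction (hypothetical types and names, not the WiredTiger source), the behaviour being described is roughly:

{code:c}
/*
 * Hypothetical, simplified model: while a checkpoint is running on the tree,
 * dirty pages are not force-evicted, so appends keep landing on the same page
 * and it grows large. Types and names here are illustrative only.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct btree {
    bool checkpoint_running;    /* a checkpoint is walking this tree */
};

struct page {
    struct btree *btree;
    bool dirty;
    size_t footprint;           /* current in-memory size       */
    size_t force_evict_trigger; /* size at which we force evict */
};

/* Analogue of the __wt_btree_can_evict_dirty check referenced above. */
static bool
btree_can_evict_dirty(const struct btree *btree)
{
    return !btree->checkpoint_running;
}

/* Analogue of the dirty-page part of __wt_page_can_evict. */
static bool
page_can_force_evict(const struct page *page)
{
    if (page->dirty && !btree_can_evict_dirty(page->btree))
        return false; /* during checkpoint: don't bother, let the page grow */
    return page->footprint > page->force_evict_trigger;
}

int
main(void)
{
    struct btree tree = {.checkpoint_running = true};
    struct page hot = {
        .btree = &tree, .dirty = true,
        .footprint = 30u << 20, .force_evict_trigger = 10u << 20};

    printf("during checkpoint: %d\n", page_can_force_evict(&hot));
    tree.checkpoint_running = false; /* checkpoint ends */
    printf("after checkpoint:  %d\n", page_can_force_evict(&hot));
    return 0;
}
{code}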

Comment by Alexander Gorrod [ 16/Jun/20 ]

That's interesting - can you tell whether the maximum page size at eviction was growing as large with 4.2?

Comment by Sulabh Mahajan [ 16/Jun/20 ]

Update:

  • I am able to reproduce with the attached script.
  • I have run it multiple times with per-file FTDC collection as well to analyse the data. Sometimes there are stalls that do not correlate with a checkpoint; the start of the checkpoint "helps" release the stall.
  • After spending a lot of time looking at FTDC stats, I think it is because the pages that are getting the inserts grow large and are forcibly evicted. While the application threads are force evicting these pages, they are throttled from doing more work.

The above theory is not conclusive yet; I am still working on the ticket and will post more updates.

Comment by Bruce Lucas (Inactive) [ 29/May/20 ]

Thanks for looking at this.

I think it's probably the other way around - checkpoints are waiting for a reconciliation to finish before they can get started.

Not so sure about that - the checkpoints are happening regularly, with exactly the expected 60 seconds between the end of one and the start of the next. Also, without checkpoints the stalls last much longer. Naively, it's as if the start of the checkpoint is somehow interrupting the slow reconciliation process that you describe.

I seem to recall seeing this behavior before, where a stall ended with an on-schedule checkpoint, e.g. SERVER-16938, also associated with highly concurrent workload.

Comment by Alexander Gorrod [ 29/May/20 ]

The stalls seem to end with the start of the next checkpoint

I think it's probably the other way around - checkpoints are waiting for a reconciliation to finish before they can get started.

With checkpoints disabled the stalls lasted as long as 10 minutes

It's just a theory, but I think what is going on here is that the workload has 50 threads all contending to insert things at the head of a collection. We have historically had some trouble maintaining throughput in this workload, and have a set of heuristics related to getting exclusive access to a hot page, splitting content off it, and allowing the appends to continue.

We start trying to get exclusive access to do what we term an in-memory split when a page gets to 8 MB (80% of memory_page_max) in cache. If we haven't succeeded in getting exclusive access by the time the page reaches 10 MB, we get more and more assertive (disrupting the application threads' usual control flow) in trying to get exclusive access. Once it's reached that threshold we more or less give up on in-memory splits and just reconcile the page. In this workload, by the time we get exclusive access to the page it has grown to 70 MB. It takes a long time for the thread reconciling that page to process all the content - even more so because we currently have some inefficiencies in how reconciliation checks for history that needs to be cleaned up (e.g. WT-5769).
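As a toy model of those thresholds (names and structure are made up; only the numbers and the order of decisions follow the description above):

{code:c}
/* Toy model of the split/eviction thresholds described above. */
#include <stdio.h>

#define MB (1024 * 1024)
#define MEMORY_PAGE_MAX (10 * MB)                /* max in-memory page size      */
#define SPLIT_TRIGGER (MEMORY_PAGE_MAX * 8 / 10) /* 80%: try an in-memory split  */

static const char *
hot_page_action(long footprint)
{
    if (footprint < SPLIT_TRIGGER)
        return "keep appending";
    if (footprint < MEMORY_PAGE_MAX)
        return "try to get exclusive access for an in-memory split";
    /*
     * Past memory_page_max we get increasingly assertive and eventually give
     * up on splitting, reconciling/evicting the whole page instead -- which
     * in this workload can be ~70 MB by the time exclusive access is obtained.
     */
    return "force reconciliation/eviction of the (now very large) page";
}

int
main(void)
{
    long sizes[] = {4L * MB, 9L * MB, 70L * MB};
    int i;

    for (i = 0; i < 3; i++)
        printf("%ld MB -> %s\n", sizes[i] / MB, hot_page_action(sizes[i]));
    return 0;
}
{code}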

We've also changed some of the restrictions about when a page can be evicted and what happens to it. It used to be that we didn't allow eviction of a page with updates in the range of active transaction IDs unless the cache overflow mechanism was in use. With durable history we've relaxed some of that constraint, and it's probably hurting here.

tl;dr: The stalls here have the hallmarks of multi-threaded append contention; the symptom is exacerbated by some inefficiencies in reconciliation and by eviction scheduling changes that are associated with durable history. We are working on the reconciliation inefficiencies in WT-5769, and can schedule work to look at the scheduling problem if an issue remains.

Comment by Bruce Lucas (Inactive) [ 26/May/20 ]

michael.cahill, no, I reran this (the insert portion of that test only) against rc6. Looks like I forgot to attach ftdc and logs; attached now.

Comment by Michael Cahill (Inactive) [ 25/May/20 ]

bruce.lucas, was this against 4.4.0-rc3, as per your most recent comment in WT-6175?
