[SERVER-36221] [3.6] Performance regression on small updates to large documents Created: 20/Jul/18  Updated: 06/Dec/22  Resolved: 28/Feb/19

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.6.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Duplicate Votes: 0
Labels: dmd-perf
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: erratic-1.png, erratic-1.tar, evictions.png, four-erratic.png, four.png, regression.png, repro-10MBx2-erratic-1.sh, repro-10MBx2-regression.sh, stacks-regression.txt
Issue Links:
Backports
Depends
depends on WT-4483 Improve caching of small updates to l... Closed
Duplicate
duplicates WT-4483 Improve caching of small updates to l... Closed
Related
related to SERVER-35958 Big CPU load increase (×4) on seconda... Closed
is related to SERVER-31679 Increase in disk i/o for writes to re... Closed
is related to SERVER-33191 Cache-full hangs on 3.6 Closed
is related to SERVER-36373 create test to fill WT cache during s... Closed
is related to WT-4105 Optimize cache usage for update workl... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Sprint: Storage NYC 2018-07-30, Storage NYC 2018-08-13, Storage NYC 2018-08-27, Storage NYC 2018-09-10, Storage NYC 2018-10-22, Storage NYC 2018-11-05
Participants:
Linked BF Score: 0

 Description   

The update portion of the repro script repro-10MBx2.sh from SERVER-33191 shows a ~10x regression between 3.6.5 (left) and 3.6.6 (right); see the attached regression.png.

This seems to be related to eviction of large pages. Interestingly, even though we are doing only about 9 updates per second, we are doing about 170 evictions per second.
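
For reference, the update phase has roughly the following shape (a minimal sketch only; repro-10MBx2.sh from SERVER-33191 is the authoritative script, and the collection name, document sizes, counts, and field names below are illustrative assumptions):

    # Sketch of the workload: repeated small updates to a couple of ~10MB documents.
    mongo --quiet --eval '
        var coll = db.getSiblingDB("test").c;
        var pad = new Array(10 * 1024 * 1024).join("x");    // ~10MB payload
        for (var i = 0; i < 2; i++) {
            coll.insert({_id: i, pad: pad, counter: 0});
        }
        for (var n = 0; n < 1000; n++) {                    // small updates to large docs
            coll.update({_id: n % 2}, {$set: {counter: n}});
        }
    '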



 Comments   
Comment by Kelsey Schubert [ 28/Feb/19 ]

This regression no longer reproduces after the changes from WT-4483 were backported into the 3.6 branch. This will be resolved in MongoDB 3.6.11.

Comment by Bruce Lucas (Inactive) [ 04/Dec/18 ]

Thanks Michael. That seems consistent with the following metrics for four runs (3.6.5, 3.6.6, 3.6.8, 4.0.2).

I think that the "pages evicted because they exceeded the in-memory maximum count" refers to pages evicted due to exceeding the maximum number of deltas as you suggested.

We also see, as you suggested, the oldest_timestamp being updated less often. Interestingly, the rate at which the oldest timestamp is updated is about the same as the rate of log flushes, so perhaps something connects them. The change that caused this regression (SERVER-31679) reduced the rate of log flushes, so if that also reduced the rate at which oldest_timestamp is updated, that could explain the regression.

In the 4.0.2 run, however, we see that the rate of log flushes is as high as it was in 3.6.5 before SERVER-31679. So if the above analysis is correct, the reason we don't see the regression in 4.0 is that SERVER-31679 is for some reason ineffective in 4.0. Maybe the reason is that there are a couple of conditions under which we don't reduce the rate of flushes, and maybe those conditions hold more often in 4.0 than in 3.6?
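
For anyone re-checking this on a live node: the relevant rates can be read from serverStatus, but the exact WT stat names vary a bit across versions, so the sketch below just filters the log and transaction sections for sync/timestamp counters rather than hard-coding names. (This is only a sketch; the numbers above came from the FTDC data.)

    # Dump WT log-sync and timestamp-related counters from serverStatus.
    # Stat names differ across versions, so filter by substring.
    mongo --quiet --eval '
        var wt = db.serverStatus().wiredTiger;
        ["log", "transaction"].forEach(function(section) {
            Object.keys(wt[section]).forEach(function(k) {
                if (/sync|timestamp/i.test(k)) {
                    print(section + ": " + k + " = " + wt[section][k]);
                }
            });
        });
    '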

Comment by Michael Cahill (Inactive) [ 04/Dec/18 ]

brian.lane pinged me about this ticket.

This workload is sensitive to two factors:

  1. how large a page is allowed to grow in cache before it is forcibly evicted (by default this is 10MB so the reproducer creates a page that is permanently over this limit); and
  2. how many delta updates accumulate before making a full copy of the document (default 10, as long as one of the updates is stable, i.e., older than oldest_timestamp).

The WT code for both of these is effectively the same between v3.6 and v4.0, but both are sensitive to how frequently the oldest_timestamp is updated. That's because WT won't retry eviction of a page if oldest_timestamp hasn't changed since the last time it tried, and because we keep buffering delta changes if the oldest_timestamp hasn't moved forward. Given this, I suspect some difference in replication behavior is involved, and backporting that change just to address this performance regression may not be desirable.

Instead, we could consider allowing pages containing large documents to grow larger and/or allowing more delta updates to accumulate before making full copies of documents. The former trades larger pages in cache for less frequent eviction; the latter will make some reads slower (because they have to apply more delta updates) in order to make the updates faster.
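
For the first knob, the WiredTiger page size limit can already be raised per collection at creation time via a configString. A minimal sketch (the 50MB value is just an example, and whether raising it is advisable is exactly the trade-off described above; I'm not aware of an externally supported knob for the delta-update count):

    # Create a collection whose WT pages may grow to 50MB in cache before being
    # forcibly evicted (the ticket notes the default memory_page_max is 10MB).
    mongo --quiet --eval '
        db.getSiblingDB("test").createCollection("c", {
            storageEngine: {wiredTiger: {configString: "memory_page_max=50MB"}}
        })
    '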

Comment by Bruce Lucas (Inactive) [ 01/Oct/18 ]

After a little investigation, here's some more information (some of this confirms or repeats comments from above):

  • The script stops one of the secondaries, and the regression is much smaller if the secondary is not stopped. The following table shows the time taken for the update phase of the script, which is where the regression shows up:

            one secondary    both secondaries
            running          running
     
    3.6.5     28 sec          26 sec
    3.6.6    183 sec          41 sec
    

    I suspect this is because the change in SERVER-31679 only applies when there are no oplog waiters, and with only one secondary running that condition holds for a much larger fraction of the time. I believe the same regression would be seen if both secondaries were running but lagged, because then they would not be waiting for the oplog as often.

  • The regression appears to be the result of doing far more evictions. As noted in the description, in 3.6.6 we are doing about 170 evictions per second ("pages evicted because they exceeded the in-memory maximum") even though we are only doing about 9 updates per second; a quick way to watch these two counters is sketched after the questions below. Stack traces taken during a run tell the same story:

    3.6.5:  4 of 13 (31%) of CmdUpdate are in __wt_page_release_evict
    3.6.6: 41 of 43 (95%) of CmdUpdate are in __wt_page_release_evict
    

So the questions seem to be:

  • why does the change made in SERVER-31679 have the side effect of causing far more evictions?
  • why does the same issue not occur in 4.0?
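
As mentioned above, here is a quick way to watch the eviction and update rates side by side on a live mongod (one-second counter deltas; the numbers above came from FTDC, and the WT stat name may differ slightly between versions):

    # Print evictions/sec vs updates/sec once per second (Ctrl-C to stop).
    # The cache stat name is as quoted in this ticket; adjust if your version differs.
    mongo --quiet --eval '
        function sample() {
            var s = db.serverStatus();
            return {
                evict: s.wiredTiger.cache["pages evicted because they exceeded the in-memory maximum"],
                upd: s.opcounters.update
            };
        }
        var prev = sample();
        while (true) {
            sleep(1000);
            var cur = sample();
            print("evictions/sec: " + (cur.evict - prev.evict) +
                  "  updates/sec: " + (cur.upd - prev.upd));
            prev = cur;
        }
    '
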
Comment by Bruce Lucas (Inactive) [ 28/Sep/18 ]

A git bisect identified the following commit as the cause of this regression:

a5656bc506fcdfe36d44cfb3c17969cbe35695c0 is the first bad commit
SERVER-31679 Limit journaling frequency without oplog waiters
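
For reference, a bisect of this shape can be automated if the repro is wrapped in a script that exits non-zero when the update phase exceeds a time threshold. A sketch (check-regression.sh is a hypothetical wrapper, not what was actually run):

    # Automated bisect between the 3.6.5 and 3.6.6 release tags.
    git bisect start
    git bisect bad r3.6.6
    git bisect good r3.6.5
    # check-regression.sh (hypothetical) builds mongod, runs the update phase,
    # and exits non-zero if it takes longer than a chosen threshold.
    git bisect run ./check-regression.sh
    git bisect reset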

Comment by Bruce Lucas (Inactive) [ 20/Sep/18 ]

It appears that this regression is fixed in 4.0.2 (but not in 3.6.8); in fact, performance is both improved and less erratic on the two workloads attached to this ticket:

repro-10MBx2-regression.sh

repro-10MBx2-erratic-1.sh

Each chart shows four runs, marked A-D, using 3.6.5, 3.6.6, 3.6.8, and 4.0.2 respectively.

Comment by Benety Goh [ 31/Jul/18 ]

moving code review from SERVER-36221 to SERVER-36373

Comment by Alexander Gorrod [ 26/Jul/18 ]

I'd start by looking at whether WT-4105 caused the change in behavior - it's directly related.

Comment by Bruce Lucas (Inactive) [ 25/Jul/18 ]

Attached is a simpler script, repro-10MBx2-regression.sh, that shows the regression in 3.6.6. Also attached is stacks-regression.txt; the stack traces look like they often show the update in __rec_split_write.
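
(For anyone trying to reproduce: one way to grab comparable stack samples from a running mongod is sketched below using gdb; I'm not claiming the attached file was collected this way, and MONGOD_PID is a placeholder for the pid of the node under test.)

    # Attach to the running mongod and dump all thread stacks to a file.
    gdb -batch -ex "thread apply all bt" -p "$MONGOD_PID" > stacks.txt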

That script stops one of the secondaries while the updates are being applied. Interestingly, the regression disappears if the secondary is not stopped. However, depending on factors like document size and number of documents, the throughput under 3.6.6 can be very erratic even with all members of the replica set running, e.g.:

3.6.5 is on the left, 3.6.6 on the right (see attached erratic-1.png). Also attached are the repro script repro-10MBx2-erratic-1.sh and the FTDC data erratic-1.tar.
