[SERVER-44991] Performance regression in indexes with keys with common prefixes Created: 06/Dec/19  Updated: 29/Oct/23  Resolved: 06/Jan/20

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 4.2.1
Fix Version/s: 4.2.3

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Haribabu Kommi
Resolution: Fixed Votes: 0
Labels: KP42, perf-escapes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Large_page_size_in_4.2.png     PNG File With_4.2_fix_for_prefix_problem.png     PNG File compare.png     PNG File oplog suffix compression.png     PNG File page_max_32K_still_perf_difference.png     File repro.sh     PNG File 屏幕快照 2019-12-17 下午11.02.32.png     PNG File 屏幕快照 2019-12-17 下午4.55.18.png     PNG File 屏幕快照 2019-12-19 上午10.29.17.png    
Issue Links:
Depends
depends on WT-5319 Avoid clearing the saved last-key whe... Closed
Duplicate
is duplicated by SERVER-44752 Big CPU performance regression betwee... Closed
Related
is related to SERVER-44948 High cache pressure in 4.2 replica se... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Storage Engines 2019-12-16, Storage Engines 2019-12-30, Storage Engines 2020-01-13
Participants:
Case:
Story Points: 1

 Description   

Test creates 5 collections with 10 indexes each. Indexes are designed to have keys with substantial common prefixes; this seemed to be important to generate the issue. Collections are populated, then are sparsely updated in parallel, aiming for roughly 1 update per page, in order to generate dirty pages at a high rate with little application work.

Left is 4.0.13, right is 4.2.1. A-B and C-D are the update portions of the test. The performance regression is seen in the timespan for the update portion, and in the average latency.

The rate of pages written to disk and pages evicted is substantially lower in 4.2.1. In this test dirty fill ratio is pegged at 20%, so the rate at which pages can be written to disk is the bottleneck.

The test is run on a 24-CPU machine, so the CPU utilization during both tests is roughly what would be expected with ~5 constantly active application threads, plus 3-4 constantly active eviction threads. But in spite of the same CPU activity for eviction, we are evicting pages at a much lower rate, so we must be using more CPU per page evicted in 4.2.1. Perf shows that this additional CPU activity is accounted for by __wt_row_leaf_key_work.
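For reference, a minimal pymongo sketch of a workload in the spirit of the attached repro.sh (collection names, key shapes, document counts, and the update pattern below are illustrative assumptions, not the exact script):

# Illustrative sketch only -- see the attached repro.sh for the actual test.
from pymongo import MongoClient, ASCENDING
import random

db = MongoClient("mongodb://localhost:27017").test   # assumed local mongod
NDOCS = 1_000_000                                     # assumed documents per collection

for c in range(5):
    coll = db[f"c{c}"]
    # 10 secondary indexes whose keys share a substantial common prefix
    for i in range(10):
        coll.create_index([(f"k{i}", ASCENDING)])
    coll.insert_many(
        {"_id": n, **{f"k{i}": f"common_prefix_{i:02d}_{n:012d}" for i in range(10)}}
        for n in range(NDOCS)
    )

# Sparse updates: touch roughly one document per page so pages are dirtied at a
# high rate with little application work (single-threaded here for brevity; the
# real test updates the collections in parallel).
for _ in range(100_000):
    c, i, n = random.randrange(5), random.randrange(10), random.randrange(NDOCS)
    db[f"c{c}"].update_one({"_id": n},
                           {"$set": {f"k{i}": f"common_prefix_{i:02d}_{n:012d}_u"}})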



 Comments   
Comment by Haribabu Kommi [ 06/Jan/20 ]

Thanks for confirming the fix, zyd_com. The fix will be available in the next 4.2 minor version.

Comment by Githook User [ 02/Jan/20 ]

Author:

{'name': 'Hari Babu Kommi', 'email': 'haribabu.kommi@mongodb.com', 'username': 'kommiharibabu'}

Message: WT-5319 Avoid clearing the saved last-key when no instantiated key (#5041)

Reconciliation has some fast-path code to work on prefix-compressed
keys: to avoid figuring out each key from scratch (which would
require going back some number of K/V pairs to a key that's not
prefix-compressed and then rolling forward), we use the complete
key we just built for the last K/V pair's reconciliation and
apply the current key's prefix-compression change to it, saving
us quite a bit of work. The change in WT-4642 clears that saved
last-key information, and it leads to a performance drop compared
to the 4.0 version. More details are available in SERVER-44991.
Branch: mongodb-4.2
https://github.com/wiredtiger/wiredtiger/commit/cc97c89d0a0f5db433191bf4b8d76c6335649256

Comment by Zhang Youdong [ 19/Dec/19 ]

This patch works for me, thank you!

 

Comment by Githook User [ 13/Dec/19 ]

Author:

{'name': 'Hari Babu Kommi', 'email': 'haribabu.kommi@mongodb.com', 'username': 'kommiharibabu'}

Message: WT-5319 Avoid clearing the saved last-key when no instantiated key (#5041)

Reconciliation has some fast-path code to work on prefix-compressed
keys: to avoid figuring out each key from scratch (which would
require going back some number of K/V pairs to a key that's not
prefix-compressed and then rolling forward), we use the complete
key we just built for the last K/V pair's reconciliation and
apply the current key's prefix-compression change to it, saving
us quite a bit of work. The change in WT-4642 clears that saved
last-key information, and it leads to a performance drop compared
to the 4.0 version. More details are available in SERVER-44991.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/cc97c89d0a0f5db433191bf4b8d76c6335649256

Comment by Haribabu Kommi [ 13/Dec/19 ]

keith.bostic, Thanks for the details.

I reverted the first change in the WT-4642 commit, ran the reproducer on top of it, and the performance is back.
I have attached the latest FTDC data comparison with the fixed version.

 
 

Comment by Keith Bostic (Inactive) [ 12/Dec/19 ]

Thank you, haribabu.kommi, bruce.lucas, I think you've figured it out.

In short, there's some fast-path code in reconciliation (the same code you flagged, here). What's happening is we're working our way through the K/V pairs on the page, and they may be prefix compressed. Rather than figure out each key from scratch (which would require going back some number of K/V pairs to a key that's not prefix-compressed and then rolling forward), we use the complete key we just built for the last K/V pair's reconciliation, and apply the current key's prefix-compression change to it, saving us quite a bit of work. The change in WT-4642 clears that saved last-key information, so we do it the hard way every time. It works, but it's slower because we're doing a bunch more work, as you both explained.
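To make that concrete, here is a toy sketch of the two paths (Python pseudocode rather than WiredTiger's actual C; Cell, prefix_len, and suffix are illustrative stand-ins for the on-page cell fields):

from collections import namedtuple

# Hypothetical on-page cell: the first `prefix_len` bytes of the key are
# shared with the previous key and are not stored again.
Cell = namedtuple("Cell", ["prefix_len", "suffix"])

def key_from_scratch(cells, i):
    """Slow path: walk back to a key that isn't prefix-compressed, then roll forward."""
    j = i
    while cells[j].prefix_len != 0:
        j -= 1
    key = cells[j].suffix
    for k in range(j + 1, i + 1):
        key = key[:cells[k].prefix_len] + cells[k].suffix
    return key

def reconcile_keys(cells):
    """Fast path: reuse the complete key built for the previous K/V pair."""
    last_key = None
    for cell in cells:
        if last_key is None:                      # first key is stored in full
            key = cell.suffix
        else:                                     # apply this cell's prefix change
            key = last_key[:cell.prefix_len] + cell.suffix
        last_key = key   # clearing this (as WT-4642 did) forces key_from_scratch()
        yield key

# Example: both paths recover the same keys from the compressed cells.
cells = [Cell(0, b"index_key_000"), Cell(10, b"101"), Cell(10, b"202")]
assert list(reconcile_keys(cells)) == [key_from_scratch(cells, i) for i in range(3)]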

We can probably just back out the first change in the WT-4642 commit, and we'll go back to our usual performance in the case of prefix compression.

I want to stare pretty hard at it though, both to ensure we didn't break something else while this change was in place (it was only 8 months ago, so I'm not too worried about that), and to make sure the compiler's warning that those fields were used in an uninitialized way (the report that resulted in the WT-4642 changes) wasn't correct.

Comment by Haribabu Kommi [ 12/Dec/19 ]

bruce.lucas, Thanks for your further investigation of the issue.

I did a little investigating with gdb to see if I could understand why perf shows __wt_rec_row_leaf behaving differently, i.e.

  • 4.2: mostly calls the more expensive __wt_row_leaf_key_copy
  • 4.0: mostly calls the less expensive __rec_cell_build_leaf_key

Your analysis of the function calls above is absolutely correct. The large difference in the number of calls to these two functions between 4.0 and 4.2 is the reason behind the performance drop.

It looks like __wt_rec_row_leaf calls __wt_row_leaf_key_info here, and then

  • in 4.2
    • that returns false here, so this was case WT_CELL_FLAG, which I think is an uninstantiated prefix-compressed key
    • so __wt_rec_row_leaf then ends up calling the expensive __wt_row_leaf_key_copy to instantiate it, which we see in our perf stacks here
  • in 4.0
    • that returns true here, so this was an instantiated key.
    • so __wt_rec_row_leaf calls the less expensive __rec_cell_build_leaf_key, which we see in our perf stacks here

There is a small correction to the above analysis. The WT_CELL_FLAG scenario happens in 4.0.13 as well, and __wt_row_leaf_key_info returns false just as it does in 4.2.1, but instead of going to __wt_row_leaf_key_copy, execution reaches __rec_cell_build_leaf_key because of the following code here.

I added counters to find out how many times the above check routes execution into __rec_cell_build_leaf_key in both 4.0.13 and 4.2.1.

          direct to build function    based on if check    direct to copy function
4.2.1                  193,072,400                    0                897,334,132
4.0.13                 189,028,633          876,993,673                  4,037,056

Observing the counters, most of the "direct to copy function" calls seen in 4.2.1 are instead converted into "build" calls in 4.0.13.

The same "if check code" which I referred above present in both versions 4.0.13 and 4.2. But because of the code here, the output size coming as zero and it leads to condition failure and it always go to __wt_row_leaf_key_copy. The size resetting code was added as part of WT-4642.

I will look further into whether using "size" without resetting it causes any problems.

Comment by Bruce Lucas (Inactive) [ 11/Dec/19 ]

I did a little investigating with gdb to see if I could understand why perf shows __wt_rec_row_leaf behaving differently, i.e.

  • 4.2: mostly calls the more expensive __wt_row_leaf_key_copy
  • 4.0: mostly calls the less expensive __rec_cell_build_leaf_key

It looks like __wt_rec_row_leaf calls __wt_row_leaf_key_info here, and then

  • in 4.2
    • that returns false here, so this was case WT_CELL_FLAG, which I think is an uninstantiated prefix-compressed key
    • so __wt_rec_row_leaf then ends up calling the expensive __wt_row_leaf_key_copy to instantiate it, which we see in our perf stacks here
  • in 4.0
    • that returns true here, so this was an instantiated key.
    • so __wt_rec_row_leaf calls the less expensive __rec_cell_build_leaf_key, which we see in our perf stacks here

I think this means that when reconciliation is done in 4.0 the keys are already instantiated, whereas in 4.2 they are not. keith.bostic is this a reasonable interpretation of the above? Any idea why this might be the case?

(Note: when following the behavior with gdb you have to be careful to note which wt table the page being reconciled belongs to, as only some of the tables use prefix compression. A simple way to identify them is that they are the set of index-*.wt tables with larger sizes; the smaller index-*.wt tables are the _id indexes which are not of interest here, and the collection-*.wt tables are also not of interest.)
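A quick way to pick those candidates out on disk (a small sketch; the dbpath and a flat file layout are assumptions):

import glob, os

DBPATH = "/data/db"   # assumed dbpath
for f in sorted(glob.glob(os.path.join(DBPATH, "index-*.wt")),
                key=os.path.getsize, reverse=True):
    print(f"{os.path.getsize(f):>14,}  {os.path.basename(f)}")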

Comment by Haribabu Kommi [ 11/Dec/19 ]

bruce.lucas, thanks for providing more details; they are really helpful. I set memory_page_image_max=32k on every collection to rule out large page size as the cause of the performance drop, but as you said earlier, the performance doesn't improve even though the number of pages is the same in both 4.0 and 4.2.

Upon further investigation, we believe that WT-4489 may be the reason behind the performance drop. I ran a custom-built MongoDB using the WiredTiger drop from just before WT-4489 was merged into MongoDB, and __wt_row_leaf_key_work is no longer a major function in the perf sampling report. I also checked whether the extra structure members added by that change lead to an increase in cache misses, but it turns out cache misses are not the problem.

4.0.13

     935711.027783      task-clock (msec)       #    5.048 CPUs utilized
 2,339,806,753,771      cycles                  #    2.501 GHz
 3,419,184,650,000      instructions            #    1.46  insn per cycle
    11,780,087,731      cache-references        #   12.589 M/sec
     5,259,751,797      cache-misses            #   44.650 % of all cache refs

     185.379422516 seconds time elapsed

4.2.1

    1357463.717421      task-clock (msec)       #    5.439 CPUs utilized
 3,404,341,758,478      cycles                  #    2.508 GHz
 6,706,089,784,541      instructions            #    1.97  insn per cycle
    13,096,446,382      cache-references        #    9.648 M/sec
     5,437,768,929      cache-misses            #   41.521 % of all cache refs

     249.566129495 seconds time elapsed

I will continue further to find out the root cause of WT-4489 changes.

Comment by Bruce Lucas (Inactive) [ 10/Dec/19 ]

That's a good observation and could be an issue for some workloads, and is worth looking into, but I don't think that's what's going on here.

If the density of updates is <1 per 4.0 page, then indeed larger 4.2 pages mean that more bytes will be reconciled and written to disk per update, and that could impact performance for workloads that do sparse updates.

However in this case the density of updates is ~1 or more per 4.0 page, so while the 4.2 pages are larger, each page will contain more updates when it is written, so the bytes reconciled and written per update will be about the same. In that case I would think the cost should be about the same.

We can look at the WT pages and bytes written from cache metrics via db.c.stats({indexDetails: true}) to confirm this and get more detail about the page sizes.

                       bytes written    pages written       bytes/page
                         from cache       from cache
 
4.0
collection table       3,736,137,272       131,096           28,499
10x index tables       2,187,180,200       169,970           12,868
 
4.2
collection table       3,763,090,764        32,135          117,103
10x index tables       2,197,278,800       171,100           12,842

This confirms that the total number of bytes reconciled and written is the same in 4.2, so shouldn't the cost be the same?

It also confirms that bytes/page is about 4x larger in 4.2, but only for the collection table - the index tables have the same page sizes in 4.2. As far as I've been able to tell the extra CPU time we observed in perf is related to prefix compression, which is only enabled for the index tables, and their page sizes haven't changed.
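For anyone reproducing this, a small pymongo sketch of how those per-table numbers can be pulled together (the "bytes written from cache" / "pages written from cache" names are the ones in the table header; treat the exact stat paths as assumptions if your version's collStats output differs):

from pymongo import MongoClient

db = MongoClient().test                                  # assumed database name
for name in db.list_collection_names():
    s = db.command("collStats", name, indexDetails=True)
    cache = s["wiredTiger"]["cache"]
    print(name,
          cache["bytes written from cache"],
          cache["pages written from cache"])
    for ix, ix_stats in s.get("indexDetails", {}).items():
        c = ix_stats["cache"]
        print("   ", ix,
              c["bytes written from cache"],
              c["pages written from cache"])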

Comment by Haribabu Kommi [ 10/Dec/19 ]

During analysis of the FTDC data, we found that the cache bytes belonging to page images are almost the same for 4.2 and 4.0, but there is a difference in the number of pages held in cache compared to 4.0.13. This suggests that pages in 4.2 are larger than in 4.0.13, which could cause a performance problem during reconciliation.

Currently, I am investigating the reason behind large page sizes in 4.2.

Comment by Bruce Lucas (Inactive) [ 09/Dec/19 ]

To summarize offline discussion with Keith:

Those numbers are actually perf sample counts, and perf sampling reflects the number of calls times the cost of each call, so it is inconclusive by itself as to the number of calls.

But in this case perf shows us that the time in __wt_rec_row_leaf is largely __wt_row_leaf_key_work in 4.2 but largely __rec_cell_build_leaf_key in 4.0, and we know that __wt_rec_row_leaf makes a decision between those two calls, so it seems pretty likely to me that it's making a different decision between 4.0 and 4.2, resulting in (a lot) more calls to the expensive __wt_row_leaf_key_work in 4.2.

Comment by Keith Bostic (Inactive) [ 09/Dec/19 ]

Interestingly, the time spent in __wt_row_leaf_key_work called from __wt_row_search only increased marginally in 4.2, whereas the time spent in __wt_row_leaf_key_work called from eviction increased from ~0 to half our cycles. Could it be that __wt_row_leaf_key_work isn't much more expensive, but for some reason we are calling it during eviction far more often?

I can't think of any reason this would be the case, and I don't see any significant changes that would imply it.

haribabu.kommi, can you instrument your reproducer and tell us if we're calling __wt_row_leaf_key_work from eviction more often than before?

 

Edit:

bruce.lucas notes the leading numbers are call counts, so we know there was a change in total calls.

Comment by Bruce Lucas (Inactive) [ 09/Dec/19 ]

There's no useful stats on prefix compression in ftdc, but you can check the stats for a particular collection using e.g. db.c1.stats({indexDetails:true}). Doing that on the repro shows that we are as expected getting significant benefit from prefix compression in the repro: index checkpoint size is ~10 MB, after building the index "leaf page key bytes discarded using prefix compression" is about 200 MB, and after doing the updates "leaf page key bytes discarded using prefix compression" is about 500 MB.

Those stats are about the same for 4.0 and 4.2, so the difference in behavior of __wt_rec_row_leaf mentioned above doesn't seem to be due to any difference in prefix compression.
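For reference, the same per-index stats output carries that counter; a short pymongo sketch (I'm assuming it sits under the "reconciliation" section of the index's WiredTiger stats):

from pymongo import MongoClient

db = MongoClient().test                                  # assumed database name
s = db.command("collStats", "c1", indexDetails=True)
for ix, ix_stats in s["indexDetails"].items():
    rec = ix_stats["reconciliation"]
    print(ix, rec["leaf page key bytes discarded using prefix compression"])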

Comment by Bruce Lucas (Inactive) [ 09/Dec/19 ]

The top stack traces from perf for eviction are very different between 4.2 and 4.0, and support the possibility that the problem is not in __wt_row_leaf_key_work but rather in how often we call it.

4.0

1813 ...;__wt_evict;__wt_reconcile;__wt_rec_row_leaf;__rec_cell_build_leaf_key
2994 ...;__wt_evict;__wt_reconcile;__wt_rec_row_leaf

4.2

7295 ...;__wt_evict;__wt_reconcile;__wt_rec_row_leaf;__wt_row_leaf_key_copy;__wt_row_leaf_key_work

In other words, in 4.0 we almost never call __wt_row_leaf_key_copy from __wt_rec_row_leaf, but instead call the (apparently more efficient) __rec_cell_build_leaf_key.

Comment by Bruce Lucas (Inactive) [ 09/Dec/19 ]

Regarding priority, I think it will be a significant problem for an unknown fraction of our users. For many of those affected I think it will likely prevent a successful upgrade to 4.2.

Comment by Bruce Lucas (Inactive) [ 09/Dec/19 ]

Can you please clarify the criteria you're using to identify these tickets?

We've been suspecting the issue when we see significant degradation between 4.0 and 4.2 accompanied by high CPU utilization and a low rate of eviction per CPU-second spent on eviction, and we've been confirming it when possible by collecting perf data and observing a large amount of CPU in __wt_row_leaf_key_work during eviction.

The latter signature is very distinctive. From the repro on this ticket:

                                                   4.0     4.2
 
mongod total                                     21725   34489
__wt_row_leaf_key_work total                      3414   17044
__wt_row_leaf_key_work during eviction              14   13426
__wt_row_leaf_key_work during __wt_row_search     3387    3440

In 4.0 we spend almost no time in __wt_row_leaf_key_work during eviction, whereas in 4.2 we spend a large fraction of our cycles there.

Interestingly, the time spent in __wt_row_leaf_key_work called from __wt_row_search only increased marginally in 4.2, whereas the time spent in __wt_row_leaf_key_work called from eviction increased from ~0 to half our cycles. Could it be that __wt_row_leaf_key_work isn't much more expensive, but for some reason we are calling it during eviction far more often?
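One way to get the per-caller breakdown in the table above from the perf data (a sketch; it assumes the stacks have already been folded into "frame;frame;... count" lines, e.g. with stackcollapse-perf.pl -- the exact pipeline used here isn't recorded in the ticket):

from collections import Counter

totals = Counter()
with open("folded.txt") as f:                 # assumed folded-stack file
    for line in f:
        stack, _, count = line.rpartition(" ")
        samples = int(count)
        if "__wt_row_leaf_key_work" not in stack:
            continue
        totals["__wt_row_leaf_key_work total"] += samples
        if "__wt_evict" in stack:
            totals["during eviction"] += samples
        if "__wt_row_search" in stack:
            totals["during __wt_row_search"] += samples

for caller, samples in totals.items():
    print(f"{caller:45s} {samples:>8}")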

Comment by Haribabu Kommi [ 09/Dec/19 ]

bruce.lucas, thanks for the reproducer. I am able to reproduce the performance problem by running it on both 4.0.13 and 4.2.1. Will update once we find additional details.

keith.bostic I checked the FTDC data of the customer for the reported stats that you mentioned. There is no stat available for the leaf pages around that time when they upgraded to 4.2.1, but there is a similar stat available for the internal pages.

Comment by Keith Bostic (Inactive) [ 09/Dec/19 ]

bruce.lucas, there's a WiredTiger statistic 'rec_prefix_compression', which is the "leaf page key bytes discarded using prefix compression", and I wondered if any of the ftdc data shows if these sites are seeing a benefit from prefix compression, or if this problem is a cost for no benefit for them?

Comment by Keith Bostic (Inactive) [ 09/Dec/19 ]

alexander.gorrod pointed out the obvious: as we're not yet writing the additional cell values, the biggest change here is in the size of the cell unpack structure and the code paths that would read the additional cell values if they were there. In other words, it may be structure size and code size changes that we need to focus on.

Comment by Keith Bostic (Inactive) [ 09/Dec/19 ]

bruce.lucas, thanks for pulling together possibly affected tickets. Can you please clarify the criteria you're using to identify these tickets?

Also, what's your sense of the priority for this problem?

Comment by Bruce Lucas (Inactive) [ 07/Dec/19 ]

Also, I'm not sure whether reconfiguring prefix compression to use it less often is satisfactory. That would make the problem occur less frequently, but would not make it less severe when it does occur. Also, would it help existing customers moving to 4.2?

Comment by Bruce Lucas (Inactive) [ 07/Dec/19 ]

I'm wondering if there's more going on than simply unpacking cells takes more time:

  • The ratio of CPU usage we're seeing suggests that unpacking cells is a large multiple more expensive than it used to be, if that's all that's going on. It might be worth looking into whether that's really expected given the code changes, and whether it's possible that we're unpacking cells more often than before.
  • So far we have only seen this effect when evicting pages, not when searching. Maybe we just don't happen to have hit on a workload that does a lot of searching, but since this involves indexes I think searching must be common. This probably bears more investigation.
Comment by Keith Bostic (Inactive) [ 06/Dec/19 ]

In the case of prefix compression, we unpack a lot more cells in order to figure out an entry's key on the page, and so searches and eviction will take more time (if unpacking a cell takes more time).

It will be difficult to speed up unpacking cells for prefix compression because the key is the last thing in the cell, and all of the intermediate fields have to be cracked to skip over them; there's no way to skip them and go directly to the key.

WiredTiger already instantiates a set of keys, in case the prefix compression extends over the entire page, to avoid losing performance to too much prefix compression. We could certainly increase the frequency of the keys we instantiate, which should help with this problem.

Additionally, it's possible to configure WiredTiger to only do prefix compression if there's a sufficient gain (the default "minimum gain" is 4 bytes, and MongoDB doesn't configure that value explicitly). Instead of turning off prefix compression entirely, we could limit prefix compression to keys where there's a larger gain, especially since strings of common bytes in the keys will quite likely be picked up and compressed by the stream compression engine.
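For experimentation, MongoDB already lets you pass a per-index WiredTiger configuration string at index build time, so raising the minimum gain can be tried without a server change; a hedged pymongo sketch (the value 16 is an arbitrary choice for illustration):

from pymongo import MongoClient, ASCENDING

coll = MongoClient().test.c1                  # assumed namespace
coll.create_index(
    [("k0", ASCENDING)],
    name="k0_prefix_min16",
    # Only prefix-compress keys when at least 16 bytes would be saved
    # (WiredTiger's default prefix_compression_min is 4).
    storageEngine={"wiredTiger": {"configString": "prefix_compression_min=16"}},
)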

In summary, it might be time to reconsider how we configure prefix compression, given that unpacking cells is more expensive than it used to be.  Obviously, it also depends on the relative cost of CPU vs. storage in our hardware platforms.

Edit:
Another possible improvement occurs to me: we currently pack & unpack these cell values individually, even when they're expected values (so, for example, WT_TS_NONE, or WT_TS_MAX). While we're not yet writing the values, we will be writing them in the next release and we're going to read/write the "expected" values a lot. We might want to consider using byte string matching for reading/writing the expected values, rather than packing/unpacking items individually, that should be faster.
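A toy illustration of that idea (a purely hypothetical layout, not WiredTiger's actual cell format): compare the raw bytes against the pre-packed encoding of the expected values and only fall back to field-by-field unpacking when they differ.

import struct

WT_TS_NONE = 0
# Hypothetical packed layout: two 8-byte little-endian timestamps.
EXPECTED = struct.pack("<QQ", WT_TS_NONE, WT_TS_NONE)   # pre-packed "expected" values

def unpack_ts_pair(buf: bytes):
    # Fast path: one memcmp-style comparison instead of decoding two fields.
    if buf[:len(EXPECTED)] == EXPECTED:
        return (WT_TS_NONE, WT_TS_NONE), buf[len(EXPECTED):]
    # Slow path: decode each field individually.
    start, stop = struct.unpack_from("<QQ", buf)
    return (start, stop), buf[struct.calcsize("<QQ"):]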

 

Comment by Bruce Lucas (Inactive) [ 06/Dec/19 ]

Taking another look at the ftdc, in the 4.2.1 run on average about 2 of the application threads are doing eviction instead of doing application work, vs about 0.5 in the 4.0.13 run, so the total CPU time spent on eviction is actually a good bit larger in 4.2.1 than in 4.0.13, yet as noted the rate at which pages are evicted is lower.
