[SERVER-44991] Performance regression in indexes with keys with common prefixes Created: 06/Dec/19 Updated: 29/Oct/23 Resolved: 06/Jan/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 4.2.1 |
| Fix Version/s: | 4.2.3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Haribabu Kommi |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | KP42, perf-escapes |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Storage Engines 2019-12-16, Storage Engines 2019-12-30, Storage Engines 2020-01-13 |
| Participants: | |
| Case: | (copied to CRM) |
| Story Points: | 1 |
| Description |
|
Test creates 5 collections with 10 indexes each. Indexes are designed to have keys with substantial common prefixes; this seemed to be important to generate the issue. Collections are populated, then are sparsely updated in parallel, aiming for roughly 1 update per page, in order to generate dirty pages at a high rate with little application work.
Left is 4.0.13, right is 4.2.1. A-B and C-D are the update portions of the test. The performance regression is seen in the timespan for the update portion, and in the average latency. The rate of pages written to disk and pages evicted is substantially lower in 4.2.1. In this test dirty fill ratio is pegged at 20%, so the rate at which pages can be written to disk is the bottleneck. The test is run on a 24-CPU machine, so the CPU utilization during both tests is roughly what would be expected with ~5 constantly active application threads, plus 3-4 constantly active eviction threads. But in spite of the same CPU activity for eviction, we are evicting pages at a much lower rate, so we must be using more CPU per page evicted in 4.2.1. Perf shows that this additional CPU activity is accounted for by __wt_row_leaf_key_work. |
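As a rough illustration of the kind of workload described above, here is a minimal mongo shell sketch; the collection names, document shape, and counts are assumptions, not the actual reproducer script.

```js
// Hypothetical sketch of the workload: 5 collections x 10 indexes whose keys share a long
// common prefix, then sparse updates to dirty pages cheaply. Names and counts are illustrative.
const PREFIX = "x".repeat(100);              // long shared prefix -> prefix-compressed index keys
const DOCS = 100000;
for (let c = 0; c < 5; c++) {
    const coll = db["c" + c];
    coll.drop();
    for (let i = 0; i < 10; i++) {
        coll.createIndex({["k" + i]: 1});    // 10 secondary indexes per collection
    }
    const bulk = coll.initializeUnorderedBulkOp();
    for (let n = 0; n < DOCS; n++) {
        const doc = {_id: n};
        for (let i = 0; i < 10; i++) doc["k" + i] = PREFIX + n;
        bulk.insert(doc);
    }
    bulk.execute();
}
// Update phase (the real test runs this from several threads in parallel): sparse updates,
// aiming for roughly one update per page, so dirty pages are generated with little application work.
for (let c = 0; c < 5; c++) {
    const coll = db["c" + c];
    for (let n = 0; n < 1000; n++) {
        coll.updateOne({_id: Math.floor(Math.random() * DOCS)},
                       {$set: {k0: PREFIX + Math.random()}});
    }
}
```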
| Comments |
| Comment by Haribabu Kommi [ 06/Jan/20 ] | |||||||||||||||
|
Thanks for confirming the fix, zyd_com. The fix will be available in the next 4.2 minor version. | |||||||||||||||
| Comment by Githook User [ 02/Jan/20 ] | |||||||||||||||
|
Author: Hari Babu Kommi <haribabu.kommi@mongodb.com> (kommiharibabu)
Message: Reconciliation has some fast-path code to work on prefix-compressed | |||||||||||||||
| Comment by Zhang Youdong [ 19/Dec/19 ] | |||||||||||||||
|
This patch works for me, thank you!
| |||||||||||||||
| Comment by Githook User [ 13/Dec/19 ] | |||||||||||||||
|
Author: Hari Babu Kommi <haribabu.kommi@mongodb.com> (kommiharibabu)
Message: Reconciliation has some fast-path code to work on prefix-compressed | |||||||||||||||
| Comment by Haribabu Kommi [ 13/Dec/19 ] | |||||||||||||||
|
keith.bostic, thanks for the details. I reverted the first change in the WT-4642 commit. | |||||||||||||||
| Comment by Keith Bostic (Inactive) [ 12/Dec/19 ] | |||||||||||||||
|
Thank you, haribabu.kommi, bruce.lucas, I think you've figured it out. In short, there's some fast-path code in reconciliation (the same code you flagged here). What's happening is we're working our way through the K/V pairs on the page, and they may be prefix-compressed. Rather than figure out each key from scratch (which would require going back some number of K/V pairs to a key that's not prefix-compressed and then rolling forward), we use the complete key we just built for the last K/V pair's reconciliation and apply the current key's prefix-compression change to it, saving us quite a bit of work. The change in WT-4642 defeats that fast path. We can probably just back out the first change in the WT-4642 commit, and we'll go back to our usual performance in the case of prefix compression. I want to stare pretty hard at it though, both to ensure we didn't break something else while this change was in place (it was only 8 months ago, so I'm not too worried about that), but also to make sure the compiler's warning that those fields were used in an uninitialized way (the report that resulted in the WT-4642 change) is still addressed. | |||||||||||||||
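As a conceptual illustration of the fast path described above (plain JavaScript rather than WiredTiger's C, with the cell layout simplified to a prefix length plus a suffix; a sketch of the idea, not the actual implementation):

```js
// A prefix-compressed cell stores (prefixLen, suffix), meaning "reuse the first prefixLen
// bytes of the previously built key, then append suffix".
// e.g. cells = [{prefixLen: 0, suffix: "user/1000"}, {prefixLen: 5, suffix: "1001"}, ...]

function buildKeySlow(cells, i) {
    // Slow path: walk back to the nearest non-compressed key, then roll forward to entry i.
    let j = i;
    while (cells[j].prefixLen > 0) j--;
    let key = cells[j].suffix;
    for (j = j + 1; j <= i; j++) {
        key = key.slice(0, cells[j].prefixLen) + cells[j].suffix;
    }
    return key;
}

function buildKeyFast(prevKey, cell) {
    // Fast path used during reconciliation: we already hold the fully built key of the
    // previous K/V pair, so only the current cell's delta has to be applied.
    return prevKey.slice(0, cell.prefixLen) + cell.suffix;
}
```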
| Comment by Haribabu Kommi [ 12/Dec/19 ] | |||||||||||||||
|
bruce.lucas, thanks for your further investigation of the issue.
Your analysis of the function calls above is absolutely correct. The large difference in the number of calls to these two functions between 4.0 and 4.2 is the reason behind the performance drop.
There is a small correction to the above analysis. The WT_CELL_FLAG scenario also happens in 4.0.13, and __wt_row_leaf_key_info returns false there just as it does in 4.2.1, but instead of going to __wt_row_leaf_key_copy it reaches __rec_cell_build_leaf_key because of the following code here. I added counters to find out how many times that check routes us into __rec_cell_build_leaf_key in both 4.0.13 and 4.2.1.
If we observe the counters, most of the "direct to copy" calls are converted into "build" calls in 4.0.13. The same "if" check I referred to above is present in both 4.0.13 and 4.2. But because of the code here, the output size comes back as zero, the condition fails, and we always go to __wt_row_leaf_key_copy. The size-resetting code was added as part of WT-4642. I will look into whether using "size" without resetting it causes any problems. | |||||||||||||||
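A conceptual sketch of the branch being described, assuming the check is essentially "do we have a usable on-page cell and a non-zero size" (the real logic is C inside __wt_rec_row_leaf; the names and structure here are simplified):

```js
// Conceptual sketch only, not WiredTiger's actual code. After __wt_row_leaf_key_info
// returns false, reconciliation still checks whether the on-page cell and size are usable;
// the size reset added in WT-4642 makes that check fail in 4.2, forcing the expensive path.
function chooseLeafKeyPath(cell, size) {
    if (cell !== null && size !== 0) {
        return "__rec_cell_build_leaf_key";   // cheap path, taken by 4.0.13
    }
    return "__wt_row_leaf_key_copy";          // expensive path, ends up in __wt_row_leaf_key_work
}
```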
| Comment by Bruce Lucas (Inactive) [ 11/Dec/19 ] | |||||||||||||||
|
I did a little investigating with gdb to see if I could understand why perf shows __wt_rec_row_leaf behaving differently, i.e. spending its time in __wt_row_leaf_key_work in 4.2 but in __rec_cell_build_leaf_key in 4.0.
It looks like __wt_rec_row_leaf calls __wt_row_leaf_key_info here, and then takes different paths in 4.0 and 4.2 depending on what that call finds.
I think this means that when reconciliation is done in 4.0 the keys are already instantiated, whereas in 4.2 they are not. keith.bostic, is this a reasonable interpretation of the above? Any idea why this might be the case? (Note: when following the behavior with gdb you have to be careful to note which WT table the page being reconciled belongs to, as only some of the tables use prefix compression. A simple way to identify them is that they are the set of index-*.wt tables with larger sizes; the smaller index-*.wt tables are the _id indexes, which are not of interest here, and the collection-*.wt tables are also not of interest.) | |||||||||||||||
| Comment by Haribabu Kommi [ 11/Dec/19 ] | |||||||||||||||
|
bruce.lucas, thanks for providing more details; they are really helpful. I set memory_page_image_max=32k on every collection to rule out large page size as the cause of the performance drop, but as you said earlier, the performance doesn't improve even when the number of pages is the same in both 4.0 and 4.2.
Upon further investigating the problem, we believe that 4.0.13
4.2.1
I will continue further to find out the root cause of the problem. | |||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 10/Dec/19 ] | |||||||||||||||
|
That's a good observation and could be an issue for some workloads, and is worth looking into, but I don't think that's what's going on here. If the density of updates is <1 per 4.0 page, then indeed larger 4.2 pages means that more bytes will be reconciled and written to disk per update, and that could impact performance for workloads that do sparse updates. However in this case the density of updates is ~1 or more per 4.0 page, so while the 4.2 pages are larger, each page will contain more updates when it is written, so the bytes reconciled and written per update will be about the same. In that case I would think the cost should be about the same. We can look at the WT pages and bytes written from cache metrics via db.c.stats({indexDetails: true}) to confirm this and get more detail about the page sizes.
This confirms that the total number of bytes reconciled and written is the same in 4.2, so shouldn't the cost be the same? It also confirms that bytes/page is about 4x larger in 4.2, but only for the collection table - the index tables have the same page sizes in 4.2. As far as I've been able to tell the extra CPU time we observed in perf is related to prefix compression, which is only enabled for the index tables, and their page sizes haven't changed. | |||||||||||||||
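For reference, one way to pull those per-table numbers out of the shell; the filter on stat names is an assumption, since the exact WiredTiger stat names vary between versions:

```js
// Print every cache stat mentioning "written" for the collection and each of its indexes.
const s = db.c1.stats({indexDetails: true});
function dumpWritten(label, wt) {
    if (!wt || !wt.cache) return;               // guard: section layout varies by version
    for (const [k, v] of Object.entries(wt.cache)) {
        if (/written/.test(k)) print(label, "-", k, ":", v);
    }
}
dumpWritten("collection", s.wiredTiger);
for (const [name, details] of Object.entries(s.indexDetails)) {
    dumpWritten("index " + name, details);
}
```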
| Comment by Haribabu Kommi [ 10/Dec/19 ] | |||||||||||||||
|
During analysis of the FTDC data, we found that the cache bytes belonging to page images are almost the same for 4.2 and 4.0, but the number of pages held in cache differs from 4.0.13. This suggests that the page size in 4.2 is larger than in 4.0.13, which could cause performance problems during reconciliation. Currently, I am investigating the reason behind the large page sizes in 4.2. | |||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 09/Dec/19 ] | |||||||||||||||
|
To summarize offline discussion with Keith: those numbers are actually perf sample counts, and perf sampling tells you the number of calls times the cost of each call, so they are inconclusive by themselves as to the number of calls. But in this case perf shows us that the time in __wt_rec_row_leaf is largely __wt_row_leaf_key_work in 4.2 but largely __rec_cell_build_leaf_key in 4.0, and we know that __wt_rec_row_leaf makes a decision between those two calls, so it seems pretty likely to me that it's making a different decision between 4.0 and 4.2, resulting in (a lot) more calls to the expensive __wt_row_leaf_key_work in 4.2. | |||||||||||||||
| Comment by Keith Bostic (Inactive) [ 09/Dec/19 ] | |||||||||||||||
I can't think of any reason this would be the case, and I don't see any significant changes that would imply it. haribabu.kommi, can you instrument your reproducer and tell us if we're calling __wt_row_leaf_key_work from eviction more often than before?
Edit: bruce.lucas notes the leading numbers are call counts, so we know there was a change in total calls. | |||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 09/Dec/19 ] | |||||||||||||||
|
There are no useful stats on prefix compression in ftdc, but you can check the stats for a particular collection using e.g. db.c1.stats({indexDetails:true}). Doing that on the repro shows that, as expected, we are getting significant benefit from prefix compression: index checkpoint size is ~10 MB, after building the index "leaf page key bytes discarded using prefix compression" is about 200 MB, and after doing the updates it is about 500 MB. Those stats are about the same for 4.0 and 4.2, so the difference in behavior of __wt_rec_row_leaf mentioned above doesn't seem to be due to any difference in prefix compression. | |||||||||||||||
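For anyone wanting to repeat the check, a sketch of reading that statistic per index; the assumption that it sits under the "reconciliation" section of indexDetails may need adjusting per version:

```js
// Print "leaf page key bytes discarded using prefix compression" for each index of c1.
const s = db.c1.stats({indexDetails: true});
for (const [name, details] of Object.entries(s.indexDetails)) {
    const rec = details.reconciliation || {};   // assumed section name for this stat
    print(name, ":", rec["leaf page key bytes discarded using prefix compression"]);
}
```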
| Comment by Bruce Lucas (Inactive) [ 09/Dec/19 ] | |||||||||||||||
|
The top stack traces from perf for eviction are very different between 4.2 and 4.0, and support the possibility that the problem is not in __wt_row_leaf_key_work but rather in how often we call it. 4.0:
| 4.2:
In other words, in 4.0 we almost never call __wt_row_leaf_key_copy from __wt_rec_row_leaf, but instead call the (apparently more efficient) __rec_cell_build_leaf_key. | |||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 09/Dec/19 ] | |||||||||||||||
|
Regarding priority, I think it will be a significant problem for an unknown fraction of our users. For many of those affected I think it will likely prevent a successful upgrade to 4.2. | |||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 09/Dec/19 ] | |||||||||||||||
We've been suspecting the issue when we see significant degradation between 4.0 and 4.2 accompanied by high CPU utilization and a low rate of eviction per CPU-second spent on eviction, and we've been confirming it when possible by collecting perf data and observing a large amount of CPU in __wt_row_leaf_key_work during eviction. The latter signature is very distinctive. From the repro on this ticket:
In 4.0 we spend almost no time in __wt_row_leaf_key_work during eviction, whereas in 4.2 we spend a large fraction of our cycles there. Interestingly, the time spent in __wt_row_leaf_key_work called from __wt_row_search only increased marginally in 4.2, whereas the time spent in __wt_row_leaf_key_work called from eviction increased from ~0 to half our cycles. Could it be that __wt_row_leaf_key_work isn't much more expensive, but for some reason we are calling it during eviction far more often? | |||||||||||||||
| Comment by Haribabu Kommi [ 09/Dec/19 ] | |||||||||||||||
|
bruce.lucas, thanks for the reproducer. I am able to reproduce the performance problem by running it on both 4.0.13 and 4.2.1. Will update once we find additional details. keith.bostic, I checked the customer's FTDC data for the stats you mentioned. There is no stat available for the leaf pages around the time they upgraded to 4.2.1, but there is a similar stat available for the internal pages.
| |||||||||||||||
| Comment by Keith Bostic (Inactive) [ 09/Dec/19 ] | |||||||||||||||
|
bruce.lucas, there's a WiredTiger statistic, 'rec_prefix_compression' ("leaf page key bytes discarded using prefix compression"), and I wondered whether any of the ftdc data shows these sites seeing a benefit from prefix compression, or whether this problem is all cost and no benefit for them? | |||||||||||||||
| Comment by Keith Bostic (Inactive) [ 09/Dec/19 ] | |||||||||||||||
|
alexander.gorrod pointed out the obvious: as we're not yet writing the additional cell values, the biggest change here is in the size of the cell unpack structure and the code paths that would read the additional cell values if they were there. In other words, it may be structure size and code size changes that we need to focus on. | |||||||||||||||
| Comment by Keith Bostic (Inactive) [ 09/Dec/19 ] | |||||||||||||||
|
bruce.lucas, thanks for pulling together possibly affected tickets. Can you please clarify the criteria you're using to identify these tickets? Also, what's your sense of the priority for this problem? | |||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 07/Dec/19 ] | |||||||||||||||
|
Also, I'm not sure whether reconfiguring prefix compression to use it less often is satisfactory. That would make the problem occur less frequently, but would not make it less severe when it does occur. And would it help existing customers moving to 4.2? | |||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 07/Dec/19 ] | |||||||||||||||
|
I'm wondering if there's more going on than simply "unpacking cells takes more time":
| |||||||||||||||
| Comment by Keith Bostic (Inactive) [ 06/Dec/19 ] | |||||||||||||||
|
In the case of prefix compression, we unpack a lot more cells in order to figure out an entry's key on the page, and so searches and eviction will take more time (if unpacking a cell takes more time). It will be difficult to speed up unpacking cells for prefix compression because the key is the last thing in the cell, and all of the intermediate fields have to be cracked to skip over them; there's no way to skip them and go directly to the key.
WiredTiger already instantiates a set of keys, in case the prefix compression extends over the entire page, to avoid losing performance to too much prefix compression. We could certainly increase the frequency of the keys we instantiate, which should help with this problem. Additionally, it's possible to configure WiredTiger to only do prefix compression if there's a sufficient gain (the default "minimum gain" is 4 bytes, and MongoDB doesn't configure that value explicitly). Instead of turning off prefix compression entirely, we could limit prefix compression to keys where there's a larger gain, especially since strings of common bytes in the keys will quite likely be picked up and compressed by the stream compression engine.
In summary, it might be time to reconsider how we configure prefix compression, given that unpacking cells is more expensive than it used to be. Obviously, it also depends on the relative cost of CPU vs. storage in our hardware platforms. Edit:
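For completeness, a hypothetical example of tuning prefix compression per index from MongoDB via the storageEngine configString pass-through; this only illustrates the knobs Keith mentions and is not a recommendation from this ticket (the eventual fix was in WiredTiger itself):

```js
// Require a larger minimum gain before a key is prefix-compressed (default is 4 bytes).
db.c1.createIndex(
    {k0: 1},
    {storageEngine: {wiredTiger: {configString: "prefix_compression=true,prefix_compression_min=16"}}}
);
// Or disable prefix compression entirely for a particular index.
db.c1.createIndex(
    {k1: 1},
    {storageEngine: {wiredTiger: {configString: "prefix_compression=false"}}}
);
```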
| |||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 06/Dec/19 ] | |||||||||||||||
|
Taking another look at the ftdc: in the 4.2.1 run, on average about 2 of the application threads are doing eviction instead of application work, vs. about 0.5 in the 4.0.13 run. So the total CPU time spent on eviction is actually a good bit larger in 4.2.1 than in 4.0.13, yet as noted the rate at which pages are evicted is lower. |