[SERVER-47615] Investigate cause of index builds spending more CPU time in WT transaction commit Created: 17/Apr/20 Updated: 27/Oct/23 Resolved: 23/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance |
| Affects Version/s: | 4.4.0-rc0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mark Callaghan (Inactive) | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Assigned Teams: |
Storage Execution
|
| Operating System: | ALL |
| Participants: |
| Description |
|
Update: this regression has been fixed since 4.4.0-rc1.

By "much slower" I mean that it takes about 2X longer to create indexes for Linkbench in 4.4 compared to 4.2. Linkbench is in DSI, although I need to make a few changes to it so that the secondary index is created after the load, as done here. By 4.4 I mean both 4.4.0-rc0 and a build from the v4.4 branch prior to the merge of durable history.

From vmstat and iostat I see that 4.4 uses ~3X more CPU than 4.2. Both 4.2 and 4.4 do a similar amount of IO to storage, both reads and writes. By "IO-bound" I mean that the test server has 16G of RAM and the indexed collection is ~80G per db.$collection.stats; mongod was set up as a single-node replica set.

I will provide ftdc and CPU flamegraphs tomorrow. For now I start with a summary of performance data collected during the create index. Results are provided for MongoDB versions 4.2.5, 4.4.0-rc0 and 4.4pre, where 4.4pre is a build from the v4.4 branch immediately prior to the merge of durable history.

The first table is mostly data from iostat and vmstat, explained in "Index, Per Operation" here. This is performance data divided by the insert rate to estimate the overhead per inserted row. From this table, the value of "secs" is the number of seconds for the create index, and that value is ~3X larger for 4.4pre and 4.4.0-rc0 vs 4.2.5. Also, the CPU consumption per indexed row (cpupi) is ~3X larger for the 4.4 binaries vs 4.2.5. But the amount of reads from and writes to storage (rkbpi, wkbpi) is similar between 4.4 and 4.2, so this looks like a new CPU problem.
The next table is mostly data from iostat and vmstat, explained in "Index, Per Second" here. Most of these are the average values for the counters from vmstat and iostat. From this I see that the IO rates to and from storage (rmbps, wmbps) are much larger for 4.2.5, and one reason for that is that 4.4 has more CPU overhead in between doing IO.
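The per-operation metrics discussed above are whole-test counters divided by the number of indexed rows. A minimal sketch of that normalization, with invented numbers (this is not the actual Linkbench/DSI reporting code; the metric names follow the tables referenced above):

```python
# Hypothetical sketch of the "per operation" normalization: totals come from
# iostat/vmstat sampled over the index build, then each counter is divided by
# the number of indexed rows. All input numbers here are made up.

def per_op_metrics(cpu_secs, read_kb, write_kb, rows, elapsed_secs):
    """Normalize whole-test counters to per-indexed-row overheads."""
    return {
        "secs": elapsed_secs,            # wall time for the create index
        "ips": rows / elapsed_secs,      # indexed rows per second
        "cpupi": cpu_secs / rows,        # CPU seconds per indexed row
        "rkbpi": read_kb / rows,         # KB read from storage per row
        "wkbpi": write_kb / rows,        # KB written to storage per row
    }

m = per_op_metrics(cpu_secs=1200.0, read_kb=8.0e7, write_kb=9.0e7,
                   rows=1.0e9, elapsed_secs=1901.0)
```

Comparing cpupi across binaries at a similar rkbpi/wkbpi is what isolates the regression as a CPU problem rather than an IO problem.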
|
| Comments |
| Comment by Mark Callaghan (Inactive) [ 23/Apr/20 ] | ||||||||||||||
|
The ~30% perf improvement is nice. I will try to remember that later when I start to summarize the perf results I report for 4.0, 4.2 and 4.4. | ||||||||||||||
| Comment by Geert Bosch [ 23/Apr/20 ] | ||||||||||||||
|
Hi Mark, thanks for sharing the updated data. An interesting point to note when comparing the mo425 and mo44pre flame graphs is the time spent in the sorter (NoLimitSorter). In 4.2.5 most of the time was spent comparing items using BSONWoCompare; now that time has completely disappeared, and the Snappy compression of writing the output file becomes a far greater percentage of the total time spent. So, last summer's work of moving to KeyString for the external sorter is evident here in the performance improvements. | ||||||||||||||
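The win described above comes from paying an encoding cost once per document instead of a decode-and-compare cost on every comparison. A toy illustration of the idea (not MongoDB's actual KeyString code): precompute a memcmp-comparable byte key per document, then sort on the cached keys.

```python
# Illustrative sketch, NOT MongoDB's KeyString implementation: instead of
# field-wise comparing decoded documents on every comparison (the
# BSONWoCompare pattern), encode each document once into bytes whose
# bytewise order matches the desired sort order, then sort by those bytes.

import struct

def to_sortable_key(doc):
    # Big-endian unsigned encoding compares bytewise in numeric order;
    # offsetting by 2**63 keeps negative ints ordered correctly.
    return struct.pack(">Q", doc["a"] + 2**63) + doc["b"].encode()

docs = [{"a": 3, "b": "x"}, {"a": -1, "b": "y"}, {"a": 3, "b": "a"}]

# Decorate once, sort on the cached key: one encode per doc, cheap
# byte comparisons after that.
decorated = sorted(((to_sortable_key(d), d) for d in docs),
                   key=lambda kv: kv[0])
ordered = [d for _, d in decorated]
```

With comparison reduced to a byte compare, the remaining cost in the sorter naturally shifts to whatever else touches the data, such as the Snappy compression of the spill files.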
| Comment by Mark Callaghan (Inactive) [ 21/Apr/20 ] | ||||||||||||||
|
You are welcome to close this, and I am happy to answer more questions. | ||||||||||||||
| Comment by Mark Callaghan (Inactive) [ 21/Apr/20 ] | ||||||||||||||
|
Test ended sooner than I expected. These ran on the same server after I fixed the problem explained here. IO-bound create index in 4.4 is faster than 4.2, both before and after durable history was merged. From an efficiency perspective, all versions tested use a similar amount of CPU, but 4.4 with durable history writes ~10X more to storage. In this case my storage wasn't the bottleneck, so it didn't affect the create index time, but that is still a problem. It isn't the target for this ticket, so I will address it elsewhere. A full guide to the metrics is here. The first table has many metrics normalized by (divided by) the insert rate (ips), and wkbpi is a problem for mo440rc[01] – wkbpi is KB written to storage per indexed row.
The next table has many of the same metrics reported as per-second rates, and wmbps (storage MB written/s, from iostat) is a problem for mo440rc[01]. vsz and rss (the size of the mongod process) are also problems.
I will add a flamegraph for each of the versions collected at the same point in time of the index build. The three 4.4 versions look similar, but different from the 4.2.5 flamegraph.
| ||||||||||||||
| Comment by Mark Callaghan (Inactive) [ 21/Apr/20 ] | ||||||||||||||
|
Doing one more test, so I won't have results to share until tomorrow. The big news is that this was a bogus bug report. One of my test servers had a slow SSD, and the lousy results for mo44pre and mo440rc0 above were from that server. That has since been fixed. For the story of the slow SSD, read this | ||||||||||||||
| Comment by Mark Callaghan (Inactive) [ 20/Apr/20 ] | ||||||||||||||
|
geert.bosch - I will provide that on Tuesday | ||||||||||||||
| Comment by Geert Bosch [ 20/Apr/20 ] | ||||||||||||||
|
Mark, great news on rc1. I'm still concerned about __wt_session_copy_values though, as in the normal non-index-build case we use WT transactions all the time. Question: could you add the flame graph for RC1? I'd like to compare with 4.2. | ||||||||||||||
| Comment by Daniel Pasette (Inactive) [ 18/Apr/20 ] | ||||||||||||||
|
That’s a healthy perf gain over 4.2. Thanks for checking, Mark. | ||||||||||||||
| Comment by Mark Callaghan (Inactive) [ 18/Apr/20 ] | ||||||||||||||
|
Good news: perf is better with 4.4.0-rc1. The secs column below is the number of seconds for the create index, and it is 1901 for 4.4.0-rc1, which fixes the 4.4 regression and is now faster than 4.2.5. I will let louis.williams decide when to close this.
| ||||||||||||||
| Comment by Mark Callaghan (Inactive) [ 17/Apr/20 ] | ||||||||||||||
|
Added ascii (perf report --stdio) output that led to the flamegraphs for 4.2.5 and 4.4.0-rc0. WriteUnitOfWork is in there for 4.2.5, but not significant. With respect to being optimistic that this is fixed in 4.4.0-rc1: I will repeat the test for that soon. | ||||||||||||||
| Comment by Louis Williams [ 17/Apr/20 ] | ||||||||||||||
|
It's possible the regression has gone away in 4.4.0-rc1. I would still like to understand what changed between 4.2 and 4.4 that makes the commit() take longer. For example, does __wt_session_copy_values show up in the 4.2 flame graph at all? | ||||||||||||||
| Comment by Mark Callaghan (Inactive) [ 17/Apr/20 ] | ||||||||||||||
|
In the preprocessor output from my mo440rc0 build, HAVE_DIAGNOSTIC is not defined
| ||||||||||||||
| Comment by Mark Callaghan (Inactive) [ 17/Apr/20 ] | ||||||||||||||
|
Unless HAVE_DIAGNOSTIC is defined for one build but not the other, the code has not changed. Perhaps session->cursors has more entries in 4.4. | ||||||||||||||
| Comment by Geert Bosch [ 17/Apr/20 ] | ||||||||||||||
|
The flame graph makes the problem obvious: all time is spent in __wt_session_copy_values. | ||||||||||||||
| Comment by Mark Callaghan (Inactive) [ 17/Apr/20 ] | ||||||||||||||
|
Normally my scripts save ftdc, but the way I ran them this time I don't have ftdc. I do have flamegraphs, though, and they look useful: 80 from mo425 and ~200 from mo44pre and mo440rc0. I looked at 3 for each, selected randomly. They are attached, and in 4.4 mongo::WriteUnitOfWork::commit and what it calls use a lot more CPU; perhaps __wt_session_copy_values is a problem. | ||||||||||||||
| Comment by Geert Bosch [ 17/Apr/20 ] | ||||||||||||||
|
The csec and dsec columns are exactly equal. That can't be right! One thing I wonder about is system vs user CPU usage. Also, for the cpupi scale, is 1 cpupi a microsecond for 1 logical CPU or for all logical CPUs? If I look at cutil vs dutil, it looks like CPU usage is not really the issue, but rather disk usage. Still, how is cutil zero while cpups is 26? Is cpups a percentage?

This ticket says 3X CPU usage, but if the CPU is largely idle that seems irrelevant. Is it that we do more fsync or similar? We did change a number of things around journaling, and I'd not be surprised if that caused issues. In particular, single-node replica sets are somewhat strange beasts: they're a replica set, but at the same time don't actually need to confirm majorities etc. So, I'd be unsurprised if that's an issue, and it would also be less worrying/important. How do these tests look for a 3-node replset? | ||||||||||||||
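The unit question above matters by a factor of ncpu. A small sketch of the two interpretations, with invented numbers (vmstat conventionally reports CPU as a percentage of all logical CPUs):

```python
# Hypothetical illustration of the cpupi unit question: the same "26" CPU
# figure yields per-row costs that differ by a factor of ncpu depending on
# whether it is a percentage of the whole machine or of one logical CPU.
# All numbers here are invented for the example.

def cpu_us_per_row(cpu_pct, ncpu, elapsed_secs, rows, per_machine=True):
    """CPU microseconds consumed per indexed row.

    per_machine=True:  cpu_pct is % of all logical CPUs (vmstat convention).
    per_machine=False: cpu_pct is % of a single logical CPU.
    """
    busy_cpus = (cpu_pct / 100.0) * (ncpu if per_machine else 1)
    return busy_cpus * elapsed_secs * 1e6 / rows

# The same 26% figure read both ways on a 16-CPU box:
machine = cpu_us_per_row(26, ncpu=16, elapsed_secs=1901, rows=1.0e9)
single = cpu_us_per_row(26, ncpu=16, elapsed_secs=1901, rows=1.0e9,
                        per_machine=False)
```

Without pinning down which convention the table uses, a "3X CPU" claim can describe either a saturated core or a mostly idle machine.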
| Comment by Mark Callaghan (Inactive) [ 17/Apr/20 ] | ||||||||||||||
|
Assigned to Geert because Jira required that I assign it to someone, and we discussed this earlier. Please reassign. |