[SERVER-47615] Investigate cause of index builds spending more CPU time in WT transaction commit Created: 17/Apr/20  Updated: 27/Oct/23  Resolved: 23/Apr/20

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: 4.4.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mark Callaghan (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mo425.Apr16.085309.rep     File mo425.Apr16.085309.svg     File mo425.Apr20.222253.svg     File mo440rc0.Apr16.001809.rep     File mo440rc0.Apr16.001809.svg     File mo440rc0.Apr20.185022.svg     File mo440rc1.Apr20.131341.svg     File mo44pre.Apr15.180608.svg     File mo44pre.Apr21.145849.svg    
Assigned Teams:
Storage Execution
Operating System: ALL
Participants:

 Description   

Update: This regression was fixed as of 4.4.0-rc1 by SERVER-47407. This ticket tracks the investigation of the initial regression.

By "much slower" I mean that it takes about 2X longer to create indexes for Linkbench in 4.4 compared to 4.2. Linkbench is in DSI, although I need to make a few changes to it so that the secondary index is created after the load as done here.

By 4.4 I mean both 4.4.0-rc0 and a build from the v4.4 branch prior to the merge of durable history.

From vmstat and iostat I see that 4.4 uses ~3X more CPU than 4.2. Both 4.2 and 4.4 do a similar amount of IO to storage – both read & write.

By "IO-bound" I mean that the test server has 16G of RAM and the indexed collection is ~80G per db.$collection.stats and mongod was setup as a single-node replicaset.

I will provide ftdc and CPU flamegraphs tomorrow. For now I start with a summary of performance data collected during the create index.

Results are provided for MongoDB versions 4.2.5, 4.4.0-rc0 and 4.4pre, where 4.4pre is a build from the v4.4 branch immediately prior to the merge of durable history.

The first table is mostly data from iostat and vmstat, explained under "Index, Per Operation" here. These are per-second performance counters divided by the insert rate, to estimate the overhead per inserted row.

From this table, the value of "secs" is the number of seconds for the create index, and it is ~3X larger for 4.4pre and 4.4.0-rc0 vs 4.2.5. The CPU consumption per indexed row (cpupi) is also ~3X larger for the 4.4 binaries vs 4.2.5. But the amount of data read from and written to storage per row (rkbpi, wkbpi) is similar between 4.4 and 4.2, so this looks like a new CPU problem. A worked example of how these per-row values relate to the per-second counters follows the table.

ips     secs    rpi     rkbpi   wkbpi   cspi    cpupi   csecpq  dsecpq  csec    dsec    dbgb    cnf
311740  2440    0.001   0.133   0.067   0.0     84      0.0     3.2     0       2421    209.8   mo425.c5
107679  7064    0.002   0.132   0.066   0.0     239     0.0     9.3     0       7109    219.8   mo44pre.c5
121296  6271    0.001   0.144   0.066   0.0     218     0.0     8.4     0       6402    264.4   mo440rc0.c5
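To make the normalization concrete (this is my reading of the metrics, so treat the exact units as assumptions: cpupi as CPU microseconds per indexed row, cpups as CPU seconds consumed per second of wall-clock time), each per-row value is roughly the matching per-second counter from the table below divided by ips. For mo425.c5:

    cpupi ~= cpups / ips = 26.1 / 311740 ~= 84 microseconds of CPU per indexed row
    rkbpi ~= rmbps * 1024 / ips = 40 * 1024 / 311740 ~= 0.13 KB read per indexed row

The 4.4 rows work out the same way, e.g. 25.7 / 107679 ~= 239 for mo44pre.c5, which is where the ~3X cpupi regression comes from.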

The next table is mostly data from iostat and vmstat, explained under "Index, Per Second" here. Most of these are the average per-second values for the counters from vmstat and iostat. From this I see that the IO rates to and from storage (rmbps, wmbps) are much higher for 4.2.5, and one reason is that 4.4 spends more CPU in between doing IO.

ips     secs    rps     rmbps   wmbps   csps    cpups   cutil   dutil   vsz     rss     cnf
311740  2440    448     40      21      1518    26.1    0.000   0.992   13.9    12.0    mo425.c5
107679  7064    181     14      7       1670    25.7    0.000   1.006   15.7    13.6    mo44pre.c5
121296  6271    141     17      8       1911    26.5    0.000   1.021   17.2    14.7    mo440rc0.c5



 Comments   
Comment by Mark Callaghan (Inactive) [ 23/Apr/20 ]

The ~30% perf improvement is nice. I will try to remember it later when I summarize the perf results I report for 4.0, 4.2 and 4.4.

Comment by Geert Bosch [ 23/Apr/20 ]

Hi Mark, thanks for sharing the updated data. An interesting point to note between the mo425 and mo44pre flame graphs is the time spent in the sorter (NoLimitSorter). In 4.2.5 most of that time was spent comparing items using BSONWoCompare; now that time has completely disappeared, and the Snappy compression done while writing the output file becomes a far greater percentage of the total time spent. So the work last summer of moving the external sorter to KeyString is evident here in the performance improvements.
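For context, a rough C sketch of the difference (hypothetical code, not MongoDB's actual KeyString or BSONWoCompare implementation; compare_encoded_keys is an invented name): a KeyString-encoded key is laid out so that byte order equals sort order, so each sorter comparison is one memcmp, whereas a BSONWoCompare-style comparison must decode elements, dispatch on BSON type and recurse into sub-documents on every call.

#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: order two pre-encoded keys with a single memcmp. */
static int
compare_encoded_keys(const unsigned char *a, size_t alen,
                     const unsigned char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int cmp = memcmp(a, b, n);

    if (cmp != 0)
        return (cmp);
    /* When one key is a prefix of the other, the shorter key sorts first. */
    return ((alen > blen) - (alen < blen));
}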

Comment by Mark Callaghan (Inactive) [ 21/Apr/20 ]

You are welcome to close this, and I am happy to answer more questions.

Comment by Mark Callaghan (Inactive) [ 21/Apr/20 ]

The test ended sooner than I expected. These results are from the same server, after I fixed the problem explained here. The IO-bound create index in 4.4 is faster than in 4.2 – both before and after durable history was merged.

From an efficiency perspective, all versions tested use a similar amount of CPU, but 4.4 with durable history writes ~10X more to storage. In this case my storage wasn't the bottleneck, so it didn't affect the create index time, but it is still a problem. That isn't the target for this ticket, so I will address it elsewhere.

A full guide to the metrics is here

The first table has many metrics normalized by (divided by) the insert rate (ips), and wkbpi – KB written to storage per indexed row – is a problem for mo440rc[01].

ips     secs    rpi     rkbpi   wkbpi   cspi    cpupi   csecpq  dsecpq  csec    dsec    dbgb    cnf
296316  2567    0.002   0.133   0.089   0.0     85      0.0     3.2     0       2467    214.9   mo425.c5
387688  1962    0.002   0.132   0.076   0.0     65      0.0     2.5     0       1887    215.8   mo44pre.c5
388679  1957    0.001   0.145   0.724   0.0     69      0.0     2.6     0       2010    270.9   mo440rc0.c5
395963  1921    0.001   0.145   0.631   0.0     67      0.0     2.6     0       1947    271.0   mo440rc1.c5

The next table has many of the same metrics reported as per-second rates, and wmbps (storage MB written/s, from iostat) is a problem for mo440rc[01]. vsz and rss (virtual and resident size of the mongod process) are also problems. An arithmetic cross-check of wkbpi against wmbps follows the table.

ips     secs    rps     rmbps   wmbps   csps    cpups   cutil   dutil   vsz     rss     cnf
296316  2567    637     38      26      1830    25.1    0.000   0.961   14.1    12.2    mo425.c5
387688  1962    841     50      29      2369    25.2    0.000   0.962   14.1    12.2    mo44pre.c5
388679  1957    496     55      275     2353    26.9    0.000   1.027   18.7    15.2    mo440rc0.c5
395963  1921    501     56      244     2698    26.7    0.000   1.014   18.5    14.9    mo440rc1.c5
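As a sanity check on the write-amplification claim (my arithmetic, assuming wkbpi is KB written to storage per indexed row and wmbps is MB written per second), the two tables agree:

    mo440rc0.c5: wkbpi ~= wmbps * 1024 / ips = 275 * 1024 / 388679 ~= 0.72 KB written per indexed row
    mo425.c5:    wkbpi ~= wmbps * 1024 / ips =  26 * 1024 / 296316 ~= 0.09 KB written per indexed row

That is roughly 8X more bytes written per indexed row with durable history, in line with the ~10X figure above.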

I will add a flamegraph for each version, collected at the same point in the index build. The three 4.4 versions look similar to each other, but different from the 4.2.5 flamegraph.

mo425.Apr20.222253.svg
mo440rc0.Apr20.185022.svg
mo440rc1.Apr20.131341.svg
mo44pre.Apr21.145849.svg

Comment by Mark Callaghan (Inactive) [ 21/Apr/20 ]

I am doing one more test, so I won't have results to share until tomorrow. The big news is that this was a bogus bug report. One of my test servers had a slow SSD, and the lousy results for mo44pre and mo440rc0 above were from that server. That has since been fixed. For the story of the slow SSD, read this.

Comment by Mark Callaghan (Inactive) [ 20/Apr/20 ]

geert.bosch - I will provide that on Tuesday

Comment by Geert Bosch [ 20/Apr/20 ]

Mark, great news on rc1. I'm still concerned about __wt_session_copy_values though, as in the normal, non-index-build case we use WT transactions all the time. Question: could you add the flame graph for RC1? I'd like to compare it with 4.2.

Comment by Daniel Pasette (Inactive) [ 18/Apr/20 ]

That’s a healthy perf gain over 4.2. Thanks for checking, Mark.
I’m a bit concerned this didn’t show up on any of our regular CI perf tests.

Comment by Mark Callaghan (Inactive) [ 18/Apr/20 ]

Good news: perf is better with 4.4.0-rc1. The secs column below is the number of seconds for the create index, and it is 1901 for 4.4.0-rc1, which fixes the 4.4 regression and is even faster than 4.2.5.

I will let louis.williams decide when to close this.

ips     secs    rpi     rkbpi   wkbpi   cspi    cpupi   csecpq  dsecpq  csec    dsec    dbgb    cnf
311740  2440    0.001   0.133   0.067   0.0     84      0.0     3.2     0       2421    209.8   mo425.c5
107679  7064    0.002   0.132   0.066   0.0     239     0.0     9.3     0       7109    219.8   mo44pre.c5
121296  6271    0.001   0.144   0.066   0.0     218     0.0     8.4     0       6402    264.4   mo440rc0.c5
400129  1901    0.001   0.145   0.067   0.0     66      0.0     2.6     0       1953    264.4   mo440rc1.c5

Comment by Mark Callaghan (Inactive) [ 17/Apr/20 ]

Added ascii (perf report --stdio) output that led to the flamegraphs for 4.2.5 and 4.4.0-rc0. WriteUnitOfWork is in there for 4.2.5 but is not significant.

As for the optimism that this is fixed in 4.4.0-rc1, I will repeat the test for that soon.

Comment by Louis Williams [ 17/Apr/20 ]

It's possible the regression has gone away in 4.4.0-rc1 as of SERVER-47407, which avoids the WriteUnitOfWork for index builds. The reason why we don't need the WUOW is that during the collection scan phase, index builds write to the external sorter, which is not part of the storage engine. That WUOW has effectively been a no-op since 4.2 with hybrid index builds.

I would still like to understand what changed between 4.2 and 4.4 that make the commit() take longer. For example, does __wt_session_copy_values show up in the 4.2 flame graph at all?

Comment by Mark Callaghan (Inactive) [ 17/Apr/20 ]

Here is the preprocessor output from my mo440rc0 build; HAVE_DIAGNOSTIC is not defined:

int
__wt_session_copy_values(WT_SESSION_IMPL *session)
{
    WT_CURSOR *cursor;
 
    for ((cursor) = (((&session->cursors))->tqh_first); (cursor); (cursor) = (((cursor))->q.tqe_next))
        if (((((cursor)->flags) & (0x100000u)) != 0)) {
# 140 "src/third_party/wiredtiger/src/session/session_api.c"
            do { int __ret; if ((__ret = (__cursor_localvalue(cursor))) != 0) return (__ret); } while (0);
        }
 
    return (0);
}

Comment by Mark Callaghan (Inactive) [ 17/Apr/20 ]

Unless HAVE_DIAGNOSTIC is defined for one build but not the other, the code has not changed. The loop's cost is proportional to the number of cursors on the session, so perhaps session->cursors has more entries in 4.4.
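To illustrate that point, a tiny hypothetical C sketch (not WiredTiger code; fake_cursor and copy_values_pass are invented names): the per-commit cost of a pass like this is linear in the number of cursors attached to the session, even when the per-cursor work is trivial and the transaction wrote nothing.

#include <stddef.h>

struct fake_cursor {
    unsigned flags;
    struct fake_cursor *next;
};

/* One commit-time pass over the session's cursor list; the work done is
 * proportional to the number of cursors, not to how much data the
 * transaction actually wrote. */
static size_t
copy_values_pass(struct fake_cursor *head, unsigned flag_mask)
{
    size_t touched = 0;
    struct fake_cursor *c;

    for (c = head; c != NULL; c = c->next)
        if ((c->flags & flag_mask) != 0)
            touched++; /* stand-in for __cursor_localvalue(cursor) */
    return (touched);
}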

Comment by Geert Bosch [ 17/Apr/20 ]

The flame graph makes the problem obvious: all time is spent in __wt_session_copy_values.

Comment by Mark Callaghan (Inactive) [ 17/Apr/20 ]

Normally my scripts save ftdc, but the way I ran them this time I don't have ftdc. But I do have flamegraphs and they look useful.

I have 80 flamegraphs from mo425 and ~200 from mo44pre and mo440rc0. I looked at 3 for each, selected randomly.
1) The mo425 graphs are all similar.
2) The mo44* graphs are different from mo425 and similar to each other.

They are attached. In 4.4, mongo::WriteUnitOfWork::commit and what it calls use a lot more CPU, and __wt_session_copy_values may be part of the problem.

Comment by Geert Bosch [ 17/Apr/20 ]

The csec and dsec columns are exactly equal. That can't be right! One thing I wonder about is system vs user CPU usage. Also for the cpupi scale, is 1 cpupi a microsecond for 1 logical CPU or for all logical CPUs? 

If I look at cutil vs dutil, it looks like CPU usage is not really the issue, but rather disk usage. Still, how is cutil zero while cpups is 26? Is cpups a percentage? This ticket says 3X CPU usage, but if the CPU is largely idle that seems irrelevant. Is it that we do more fsync or similar? We did change a number of things around journaling, and I'd not be surprised if that caused issues. In particular, single-node replica-sets are somewhat strange beasts, as they're a replica-set, but at the same time don't actually need to confirm majorities etc. So, I'd be unsurprised if that's an issue, and it would also be less worrying/important. How do these tests look for a 3-node replset?

Comment by Mark Callaghan (Inactive) [ 17/Apr/20 ]

Assigned to Geert because Jira required that I assign it to someone, and we discussed this earlier. Please reassign.
