[SERVER-81191] Checking array field names in IDL is expensive Created: 19/Sep/23  Updated: 09/Nov/23  Resolved: 06/Nov/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.2.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Matthew Russotto Assignee: Patrick Freed
Resolution: Fixed Votes: 0
Labels: perf-tiger, perf-tiger-handoff, perf-tiger-triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
causes SERVER-82983 Fix ambiguity formatting DecimalCount... Closed
Assigned Teams:
Service Arch
Backwards Compatibility: Fully Compatible
Sprint: Service Arch 2023-10-02, Service Arch 2023-10-16, Service Arch 2023-10-30, Service Arch 2023-11-13
Participants:

 Description   

When we parse a BSON array in IDL, we check that all the field names are as expected using the general purpose NumberParser class, and for larger arrays this can take significant time. For an example, here is the parsing of an oplog entry using code from SERVER-81101 with NumberParser:

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ParseOplogEntryWithNoStatementId                204 ns          204 ns      2681374
BM_ParseOplogEntryWithOneStatementId               239 ns          239 ns      2939634
BM_ParseOplogEntryWithMultiStatementId/2           328 ns          328 ns      2138516
BM_ParseOplogEntryWithMultiStatementId/8           498 ns          498 ns      1405159
BM_ParseOplogEntryWithMultiStatementId/64         1985 ns         1985 ns       352795
BM_ParseOplogEntryWithMultiStatementId/512       14392 ns        14392 ns        48773
BM_ParseOplogEntryWithMultiStatementId/1000      27760 ns        27760 ns        25179

(The parameter is the number of entries in the statementID array; the first two benchmarks are not using arrays.)

Here is the same benchmark with all array fieldname checking removed

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ParseOplogEntryWithNoStatementId                201 ns          201 ns      2642658
BM_ParseOplogEntryWithOneStatementId               236 ns          236 ns      2963950
BM_ParseOplogEntryWithMultiStatementId/2           287 ns          287 ns      2437392
BM_ParseOplogEntryWithMultiStatementId/8           399 ns          399 ns      1756477
BM_ParseOplogEntryWithMultiStatementId/64         1035 ns         1035 ns       676153
BM_ParseOplogEntryWithMultiStatementId/512        6088 ns         6088 ns       115098
BM_ParseOplogEntryWithMultiStatementId/1000      11527 ns        11526 ns        60909

And here is the code using the C++ "std::from_chars" method

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ParseOplogEntryWithNoStatementId                204 ns          204 ns      2627978
BM_ParseOplogEntryWithOneStatementId               237 ns          237 ns      2951225
BM_ParseOplogEntryWithMultiStatementId/2           298 ns          298 ns      2350274
BM_ParseOplogEntryWithMultiStatementId/8           401 ns          401 ns      1743373
BM_ParseOplogEntryWithMultiStatementId/64         1138 ns         1138 ns       614838
BM_ParseOplogEntryWithMultiStatementId/512        7032 ns         7032 ns        99665
BM_ParseOplogEntryWithMultiStatementId/1000      13535 ns        13534 ns        51844

I tried a few other things like encoding the expected field number and comparing that, and incrementing the expected field number represented as a string; they weren't faster than from_chars.

These timings are on my Intel workstation so not too precise, but the differences are significant.

Ideally we wouldn't even have field names in BSON arrays but I think that ship has long since sailed.



 Comments   
Comment by Githook User [ 03/Nov/23 ]

Author:

{'name': 'Patrick Freed', 'email': 'patrick.freed@mongodb.com', 'username': 'patrickfreed'}

Message: SERVER-81191 Improve performance of BSON array key validation
Branch: master
https://github.com/mongodb/mongo/commit/6edac6aaa7eed19b950667875f7a5384b486be27

Comment by Patrick Freed [ 05/Oct/23 ]

I reran the benchmarks on my Intel machine, this time with gcc, and I am seeing similar results now, sorry for that oversight. I'm curious why there's such a discrepancy here, I was under the impression that we used the same libstdc++ on both gcc and clang, except if the --libc++ flag were passed to the scons invocation. Given that, I would've expected these to be closer, but I guess gcc just has some better optimization for this particular path.

Given that it's relatively close when built with clang and dramatically faster when built with gcc, it seems prudent to go with from_chars after all then, at least for integers. The release binaries are built with gcc, so that's what matters more anyways. I'll revisit the actual implementation tomorrow.

Comment by Matthew Russotto [ 04/Oct/23 ]

Running your benchmark on an ARM virtual workstation I get

BM_NumberParserInteger10kPositive/1            20.4 ns         20.4 ns     34266149 bytes_per_second=186.687M/s
BM_NumberParserInteger10kPositive/8             171 ns          171 ns      4089289 bytes_per_second=173.202M/s
BM_NumberParserInteger10kPositive/64           1306 ns         1306 ns       535909 bytes_per_second=181.811M/s
BM_NumberParserInteger10kPositive/512         10365 ns        10365 ns        67545 bytes_per_second=183M/s
BM_NumberParserInteger10kPositive/4096        82790 ns        82792 ns         8457 bytes_per_second=183.359M/s
BM_NumberParserInteger10kPositive/32768      662156 ns       662165 ns         1057 bytes_per_second=183.545M/s
BM_NumberParserInteger10kPositive/100000    2022428 ns      2022455 ns          346 bytes_per_second=183.356M/s
BM_from_charsInteger10kPositive/1              10.8 ns         10.8 ns     64718027 bytes_per_second=352.738M/s
BM_from_charsInteger10kPositive/8              78.9 ns         78.9 ns      8870534 bytes_per_second=374.665M/s
BM_from_charsInteger10kPositive/64              675 ns          675 ns      1036389 bytes_per_second=351.559M/s
BM_from_charsInteger10kPositive/512            5381 ns         5381 ns       130195 bytes_per_second=352.518M/s
BM_from_charsInteger10kPositive/4096          43142 ns        43142 ns        16224 bytes_per_second=351.875M/s
BM_from_charsInteger10kPositive/32768        344860 ns       344859 ns         2030 bytes_per_second=352.425M/s
BM_from_charsInteger10kPositive/100000      1052597 ns      1052611 ns          665 bytes_per_second=352.295M/s

which I believe roughly matches what I got before with from_chars being a LOT faster for small integers.

It's also dramatically faster for full range integers:

BM_NumberParserIntegerMax/1                    39.2 ns         39.2 ns     17839153 bytes_per_second=243.104M/s
BM_NumberParserIntegerMax/8                     299 ns          299 ns      2344642 bytes_per_second=261.649M/s
BM_NumberParserIntegerMax/64                   2426 ns         2426 ns       288183 bytes_per_second=251.179M/s
BM_NumberParserIntegerMax/512                 20563 ns        20563 ns        33988 bytes_per_second=235.093M/s
BM_NumberParserIntegerMax/4096               172544 ns       172546 ns         4049 bytes_per_second=225.317M/s
BM_NumberParserIntegerMax/32768             1388746 ns      1388765 ns          503 bytes_per_second=224.506M/s
BM_NumberParserIntegerMax/100000            4244351 ns      4244411 ns          165 bytes_per_second=224.248M/s
BM_from_charsIntegerMax/1                      22.8 ns         22.8 ns     30647355 bytes_per_second=417.579M/s
BM_from_charsIntegerMax/8                       185 ns          185 ns      3766580 bytes_per_second=421.715M/s
BM_from_charsIntegerMax/64                     1652 ns         1652 ns       421821 bytes_per_second=368.904M/s
BM_from_charsIntegerMax/512                   14213 ns        14213 ns        49228 bytes_per_second=340.119M/s
BM_from_charsIntegerMax/4096                 115429 ns       115428 ns         6065 bytes_per_second=336.81M/s
BM_from_charsIntegerMax/32768                922999 ns       923012 ns          758 bytes_per_second=337.792M/s
BM_from_charsIntegerMax/100000              2812705 ns      2812670 ns          249 bytes_per_second=338.398M/s

This is with compile options taken from the perf project build. With the --opt build style (which is a clang build, not GCC), I get different results:

BM_NumberParserInteger10kPositive/1            23.2 ns         23.2 ns     30117302 bytes_per_second=164.127M/s
BM_NumberParserInteger10kPositive/8             201 ns          201 ns      3489389 bytes_per_second=147.324M/s
BM_NumberParserInteger10kPositive/64           1525 ns         1525 ns       459050 bytes_per_second=155.726M/s
BM_NumberParserInteger10kPositive/512         12084 ns        12084 ns        57920 bytes_per_second=156.97M/s
BM_NumberParserInteger10kPositive/4096        96591 ns        96591 ns         7247 bytes_per_second=157.163M/s
BM_NumberParserInteger10kPositive/32768      772679 ns       772666 ns          906 bytes_per_second=157.296M/s
BM_NumberParserInteger10kPositive/100000    2359209 ns      2359190 ns          297 bytes_per_second=157.185M/s
BM_from_charsInteger10kPositive/1              18.9 ns         18.9 ns     37082976 bytes_per_second=202.107M/s
BM_from_charsInteger10kPositive/8               140 ns          140 ns      4988931 bytes_per_second=210.717M/s
BM_from_charsInteger10kPositive/64             1178 ns         1178 ns       594116 bytes_per_second=201.579M/s
BM_from_charsInteger10kPositive/512            9649 ns         9649 ns        72592 bytes_per_second=196.588M/s
BM_from_charsInteger10kPositive/4096          77384 ns        77382 ns         9044 bytes_per_second=196.176M/s
BM_from_charsInteger10kPositive/32768        617912 ns       617911 ns         1132 bytes_per_second=196.69M/s
BM_from_charsInteger10kPositive/100000      1884988 ns      1884962 ns          371 bytes_per_second=196.731M/s
BM_NumberParserIntegerMax/1                    44.3 ns         44.3 ns     15806660 bytes_per_second=215.291M/s
BM_NumberParserIntegerMax/8                     371 ns          371 ns      1887535 bytes_per_second=210.844M/s
BM_NumberParserIntegerMax/64                   2970 ns         2970 ns       235626 bytes_per_second=205.201M/s
BM_NumberParserIntegerMax/512                 23735 ns        23735 ns        29471 bytes_per_second=203.669M/s
BM_NumberParserIntegerMax/4096               191020 ns       191018 ns         3665 bytes_per_second=203.527M/s
BM_NumberParserIntegerMax/32768             1532075 ns      1532043 ns          457 bytes_per_second=203.51M/s
BM_NumberParserIntegerMax/100000            4676518 ns      4676405 ns          150 bytes_per_second=203.533M/s
BM_from_charsIntegerMax/1                      33.8 ns         33.8 ns     20738175 bytes_per_second=282.423M/s
BM_from_charsIntegerMax/8                       260 ns          260 ns      2690719 bytes_per_second=300.642M/s
BM_from_charsIntegerMax/64                     2022 ns         2022 ns       346540 bytes_per_second=301.452M/s
BM_from_charsIntegerMax/512                   17074 ns        17074 ns        40778 bytes_per_second=283.126M/s
BM_from_charsIntegerMax/4096                 160745 ns       160742 ns         4351 bytes_per_second=241.863M/s
BM_from_charsIntegerMax/32768               1296941 ns      1296894 ns          540 bytes_per_second=240.41M/s
BM_from_charsIntegerMax/100000              3960666 ns      3960637 ns          177 bytes_per_second=240.315M/s

NumberParser is worse than on gcc, but from_chars is dramatically worse. Both of these should be using the v4 toolchain (though one is clang andthe other is gcc)

Comment by Patrick Freed [ 04/Oct/23 ]

So I looked into this a bit more, and it's looking like from_chars isn't a guaranteed win for general purpose number parsing unfortunately, at least not in the version provided in the v4 toolchain's libstdc++. I ran a simple benchmark which converted strings to integers (randomly generated from within the min/max bounds of int), and it actually looks like the existing parser is faster for that use case.

----------------------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------
BM_NumberParserIntegerMax/1                    17.5 ns         17.5 ns     40072621 bytes_per_second=544.335M/s
BM_NumberParserIntegerMax/8                     143 ns          143 ns      4884838 bytes_per_second=545.644M/s
BM_NumberParserIntegerMax/64                   1088 ns         1088 ns       642737 bytes_per_second=559.969M/s
BM_NumberParserIntegerMax/512                 10968 ns        10968 ns        63812 bytes_per_second=440.756M/s
BM_NumberParserIntegerMax/4096               109159 ns       109158 ns         6400 bytes_per_second=356.157M/s
BM_NumberParserIntegerMax/32768              906534 ns       906518 ns          773 bytes_per_second=343.938M/s
BM_NumberParserIntegerMax/100000            2768024 ns      2768040 ns          253 bytes_per_second=343.854M/s
 
BM_from_charsIntegerMax/1                      26.3 ns         26.3 ns     26638762 bytes_per_second=362.938M/s
BM_from_charsIntegerMax/8                       206 ns          206 ns      3396231 bytes_per_second=379.451M/s
BM_from_charsIntegerMax/64                     1572 ns         1572 ns       445476 bytes_per_second=387.759M/s
BM_from_charsIntegerMax/512                   12504 ns        12504 ns        55920 bytes_per_second=386.599M/s
BM_from_charsIntegerMax/4096                 121933 ns       121932 ns         5735 bytes_per_second=318.847M/s
BM_from_charsIntegerMax/32768               1029088 ns      1029072 ns          683 bytes_per_second=302.977M/s
BM_from_charsIntegerMax/100000              3135294 ns      3135152 ns          224 bytes_per_second=303.59M/s

I was able to partially reproduce the results in the ticket description if I changed the benchmark to parsing the string versions of 0 to n (similar to array indices) though:

----------------------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------
BM_NumberParserInteger/1            9.87 ns         9.87 ns     70846793 bytes_per_second=96.5961M/s
BM_NumberParserInteger/8            76.9 ns         76.9 ns      9163216 bytes_per_second=99.1618M/s
BM_NumberParserInteger/64            646 ns          646 ns      1084237 bytes_per_second=174.112M/s
BM_NumberParserInteger/512          5534 ns         5534 ns       126121 bytes_per_second=245.758M/s
BM_NumberParserInteger/4096        46169 ns        46170 ns        15147 bytes_per_second=315.497M/s
BM_NumberParserInteger/32768      393775 ns       393766 ns         1774 bytes_per_second=369.901M/s
BM_NumberParserInteger/100000    1253548 ns      1253536 ns          557 bytes_per_second=371.941M/s
 
BM_from_charsInteger/1              7.48 ns         7.48 ns     93559333 bytes_per_second=127.497M/s
BM_from_charsInteger/8              57.7 ns         57.7 ns     12128269 bytes_per_second=132.156M/s
BM_from_charsInteger/64              563 ns          563 ns      1246038 bytes_per_second=199.962M/s
BM_from_charsInteger/512            5055 ns         5055 ns       138518 bytes_per_second=269.039M/s
BM_from_charsInteger/4096          46354 ns        46351 ns        14598 bytes_per_second=314.26M/s
BM_from_charsInteger/32768        438905 ns       438902 ns         1597 bytes_per_second=331.861M/s
BM_from_charsInteger/100000      1393239 ns      1393207 ns          502 bytes_per_second=334.654M/s

I don't know why this is the case, but I guess from_chars is is better for converting smaller strings/numbers than our existing implementation is.

One thing I noticed is that more recent versions of libstdc++ include performance optimizations for from_chars, which might explain what we're seeing here: https://github.com/gcc-mirror/gcc/commit/a54137c88061c7495728fc6b8dfd0474e812b2cb#diff-8ae63de6891112977cbd61ab5c02264e1dd6dc079ad717416ba03e689bbd1c81 Per the tags on that commit, it looks like we'll need gcc 12.1 to take advantage of them, though, and the v4 toolchain is currently on 11.3.

In light of all this, I think what I would suggest is that rather than updating NumberParser's implementation to use from_chars, we should consider maybe changing the implementation of our BSON array parsing in particular to use from_chars, since its a code path that from_chars is better at. We could also consider just not validating array indexes here at all--there's prior art for this in the BSON corpus driver specification: drivers are intended to be able to parse invalid array keys (https://github.com/mongodb/specifications/blob/master/source/bson-corpus/bson-corpus.rst#testing-validity). If we're okay with the latter, then that would probably be the easiest and fastest.

In the future, we can revisit updating the NumberParser implementation again to see if newer libstdc++ versions yield better results.

benchmark code: https://github.com/patrickfreed/mongo/blob/parse-number-bm-old/src/mongo/base/parse_number_bm.cpp

Comment by Patrick Freed [ 20/Sep/23 ]

This is closely related to SERVER-78480, which is focused on the extra string allocations we need to do during double parsing to ensure the input to strtod is null-terminated. std::from_chars notably does not have this requirement, so I'm updating the double parsing path to use it instead of strtod, and since it looks like there's perf wins to be had from updating the integer parsing path to use std::from_chars too, I'll just go ahead and do both. We can still decide to not validate array indices, though.

matthew.russotto@mongodb.com Can you share the benchmark code you were using to produce these numbers? I want to make sure my updated version matches what you're seeing.

Comment by Billy Donahue [ 19/Sep/23 ]

Really interesting.

I'm surprised std::from_chars is so much faster than NumberParser, which predates it. As an aside, perhaps NumberParser should lean on the new no-locale C++ numeric parse routines more.

I guess the heart of the ticket is whether it is safe to ignore BSON array indices? Seems like it should be ok in practice, but it's kind of a subjective question.

Using std::from_chars approaches the performance of removing the checks.
std::from_chars is adding a factor of about ~1.2x instead of the factor of ~2.5x from NumberParser. NumberParser obviously needs to be much faster.

Our NumberParser uses an internal parseMagnitudeFromStringWithBase, which is doing overflow protection during the parse. Unnecessary in this case, and we should probably have an option to deselect it. We might be suffering performance from returning a StatusWith as well. Exceptions would be a better choice here since array indices are presumably reliably formatted to be valid integers.

Generated at Thu Feb 08 06:45:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.