[SERVER-81098] Make statementId in oplog faster Created: 15/Sep/23  Updated: 24/Jan/24  Resolved: 11/Nov/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.3.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Matthew Russotto Assignee: Matthew Russotto
Resolution: Fixed Votes: 0
Labels: perf-8.0, perf-tiger, perf-tiger-handoff, perf-tiger-q4
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
causes SERVER-83198 createOplogEntryWithNStatementIds use... Closed
Related
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Sprint: Repl 2023-11-13
Participants:
Linked BF Score: 155

 Description   

For historical reasons the statement ID in the oplog can be absent, a single integer, or an array of integers. We currently represent this in the IDL class with a boost::optional<std::variant<int, vector<int>>>, which is inconvenient and inefficient. We have accessors which convert this to vector<int>, which makes it more convenient but means we do a heap allocation every time we access statement ID.

Instead of doing this, we should represent the value in the IDL class with a vector or boost_smallvector, and the accessors can provide references. On serialization we can convert to absent/BSON integer/BSON vector depending on the number of values in the vector (we currently do this in the convenience layer).



 Comments   
Comment by Githook User [ 11/Nov/23 ]

Author:

{'name': 'Matthew Russotto', 'email': 'matthew.russotto@mongodb.com', 'username': 'mtrussotto'}

Message: SERVER-81098 Make statementId in oplog_entry IDL faster
Branch: master
https://github.com/mongodb/mongo/commit/059e3ed3b43d8c1408382accb9cf010da87dd8d3

Comment by Matthew Russotto [ 09/Nov/23 ]

CPU Caches:
  L1 Data 64 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 32768 KiB (x1)
Load Average: 5.14, 3.10, 2.28
-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_ParseOplogEntryWithNStatementIds/0           538 ns          538 ns      1297033
BM_ParseOplogEntryWithNStatementIds/0           537 ns          537 ns      1298677
BM_ParseOplogEntryWithNStatementIds/1           579 ns          579 ns      1209259
BM_ParseOplogEntryWithNStatementIds/8           735 ns          735 ns       946716
BM_ParseOplogEntryWithNStatementIds/64         1512 ns         1511 ns       463196
BM_ParseOplogEntryWithNStatementIds/512        7938 ns         7937 ns        88170
BM_ParseOplogEntryWithNStatementIds/1024      15349 ns        15349 ns        45595
BM_AccessStatementIdsForOplogEntry/0           1.60 ns         1.60 ns    436712438
BM_AccessStatementIdsForOplogEntry/0           1.60 ns         1.60 ns    436667600
BM_AccessStatementIdsForOplogEntry/1           2.00 ns         2.00 ns    349365510
BM_AccessStatementIdsForOplogEntry/8           8.82 ns         8.82 ns     79396648
BM_AccessStatementIdsForOplogEntry/64          53.7 ns         53.7 ns     13031611
BM_AccessStatementIdsForOplogEntry/512          418 ns          418 ns      1676564
BM_AccessStatementIdsForOplogEntry/1024         828 ns          828 ns       845108
BM_GetDurableReplOperationSize/0               5.58 ns         5.58 ns    124832515
BM_GetDurableReplOperationSize/1               5.58 ns         5.58 ns    124866662
BM_GetDurableReplOperationSize/1024            5.59 ns         5.59 ns    125189611

Parsing is slightly improved, access (which reads every element in the statementID array) is somewhat improved, and GetDurableReplOperationSize becomes O(1) as it should be.

The small amount of improvement in access likely means the slowest part was bringing the data into L1 cache; after doing the unnecessary copy, the data was already in L1 cache and so was very fast to access.

Comment by Matthew Russotto [ 08/Nov/23 ]

"before" benchmark on 8.0 (ARM virtual workstation; very approximate)

Run on (8 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 32768 KiB (x1)
Load Average: 2.51, 1.86, 1.75
-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_ParseOplogEntryWithNStatementIds/0           578 ns          578 ns      1208541
BM_ParseOplogEntryWithNStatementIds/0           577 ns          577 ns      1211796
BM_ParseOplogEntryWithNStatementIds/1           627 ns          627 ns      1116443
BM_ParseOplogEntryWithNStatementIds/8           883 ns          883 ns       790502
BM_ParseOplogEntryWithNStatementIds/64         1855 ns         1855 ns       377512
BM_ParseOplogEntryWithNStatementIds/512        9180 ns         9181 ns        76265
BM_ParseOplogEntryWithNStatementIds/1024      17483 ns        17483 ns        39996
BM_AccessStatementIdsForOplogEntry/0           3.70 ns         3.70 ns    189387977
BM_AccessStatementIdsForOplogEntry/0           3.70 ns         3.70 ns    189185313
BM_AccessStatementIdsForOplogEntry/1           18.2 ns         18.2 ns     38462496
BM_AccessStatementIdsForOplogEntry/8           37.8 ns         37.8 ns     18546032
BM_AccessStatementIdsForOplogEntry/64          78.0 ns         78.0 ns      8973191
BM_AccessStatementIdsForOplogEntry/512          406 ns          406 ns      1722758
BM_AccessStatementIdsForOplogEntry/1024         747 ns          747 ns       929811
BM_GetDurableReplOperationSize/0               6.01 ns         6.01 ns    116441444
BM_GetDurableReplOperationSize/1               20.1 ns         20.1 ns     34899806
BM_GetDurableReplOperationSize/1024             246 ns          246 ns      2843034

Note GetDurableOperationSize; this should be a fast O(1) operation but it turns out it is not.

The ParseOplogEntry benchmarks were already vastly improved by SERVER-81191, so I don't expect much more here.

Comment by Matthew Russotto [ 07/Nov/23 ]

It turns out IDL isn't using SERVER-78439 for enums. Trying that and making command type an enum improved times, but that will have to go to another ticket.

BM_ParseOplogEntryWithNStatementIds/0 440 ns 440 ns 1622675

Comment by Matthew Russotto [ 07/Nov/23 ]

Just parsing the command is a bit expensive.
BM_ParseOplogEntryWithNStatementIds/0 595 ns 595 ns 1175837
skipping that:
BM_ParseOplogEntryWithNStatementIds/0 465 ns 465 ns 1505622
(when the oplog entry has the very last command in the list)

If I make that an IDL enum it will take advantage of SERVER-78439, so I'll do that while I'm at it.

Comment by Matthew Russotto [ 18/Sep/23 ]

For instance, exclusive of statement ID checking (a different bug in SERVER-81101), this is a benchmark for TransactionOperation:::addOperations:
Before

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_AddOperations/1            736 ns          739 ns       680419
BM_AddOperations/8           1532 ns         1535 ns       455206
BM_AddOperations/64         10805 ns        10829 ns        64185
BM_AddOperations/512        92698 ns        92809 ns         5812
BM_AddOperations/4096      688478 ns       688644 ns          994
BM_AddOperations/10000    2407828 ns      2408314 ns          341

After

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_AddOperations/1            721 ns          723 ns       692374
BM_AddOperations/8           1418 ns         1421 ns       493558
BM_AddOperations/64          9466 ns         9485 ns        72879
BM_AddOperations/512        75986 ns        76111 ns         9095
BM_AddOperations/4096      598413 ns       598610 ns         1161
BM_AddOperations/10000    1862150 ns      1863212 ns          380

I believe the reason is a call to get statement IDs in getDurableReplOperationSize creates a copy.

Comment by Matthew Russotto [ 15/Sep/23 ]

The new code after I fixed an issue with doing an unnecessary floating point conversion:

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ParseOplogEntryWithNoStatementId               6202 ns         6202 ns       112637
BM_ParseOplogEntryWithOneStatementId              7585 ns         7585 ns        91545
BM_ParseOplogEntryWithMultiStatementId/2          8386 ns         8386 ns        83363
BM_ParseOplogEntryWithMultiStatementId/8         10065 ns        10065 ns        69569
BM_ParseOplogEntryWithMultiStatementId/64        20588 ns        20588 ns        33992
BM_ParseOplogEntryWithMultiStatementId/512       97911 ns        97910 ns         7148
BM_ParseOplogEntryWithMultiStatementId/1000     181393 ns       181390 ns         3858

Same code, also verifying field IDs using NumberParser (like IDL normally does)

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ParseOplogEntryWithNoStatementId               6387 ns         6387 ns        79921
BM_ParseOplogEntryWithOneStatementId              7386 ns         7386 ns        94743
BM_ParseOplogEntryWithMultiStatementId/2          9176 ns         9175 ns        75905
BM_ParseOplogEntryWithMultiStatementId/8         12406 ns        12406 ns        56820
BM_ParseOplogEntryWithMultiStatementId/64        38056 ns        38055 ns        18473
BM_ParseOplogEntryWithMultiStatementId/512      281326 ns       281323 ns         2501
BM_ParseOplogEntryWithMultiStatementId/1000     548234 ns       548228 ns         1250

Same code, verifying field IDs using std::from_chars

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ParseOplogEntryWithNoStatementId               6701 ns         6701 ns        79579
BM_ParseOplogEntryWithOneStatementId              7597 ns         7596 ns        92741
BM_ParseOplogEntryWithMultiStatementId/2          8764 ns         8764 ns        80098
BM_ParseOplogEntryWithMultiStatementId/8         11093 ns        11092 ns        53907
BM_ParseOplogEntryWithMultiStatementId/64        29056 ns        29055 ns        23022
BM_ParseOplogEntryWithMultiStatementId/512      184330 ns       184327 ns         4101
BM_ParseOplogEntryWithMultiStatementId/1000     324621 ns       324614 ns         2144

So there may be some gains to be had with the IDL elsewhere by not verifying array IDs or verifying them faster. But also the new code is strictly faster than the old in the normal case (0-1 statement ids). Need to try boost::small_vector.

Comment by Matthew Russotto [ 15/Sep/23 ]

Same benchmark, current code

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ParseOplogEntryWithNoStatementId               6901 ns         6899 ns        77255
BM_ParseOplogEntryWithOneStatementId              7617 ns         7614 ns        91311
BM_ParseOplogEntryWithMultiStatementId/2         11965 ns        11965 ns        58241
BM_ParseOplogEntryWithMultiStatementId/8         15358 ns        15355 ns        45163
BM_ParseOplogEntryWithMultiStatementId/64        39917 ns        39916 ns        17548
BM_ParseOplogEntryWithMultiStatementId/512      276858 ns       276801 ns         2513
BM_ParseOplogEntryWithMultiStatementId/1000     548410 ns       548374 ns         1273

There's clearly some noise; the NoStatementId version isn't doing anything different.

Generated at Thu Feb 08 06:45:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.