[SERVER-60715] Access of array fields is slow in SBE queries Created: 14/Oct/21  Updated: 27/Oct/23  Resolved: 26/Sep/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Irina Yatsenko (Inactive) Assignee: Mihai Andrei
Resolution: Gone away Votes: 0
Labels: pm2697-m3, sbe
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File sbe-point-array.svg     File sbe-point-scalar.svg    
Issue Links:
Depends
depends on SERVER-60718 Access of sub-fields is slow in SBE q... Closed
Related
related to SERVER-66457 Large $elemMatch queries are slow in SBE Closed
Operating System: ALL
Sprint: QE 2022-10-03
Participants:
Story Points: 0

 Description   

Summary: accessing an array field is ~15-30% slower in SBE compared to the classic engine.

 

Create a collection that contains 10^6 documents like:

{ scalar: Random.randInt(10), array: [Random.randInt(10), Random.randInt(10), Random.randInt(10)] }

Let the collection have the namespace "sbe-perf.LS". From the mongo shell, run two benchmarks against the collection:
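
A minimal shell sketch of one way to generate this dataset (the database and collection names follow the namespace above; the batch size and use of insertMany are arbitrary choices, not part of the original report):

// Populate sbe-perf.LS with 10^6 documents of the shape shown above.
Random.setRandomSeed();
var coll = db.getSiblingDB("sbe-perf").LS;
coll.drop();
var batch = [];
for (var i = 0; i < 1000 * 1000; i++) {
    batch.push({
        scalar: Random.randInt(10),
        array: [Random.randInt(10), Random.randInt(10), Random.randInt(10)]
    });
    if (batch.length === 10000) {
        coll.insertMany(batch);
        batch = [];
    }
}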
 

benchRun({parallel: 1, seconds: 5, ops: [{op:"find", ns:"sbe-perf.LS", query: {scalar:7}, readCmd: true}]})

results:
"queryLatencyAverageMicros" : 434161.6666666667,
"totalOps" : NumberLong(12),
"totalOps/s" : 2.3032541141876615,
 

benchRun({parallel: 1, seconds: 5, ops: [{op:"find", ns:"sbe-perf.LS", query: {array:7}, readCmd: true}]})

results:
"queryLatencyAverageMicros" : 872580.1666666666,
"totalOps" : NumberLong(6),
"totalOps/s" : 1.1460142483951503,

The second benchRun is about twice as slow. While we do expect array access to be slower, the slowdown in SBE is much bigger than in the classic engine. For the classic engine the numbers are, respectively:

"queryLatencyAverageMicros" : 419583.25,
"totalOps" : NumberLong(12),
"totalOps/s" : 2.3832732351265875,

and

"queryLatencyAverageMicros" : 657760,
"totalOps" : NumberLong(8),
"totalOps/s" : 1.5202920024849174,

 

The top 10 consumers of CPU in SBE

Scalar
+   15.99%  mongod              [.] mongo::sbe::vm::ByteCode::runInternal
+    8.10%  mongod              [.] __wt_txn_read_upd_list_internal
+    7.47%  mongod              [.] __wt_btcur_next_prefix
+    4.92%  mongod              [.] mongo::sbe::ProjectStage::getNext
+    4.08%  mongod              [.] mongo::WiredTigerRecordStoreCursorBase::next
+    3.70%  mongod              [.] __wt_cursor_get_valuev
+    3.42%  mongod              [.] mongo::sbe::vm::ByteCode::run
+    3.38%  mongod              [.] mongo::sbe::ScanStage::getNext
+    3.34%  mongod              [.] __unpack_read
+    3.03%  mongod              [.] __wt_cursor_get_keyv

 
Array
+   21.13%  mongod              [.] mongo::sbe::vm::ByteCode::runInternal
+    5.97%  mongod              [.] mongo::sbe::vm::ByteCode::run
+    4.49%  mongod              [.] mongo::sbe::ProjectStage::getNext
+    4.44%  libc-2.27.so        [.] __strlen_avx2
+    4.32%  mongod              [.] mongo::sbe::TraverseStage::traverse
+    4.27%  mongod              [.] __wt_btcur_next_prefix
+    3.93%  mongod              [.] __wt_txn_read_upd_list_internal
+    2.40%  mongod              [.] mongo::sbe::vm::ByteCode::getField
+    2.28%  mongod              [.] __wt_cursor_get_valuev
+    2.02%  mongod              [.] mongo::WiredTigerRecordStoreCursorBase::next

Flame graphs are attached.

 



 Comments   
Comment by David Storch [ 26/Sep/22 ]

mihai.andrei@mongodb.com do we have a benchmark like this running in Evergreen anywhere? We have Queries.WildcardIndex.PointQueryOnSingleArrayField* which tests array traversal, but I'm not sure that we have a simple test with a sizeable data set that intends to emphasize array traversal. Are you aware of any pre-existing test like this? If not, I'm not sure which line item in the design document would cover the addition of this new benchmark.

Comment by Kyle Suarez [ 26/Sep/22 ]

Nice work martin.neupauer@mongodb.com; I'm gonna mark this as being fixed by SERVER-60718.

Comment by Kyle Suarez [ 21/Sep/22 ]

Assigning to mihai.andrei@mongodb.com to confirm we've made progress here. If there's still a regression – at the risk of building up the "hype train" – please try applying Martin's patch?

Comment by Irina Yatsenko (Inactive) [ 07/Jan/22 ]

The plan for a single-field point query (whether the field is an array or not):

> db.arrays.find({a16: 17}).explain().queryPlanner.winningPlan.slotBasedPlan.stages
[1] filter {fillEmpty (s8, false)}
[1] traverse s8 s7 s6 [s4, s5] {s8 || s7} {s8}
from
[1] project [s6 = getField (s4, "a16")]
[1] scan s4 s5 none none none none [] @"1b70f805-2a10-4957-88a5-01b559d5c1a5" true false
in
[1] project [s7 = fillEmpty (s6 == 17, false)]
[1] limit 1
[1] coscan

Comment by Irina Yatsenko (Inactive) [ 19/Nov/21 ]

Collected hardware perf counters for accessing array fields in a dataset with 10^6 records, where each record has the following schema:

{a0: int, a1: [int], a4: [int, int, int, int], a8: [8 ints], a16: [16 ints]}

The integers are randomly generated in the range [0, 10). Each query has the form find({x: 17}), where "x" is one of the aN fields; by the choice of values, the query returns no results and has to traverse every array in the queried field fully. No indexes are used.
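
A sketch of how such a dataset could be generated from the shell, analogous to the sketch in the description (the collection name "arrays" is borrowed from the explain() example above and is an assumption, as is the batching):

// Hypothetical generator for the aN schema used in the counter runs below.
Random.setRandomSeed();
function randArray(n) {
    var a = [];
    for (var i = 0; i < n; i++) {
        a.push(Random.randInt(10));
    }
    return a;
}
var coll = db.arrays;
coll.drop();
var batch = [];
for (var i = 0; i < 1000 * 1000; i++) {
    batch.push({
        a0: Random.randInt(10),
        a1: randArray(1),
        a4: randArray(4),
        a8: randArray(8),
        a16: randArray(16)
    });
    if (batch.length === 10000) {
        coll.insertMany(batch);
        batch = [];
    }
}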

a0 (scalar):
611.75 msec task-clock # 0.186 CPUs utilized
178 context-switches # 0.291 K/sec
0 cpu-migrations # 0.000 K/sec
0 page-faults # 0.000 K/sec
1,739,733,925 cycles # 2.844 GHz
3,612,625,207 instructions # 2.08 insn per cycle
693,236,064 branches # 1133.209 M/sec
813,732 branch-misses # 0.12% of all branches

a1 (1-element array):
708.39 msec task-clock # 0.230 CPUs utilized
188 context-switches # 0.265 K/sec
0 cpu-migrations # 0.000 K/sec
0 page-faults # 0.000 K/sec
2,095,617,912 cycles # 2.958 GHz
4,384,588,252 instructions # 2.09 insn per cycle
851,208,796 branches # 1201.619 M/sec
1,136,682 branch-misses # 0.13% of all branches

a4 (4-element array):
1,234.99 msec task-clock # 0.342 CPUs utilized
298 context-switches # 0.241 K/sec
0 cpu-migrations # 0.000 K/sec
0 page-faults # 0.000 K/sec
3,762,001,254 cycles # 3.046 GHz
8,883,850,249 instructions # 2.36 insn per cycle
1,705,220,567 branches # 1380.757 M/sec
5,253,857 branch-misses # 0.31% of all branches

a8 (8-element array):
1,894.53 msec task-clock # 0.409 CPUs utilized
451 context-switches # 0.238 K/sec
0 cpu-migrations # 0.000 K/sec
1 page-faults # 0.001 K/sec
5,846,699,734 cycles # 3.086 GHz
14,827,692,021 instructions # 2.54 insn per cycle
2,833,922,808 branches # 1495.848 M/sec
9,336,111 branch-misses # 0.33% of all branches

a16 (16-element array):
4,076.74 msec task-clock # 0.586 CPUs utilized
945 context-switches # 0.232 K/sec
0 cpu-migrations # 0.000 K/sec
0 page-faults # 0.000 K/sec
10,593,588,736 cycles # 2.599 GHz
26,444,206,767 instructions # 2.50 insn per cycle
5,022,740,663 branches # 1232.048 M/sec
18,697,474 branch-misses # 0.37% of all branches

The stats show a linear dependency of the instruction count on the array size, at ~1450 instructions per element. Measuring the same queries in the classic engine yields similar IPC values for all array sizes but much lower overall instruction counts (for the 16-element array, ~9,296M instructions overall compared to 26,444M in SBE). The dependency of the instruction count on array size in the classic engine is also linear, but at ~320 instructions per element.
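
As a rough sanity check on the ~1450 figure, taking the a16 and a0 counter blocks above: (26,444M − 3,612M instructions) / (16 elements × 10^6 documents) ≈ 1,430 instructions per array element in SBE.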

Given the flamegraphs, I believe most of the overhead in the SBE engine is due to the VM.

Comment by Kyle Suarez [ 22/Oct/21 ]

ethan.zhang, eric.cox and irina.yatsenko, we are sending these SBE performance issues to the $group epic. Let us know if you think it belongs in a separate project.
