[SERVER-60721] Projection of computed fields is slow in SBE queries Created: 14/Oct/21  Updated: 27/Oct/23  Resolved: 19/Jan/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Irina Yatsenko (Inactive) Assignee: Backlog - Query Execution
Resolution: Gone away Votes: 0
Labels: pm2697-m3, sbe
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File sbe-project-arith.svg     File sbe-project-noarith.svg    
Issue Links:
Depends
Assigned Teams:
Query Execution
Operating System: ALL
Participants:
Story Points: 5

 Description   

Summary: a projection that computes new fields is ~30% slower in SBE than in the classical engine.

 

Create a collection with 10^6 documents that have four numeric fields a, b, c, d (make "c" non-zero so $divide doesn't throw)
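
The ticket does not include the load script. The following is a minimal sketch of one way to build such a collection from the mongo shell. The `makeDoc` and `loadCollection` helpers, the value ranges and the batch size are assumptions made for illustration; only the namespace `sbe-perf.LS` comes from the benchmarks in this ticket.

```javascript
// Hypothetical load script; the value ranges are illustrative, not from the ticket.
function makeDoc(i) {
    return {
        _id: i,
        a: (i % 201) - 100,  // integers in [-100, 100]
        b: i % 1000,         // integers in [0, 999]
        c: (i % 97) + 0.5,   // in [0.5, 96.5]; never zero, so $divide has a valid divisor
        d: (i % 301) - 150,  // integers in [-150, 150]
    };
}

// Inserts `total` documents into `coll` in batches of `batchSize`.
function loadCollection(coll, total, batchSize) {
    for (let start = 0; start < total; start += batchSize) {
        const batch = [];
        const end = Math.min(start + batchSize, total);
        for (let i = start; i < end; i++) {
            batch.push(makeDoc(i));
        }
        coll.insertMany(batch, {ordered: false});
    }
}

// Only runs inside a mongo shell, where the global `db` is defined.
if (typeof db !== "undefined") {
    const coll = db.getSiblingDB("sbe-perf").LS;
    coll.drop();
    loadCollection(coll, 1000000, 10000);
}
```

Plain JavaScript numbers are stored as doubles by the shell, so all four fields end up as double values with this sketch.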

let projectNoArith = {
    a1: "$a", b1: "$b", c1: "$c", d1: "$d",
    a2: "$a", b2: "$b", c2: "$c", d2: "$d",
} 

let projectWithArith = {
    an: {$abs: "$a"}, bn: {$mod: ["$b", 17]}, cn: {$floor: "$c"},
    dl: {$ln: {$add: [{$abs: "$d"}, 1]}},
    ab: {$add: ["$a", "$b"]}, cd: {$divide: ["$d", "$c"]},
} 

 
Run the following two benchmarks.

benchRun({parallel: 1, seconds: 5, ops: [{op:"find", ns:"sbe-perf.LS", query:{}, filter: projectNoArith, readCmd: true}]})

results
"queryLatencyAverageMicros" : 1366528.75,
"totalOps" : NumberLong(4),
"totalOps/s" : 0.7317748723876095,
 

benchRun({parallel: 1, seconds: 5, ops: [{op:"find", ns:"sbe-perf.LS", query:{}, filter: projectWithArith, readCmd: true}]})

results
"queryLatencyAverageMicros" : 3294403.5,
"totalOps" : NumberLong(2),
"totalOps/s" : 0.3035431529846867,
 
The same benchmarks in the classical engine produce the following results respectively:
"queryLatencyAverageMicros" : 2274520,
"totalOps" : NumberLong(3),
"totalOps/s" : 0.4396507531657053,
and
"queryLatencyAverageMicros" : 2406675.3333333335,
"totalOps" : NumberLong(3),
"totalOps/s" : 0.4155084944479063,
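
To put the four latency figures side by side, here is a small sketch that computes the SBE-to-classical latency ratio for each projection and the extra latency each engine pays once arithmetic is added. The numbers are the `queryLatencyAverageMicros` values reported above; the function and variable names are illustrative only.

```javascript
// queryLatencyAverageMicros values copied from the benchRun results in this ticket.
const latencyMicros = {
    noArith:   {sbe: 1366528.75, classic: 2274520},
    withArith: {sbe: 3294403.5,  classic: 2406675.3333333335},
};

// Ratio > 1 means SBE is slower than the classical engine for that projection.
function sbeOverClassic(name) {
    return latencyMicros[name].sbe / latencyMicros[name].classic;
}

// Extra latency (in microseconds) an engine pays for the arithmetic projection
// compared with the plain field-copy projection.
function arithmeticCostMicros(engine) {
    return latencyMicros.withArith[engine] - latencyMicros.noArith[engine];
}
```

`sbeOverClassic("noArith")` comes out at roughly 0.60 and `sbeOverClassic("withArith")` at roughly 1.37. `arithmeticCostMicros("sbe")` is about 1.93 seconds per pass over the collection, against about 0.13 seconds for `arithmeticCostMicros("classic")`. Each run completed only 2 to 4 operations, so these ratios are rough.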
 
SBE plan for projectNoArith

[2] traverse s15 s14 s4 [s5] {} {}
 from
 [1] scan s4 s5 none none none none [] @"c494dfc1-7ed7-45e7-a46d-b253a1e532db" true false
 in
 [2] mkbson s14 s4 [_id] keep [a1 = s6, b1 = s7, c1 = s8, d1 = s9, a2 = s10, b2 = s11, c2 = s12, d2 = s13] true false
 [2] project [s13 = getField (s4, "d")]
 [2] project [s12 = getField (s4, "c")]
 [2] project [s11 = getField (s4, "b")]
 [2] project [s10 = getField (s4, "a")]
 [2] project [s9 = getField (s4, "d")]
 [2] project [s8 = getField (s4, "c")]
 [2] project [s7 = getField (s4, "b")]
 [2] project [s6 = getField (s4, "a")]
 [2] limit 1
 [2] coscan

 
SBE plan for projectWithArith

[2] traverse s21 s20 s4 [s5] {} {}
 from
 [1] scan s4 s5 none none none none [] @"c494dfc1-7ed7-45e7-a46d-b253a1e532db" true false
 in
 [2] mkbson s20 s4 [_id] keep [an = s7, bn = s9, cn = s11, dl = s13, ab = s16, cd = s19] true false
 [2] project [s7 = let [l1.0 = s6] if (! exists (l1.0) || typeMatch (l1.0, 0x00000440), null, if (! isNumber (l1.0), fail ( 4903700 ,$abs only supports numeric types), if (typeMatch (l1.0, 0x00040000) && l1.0 == -9223372036854775808, fail ( 4903701 ,can't take $abs of long long min), abs (l1.0)))), s9 = let [l2.0 = s8, l2.1 = 17] if (! exists (l2.0) || typeMatch (l2.0, 0x00000440) || ! exists (l2.1) || typeMatch (l2.1, 0x00000440), null, if (! isNumber (l2.0) || ! isNumber (l2.1), fail ( 5154000 ,$mod only supports numeric types), mod (l2.0, if (typeMatch (l2.1, 0x00000002) && ! typeMatch (l2.0, 0x00000002), fillEmpty (convert ( l2.1, int32), l2.1), l2.1)))), s11 = let [l3.0 = s10] if (! exists (l3.0) || typeMatch (l3.0, 0x00000440), null, if (! isNumber (l3.0), fail ( 4903704 ,$floor only supports numeric types), floor (l3.0))), s13 = let [l7.0 = let [l5.0 = let [l4.0 = s12] if (! exists (l4.0) || typeMatch (l4.0, 0x00000440), null, if (! isNumber (l4.0), fail ( 4903700 ,$abs only supports numeric types), if (typeMatch (l4.0, 0x00040000) && l4.0 == -9223372036854775808, fail ( 4903701 ,can't take $abs of long long min), abs (l4.0)))), l5.1 = 1] let [l6.0 = isDate (l5.0), l6.1 = isDate (l5.1)] if (! exists (l5.0) || typeMatch (l5.0, 0x00000440) || ! exists (l5.1) || typeMatch (l5.1, 0x00000440), null, if (! isNumber (l5.0) && ! isDate (l5.0) || ! isNumber (l5.1) && ! isDate (l5.1), fail ( 4974201 ,only numbers and dates are allowed in an $add expression), if (l6.0 && l6.1, fail ( 4974202 ,only one date allowed in an $add expression), if (l6.0 || l6.1, doubleDoubleSum (l5.0, l5.1), l5.0 + l5.1))))] if (! exists (l7.0) || typeMatch (l7.0, 0x00000440), null, if (! isNumber (l7.0), fail ( 4903705 ,$ln only supports numeric types), if (isNaN (l7.0), convert ( l7.0, double), if (l7.0 <= 0, fail ( 4903706 ,$ln's argument must be a positive number), ln (l7.0))))), s16 = let [l8.0 = s14, l8.1 = s15] let [l9.0 = isDate (l8.0), l9.1 = isDate (l8.1)] if (! exists (l8.0) || typeMatch (l8.0, 0x00000440) || ! exists (l8.1) || typeMatch (l8.1, 0x00000440), null, if (! isNumber (l8.0) && ! isDate (l8.0) || ! isNumber (l8.1) && ! isDate (l8.1), fail ( 4974201 ,only numbers and dates are allowed in an $add expression), if (l9.0 && l9.1, fail ( 4974202 ,only one date allowed in an $add expression), if (l9.0 || l9.1, doubleDoubleSum (l8.0, l8.1), l8.0 + l8.1)))), s19 = let [l10.0 = s17, l10.1 = s18] if (! exists (l10.0) || typeMatch (l10.0, 0x00000440) || ! exists (l10.1) || typeMatch (l10.1, 0x00000440), null, if (isNumber (l10.0) && isNumber (l10.1), l10.0 / l10.1, fail ( 5073101 ,$divide only supports numeric types)))]
 [2] project [s18 = getField (s4, "c")]
 [2] project [s17 = getField (s4, "d")]
 [2] project [s15 = getField (s4, "b")]
 [2] project [s14 = getField (s4, "a")]
 [2] project [s12 = getField (s4, "d")]
 [2] project [s10 = getField (s4, "c")]
 [2] project [s8 = getField (s4, "b")]
 [2] project [s6 = getField (s4, "a")]
 [2] limit 1
 [2] coscan
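
The hexadecimal constants passed to typeMatch in the plan above are easier to read once decoded. The sketch below assumes each mask has bit n set for BSON type code n. That assumption is consistent with 0x00040000 (bit 18, the BSON long type) guarding the "long long min" check, but it is not confirmed anywhere in this ticket. The `decodeTypeMask` helper is hypothetical; the type codes are the ones defined by the BSON specification.

```javascript
// BSON type codes as defined by the BSON specification (subset).
const BSON_TYPES = {
    1: "double", 2: "string", 3: "object", 4: "array", 5: "binData",
    6: "undefined", 7: "objectId", 8: "bool", 9: "date", 10: "null",
    16: "int", 17: "timestamp", 18: "long", 19: "decimal",
};

// Hypothetical helper: lists the BSON types selected by a typeMatch mask,
// assuming bit n of the mask corresponds to BSON type code n.
function decodeTypeMask(mask) {
    const names = [];
    for (let bit = 0; bit < 32; bit++) {
        if ((mask >>> bit) & 1) {
            names.push(BSON_TYPES[bit] || "type" + bit);
        }
    }
    return names;
}
```

Under that assumption, `decodeTypeMask(0x00000440)` returns `["undefined", "null"]`, `decodeTypeMask(0x00040000)` returns `["long"]` and `decodeTypeMask(0x00000002)` returns `["double"]`. Read this way, every arithmetic expression first checks for a missing, undefined or null input, then checks that the input is numeric, and only then performs the operation.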

 Top CPU consumers in SBE mode for projectNoArith

+ 13.82% mongod [.] mongo::sbe::vm::ByteCode::runInternal
+ 7.10% libc-2.27.so [.] __strlen_avx2
+ 6.03% mongod [.] mongo::sbe::vm::ByteCode::getField
+ 5.95% mongod [.] mongo::sbe::ProjectStage::getNext
+ 4.23% mongod [.] mongo::sbe::vm::ByteCode::run
+ 4.02% mongod [.] mongo::sbe::MakeObjStageBase<(mongo::sbe::MakeObjOutputType)1>::produceObject
+ 3.44% mongod [.] mongo::sbe::bson::advance
+ 3.22% mongod [.] mongo::BSONObjBuilderBase<mongo::UniqueBSONObjBuilder, mongo::UniqueBufBuilder>::append<double, void>
+ 2.83% mongod [.] mongo::(anonymous namespace)::GetMoreCmd::Invocation::acquireLocksAndIterateCursor
+ 2.68% mongod [.] mongo::sbe::ProjectStage::open
+ 2.43% mongod [.] __wt_btcur_next_prefix
+ 2.43% mongod [.] mongo::BasicBufBuilder<mongo::UniqueBufferAllocator>::appendStr
+ 2.43% mongod [.] mongo::sbe::bson::appendValueToBsonObj<mongo::UniqueBSONObjBuilder>
+ 1.70% libc-2.27.so [.] __memcmp_avx2_movbe
+ 1.53% libc-2.27.so [.] __memmove_avx_unaligned_erms
+ 1.49% mongod [.] __unpack_read
+ 1.41% mongod [.] mongo::BSONObjBuilderBase<mongo::UniqueBSONObjBuilder, mongo::UniqueBufBuilder>::_done
+ 1.36% mongod [.] mongo::WiredTigerRecordStoreCursorBase::next

Top CPU consumers in SBE mode for projectWithArith

+ 52.84% mongod [.] mongo::sbe::vm::ByteCode::runInternal
+ 3.70% libc-2.27.so [.] __strlen_avx2
+ 3.22% mongod [.] mongo::sbe::ProjectStage::getNext
+ 3.07% mongod [.] mongo::sbe::vm::ByteCode::swapStack
+ 2.52% mongod [.] mongo::sbe::vm::ByteCode::getField
+ 2.49% mongod [.] mongo::sbe::vm::ByteCode::run
+ 1.79% mongod [.] mongo::sbe::MakeObjStageBase<(mongo::sbe::MakeObjOutputType)1>::produceObject
+ 1.68% mongod [.] mongo::sbe::ProjectStage::open
+ 1.48% mongod [.] mongo::sbe::bson::advance
+ 1.21% mongod [.] mongo::(anonymous namespace)::GetMoreCmd::Invocation::acquireLocksAndIterateCursor
+ 1.20% mongod [.] mongo::BSONObjBuilderBase<mongo::UniqueBSONObjBuilder, mongo::UniqueBufBuilder>::append<double, void>
+ 1.20% mongod [.] __wt_btcur_next_prefix
+ 0.93% libm-2.27.so [.] __ieee754_log_fma
+ 0.80% mongod [.] mongo::PlanExecutorSBE::getNext
+ 0.80% mongod [.] mongo::sbe::value::tagToType
+ 0.78% mongod [.] mongo::sbe::value::OwnedValueAccessor::getViewOfValue
+ 0.73% mongod [.] mongo::BasicBufBuilder<mongo::UniqueBufferAllocator>::appendStr
+ 0.69% libm-2.27.so [.] __fmod_finite
+ 0.68% mongod [.] mongo::sbe::bson::appendValueToBsonObj<mongo::UniqueBSONObjBuilder>
+ 0.65% mongod [.] __unpack_read
+ 0.63% libc-2.27.so [.] __memcmp_avx2_movbe
+ 0.61% mongod [.] mongo::sbe::vm::ByteCode::dispatchBuiltin
+ 0.59% mongod [.] mongo::WiredTigerRecordStoreCursorBase::next
 
Flamegraphs are attached (sbe-project-arith.svg, sbe-project-noarith.svg).



 Comments   
Comment by Irina Yatsenko (Inactive) [ 24/Nov/21 ]

Collected perf stats on a dataset of 10^6 documents with an integer scalar field a0 in the range [0, 9], for a query of the following form with a varying number of computed pN fields.

aggregate({$project: {p1: {$add: ["$a0", 1]}}}, {$match: {p1: 17}})
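
A minimal sketch of how the pipeline for N computed fields could be generated. It assumes field pN adds the constant N to "$a0", which matches the two-field plan shown in this comment (constants 1 and 2) but is otherwise a guess. The `buildPipeline` helper is hypothetical.

```javascript
// Hypothetical helper: builds the $project + $match pipeline with
// `numComputed` computed fields p1..pN, where pN = a0 + N (assumed pattern).
function buildPipeline(numComputed) {
    const projection = {};
    for (let i = 1; i <= numComputed; i++) {
        projection["p" + i] = {$add: ["$a0", i]};
    }
    return [{$project: projection}, {$match: {p1: 17}}];
}
```

In the shell this would be run as `coll.aggregate(buildPipeline(8))`, where `coll` is the collection holding the dataset. Since a0 is in [0, 9], p1 never equals 17, so the $match presumably filters out every document and the measurement is dominated by the projection rather than by returning results.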

The query currently doesn't get fully lowered into SBE and keeps $match as a separate stage. The SBE part of the plan (for a query with two computed fields) looks like:

[2] traverse s11 s10 s4 [s5] {} {}
from
[1] scan s4 s5 none none none none [] @"1b70f805-2a10-4957-88a5-01b559d5c1a5" true false
in
[2] mkbson s10 s4 [_id] keep [p1 = s7, p2 = s9] true false
[2] project [s7 = let [l1.0 = s6, l1.1 = 1] let [l2.0 = isDate (l1.1), l2.1 = isDate (l1.0)] if (! exists (l1.0) || typeMatch (l1.0, 0x00000440) || ! exists (l1.1) || typeMatch (l1.1, 0x00000440), null, if (! isNumber (l1.0) && ! isDate (l1.0) || ! isNumber (l1.1) && ! isDate (l1.1), fail ( 4974201 ,only numbers and dates are allowed in an $add expression), if (l2.1 && l2.0, fail ( 4974202 ,only one date allowed in an $add expression), if (l2.1 || l2.0, doubleDoubleSum (l1.0, l1.1), l1.0 + l1.1)))), s9 = let [l3.0 = s8, l3.1 = 2] let [l4.0 = isDate (l3.1), l4.1 = isDate (l3.0)] if (! exists (l3.0) || typeMatch (l3.0, 0x00000440) || ! exists (l3.1) || typeMatch (l3.1, 0x00000440), null, if (! isNumber (l3.0) && ! isDate (l3.0) || ! isNumber (l3.1) && ! isDate (l3.1), fail ( 4974201 ,only numbers and dates are allowed in an $add expression), if (l4.1 && l4.0, fail ( 4974202 ,only one date allowed in an $add expression), if (l4.1 || l4.0, doubleDoubleSum (l3.0, l3.1), l3.0 + l3.1))))]
[2] project [s8 = getField (s4, "a0")]
[2] project [s6 = getField (s4, "a0")]
[2] limit 1
[2] coscan

Perf stat results

1 computed field

1,483.80 msec task-clock # 0.376 CPUs utilized
360 context-switches # 0.243 K/sec
0 cpu-migrations # 0.000 K/sec
34,952 page-faults # 0.024 M/sec
4,562,117,435 cycles # 3.075 GHz
10,098,925,016 instructions # 2.21 insn per cycle
1,860,237,735 branches # 1253.695 M/sec
7,311,130 branch-misses # 0.39% of all branches

2 computed fields

1,932.32 msec task-clock # 0.476 CPUs utilized
456 context-switches # 0.236 K/sec
0 cpu-migrations # 0.000 K/sec
30,621 page-faults # 0.016 M/sec
5,994,006,901 cycles # 3.102 GHz
14,008,480,223 instructions # 2.34 insn per cycle
2,503,866,512 branches # 1295.784 M/sec
12,847,545 branch-misses # 0.51% of all branches

4 computed fields

2,832.21 msec task-clock # 0.571 CPUs utilized
659 context-switches # 0.233 K/sec
0 cpu-migrations # 0.000 K/sec
33,767 page-faults # 0.012 M/sec
8,869,595,435 cycles # 3.132 GHz
21,811,243,826 instructions # 2.46 insn per cycle
3,788,936,870 branches # 1337.802 M/sec
24,957,628 branch-misses # 0.66% of all branches

8 computed fields

4,715.73 msec task-clock # 0.649 CPUs utilized
1,070 context-switches # 0.227 K/sec
0 cpu-migrations # 0.000 K/sec
88,978 page-faults # 0.019 M/sec
14,899,177,367 cycles # 3.159 GHz
37,485,872,534 instructions # 2.52 insn per cycle
6,376,840,802 branches # 1352.249 M/sec
44,722,770 branch-misses # 0.70% of all branches

The stats show a linear dependency between the number of instructions and the number of computed fields, at ~3900 instructions per computed field per document. The IPC numbers are similar in these scenarios for the classical and SBE engines, but the classical engine seems to do much less work. For example, for a query with 8 computed fields the stats for the classical engine are:

2,116.01 msec task-clock # 0.486 CPUs utilized
492 context-switches # 0.233 K/sec
0 cpu-migrations # 0.000 K/sec
30,568 page-faults # 0.014 M/sec
6,561,433,027 cycles # 3.101 GHz
16,081,883,763 instructions # 2.45 insn per cycle
2,904,774,760 branches # 1372.758 M/sec
2,992,870 branch-misses # 0.10% of all branches
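
The per-field instruction cost can be re-derived from the counters above. The sketch below divides the growth in SBE instruction counts by the number of added fields and by the 10^6 documents in the dataset, then compares SBE with the classical engine at 8 computed fields. The input numbers are copied from the perf output in this comment; the names are illustrative.

```javascript
const DOCS = 1e6;

// "instructions" counters from the SBE runs, keyed by number of computed fields.
const sbeInstructions = {
    1: 10098925016,
    2: 14008480223,
    4: 21811243826,
    8: 37485872534,
};

// "instructions" counter from the classical engine run with 8 computed fields.
const classicInstructions8 = 16081883763;

// Additional instructions per computed field per document between two runs.
function perFieldPerDoc(fromFields, toFields) {
    const delta = sbeInstructions[toFields] - sbeInstructions[fromFields];
    return delta / (toFields - fromFields) / DOCS;
}

// How many times more instructions SBE retires than classical at 8 fields.
function sbeOverClassicAt8() {
    return sbeInstructions[8] / classicInstructions8;
}
```

`perFieldPerDoc(1, 2)`, `perFieldPerDoc(2, 4)` and `perFieldPerDoc(4, 8)` come out at roughly 3910, 3901 and 3919, which is where the ~3900 figure comes from. `sbeOverClassicAt8()` is about 2.33. The branch-miss counters differ more sharply: 44,722,770 for SBE against 2,992,870 for classical, about 15 times as many.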

Note: In the future we expect the optimizer to elide repeated accesses to the same field (getField (s4, "a0")), but the point here is to demonstrate the cost of computing the expressions, so the same source field was used for convenience only.

Comment by Kyle Suarez [ 22/Oct/21 ]

ethan.zhang, eric.cox and irina.yatsenko, we are sending these SBE performance issues to the $group epic. Let us know if you think it belongs in a separate project.
