[SERVER-53036] [SBE] Investigate why several sys-perf benchmarks fail Created: 23/Nov/20  Updated: 27/Oct/23  Resolved: 11/May/21

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Drew Paroski Assignee: Backlog - Query Execution
Resolution: Gone away Votes: 0
Labels: post-rc0, qexec-team, sbe-post-rc0
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-54423 Re-run the sys-perf/bestbuy benchmarks Closed
Duplicate
is duplicated by SERVER-53074 [SBE] Add support for $map aggregatio... Closed
Related
is related to SERVER-51655 Investigate sys-perf benchmark perfor... Closed
Assigned Teams:
Query Execution
Sprint: Query Execution 2021-05-31
Participants:

Description

In November 2020, I wrote a diff that enables SBE mode by default, and then I ran all of the sys-perf benchmarks with it. Diff:

diff --git a/src/mongo/db/query/query_knobs.idl b/src/mongo/db/query/query_knobs.idl
index 52651caa6f..f6653beb03 100644
--- a/src/mongo/db/query/query_knobs.idl
+++ b/src/mongo/db/query/query_knobs.idl
@@ -410,7 +410,7 @@ server_parameters:
     set_at: [ startup, runtime ]
     cpp_varname: "internalQueryEnableSlotBasedExecutionEngine"
     cpp_vartype: AtomicWord<bool>
-    default: false
+    default: true
 
   internalQueryDefaultDOP:
     description: "Default degree of parallelism. This an internal experimental parameter and should not be changed on live systems."
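Since the knob is declared with set_at: [ startup, runtime ], the same configuration can presumably be reached on a developer machine without patching the IDL default, by passing the parameter on the mongod command line. The following is only a sketch of that idea: the helper name, the dbpath, the port, and the replica set name "rs0" are all hypothetical, and it assumes a locally built mongod is on PATH.

```python
def sbe_mongod_argv(dbpath, repl_set="rs0", port=27017, mongod="mongod"):
    """Build a mongod command line that enables SBE at startup.

    The dbpath, repl_set, and port values are placeholders for a local
    repro; only the server parameter name comes from the diff above.
    """
    return [
        mongod,
        "--dbpath", dbpath,
        "--port", str(port),
        "--replSet", repl_set,  # the failing sys-perf tasks use a 1-node replSet
        "--setParameter", "internalQueryEnableSlotBasedExecutionEngine=true",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(sbe_mongod_argv("/tmp/sbe-repro")))
```

Printing the argument list rather than launching the process keeps the sketch side-effect free; the list can be handed to subprocess.Popen once the paths are adjusted.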

sys-perf Evergreen run: https://spruce.mongodb.com/version/5fb8666657e85a0819d36b59/tasks

Under SBE mode, for the "industry_benchmarks", "industry_benchmarks_wmajority", "linkbench", and "ycsb_60GB" tasks, the loading phase succeeds ("ycsb_load" or "linkbench_load"), but then the subsequent phase fails ("ycsb_100read", "ycsb_95read5update_w_majority", or "linkbench_request").

In the main Evergreen log for each of these tasks, I see the following error:

[2020/11/21 02:13:34.496] 02:13:34Z>  about to fork child process, waiting until server is ready for connections.
[2020/11/21 02:13:35.721] 02:13:34Z>  forked process: 35542
[2020/11/21 02:13:35.721] 02:13:35Z>  ERROR: child process failed, exited with 4
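To check the other task logs for the same symptom, a small scan for the "child process failed" line may help. This is a sketch that assumes the message always has the wording shown above; the function name is made up for illustration.

```python
import re

# Matches the failure line printed when the forked mongod exits early.
CHILD_FAILED = re.compile(r"ERROR: child process failed, exited with (\d+)")


def child_exit_codes(log_text):
    """Return every exit code reported by a 'child process failed' line."""
    return [int(m.group(1)) for m in CHILD_FAILED.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "[2020/11/21 02:13:35.721] 02:13:34Z>  forked process: 35542\n"
        "[2020/11/21 02:13:35.721] 02:13:35Z>  ERROR: child process failed, exited with 4\n"
    )
    print(child_exit_codes(sample))
```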

When I looked at the "mongod.log" file for the phase that failed ("ycsb_100read", "ycsb_95read5update_w_majority", or "linkbench_request"), it appeared that mongod encountered an error during startup and then gave up and shut down. Here is a snippet from the "mongod.log" file:

{"t":{"$date":"2020-11-21T02:51:19.012+00:00"},"s":"E", "c":"CONTROL", "id":20539, "ctx":"initandlisten","msg":"Failed to verify auth schema version","attr":{"minSchemaVersion":3,"error":{"code":13436,"codeName":"NotPrimaryOrSecondary","errmsg":"not master or secondary; cannot currently read from this replSet member"}}}
{"t":{"$date":"2020-11-21T02:51:19.012+00:00"},"s":"I", "c":"CONTROL", "id":20540, "ctx":"initandlisten","msg":"To manually repair the 'authSchema' document in the admin.system.version collection, start up with --setParameter startupAuthSchemaValidation=false to disable validation"}
{"t":{"$date":"2020-11-21T02:51:19.012+00:00"},"s":"I", "c":"REPL", "id":4784900, "ctx":"initandlisten","msg":"Stepping down the ReplicationCoordinator for shutdown","attr":{"waitTimeMillis":15000}}
..
{"t":{"$date":"2020-11-21T02:51:19.140+00:00"},"s":"I", "c":"CONTROL", "id":23138, "ctx":"initandlisten","msg":"Shutting down","attr":{"exitCode":4}}
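Because each mongod.log line is a JSON document, the startup failure can be pulled out of a full log mechanically. The sketch below relies only on the fields visible in the snippet above ("s", "id", "msg", "attr"); the function name is hypothetical, and treating severities "E" and "F" as the interesting ones is my assumption.

```python
import json


def summarize_startup_failure(lines):
    """Collect error-level entries and the shutdown exit code from mongod.log lines.

    Returns (errors, exit_code), where errors is a list of
    (log id, message, error codeName or None) tuples.
    """
    errors = []
    exit_code = None
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip elisions such as ".." and blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated lines
        attr = entry.get("attr", {})
        if entry.get("s") in ("E", "F"):
            err = attr.get("error")
            code_name = err.get("codeName") if isinstance(err, dict) else None
            errors.append((entry.get("id"), entry.get("msg"), code_name))
        if entry.get("msg") == "Shutting down":
            exit_code = attr.get("exitCode")
    return errors, exit_code


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py path/to/mongod.log
    with open(sys.argv[1], encoding="utf-8") as f:
        print(summarize_startup_failure(f))
```

Run against the snippet above, this should report log id 20539 with codeName "NotPrimaryOrSecondary" and exit code 4, which makes it easy to confirm that all four failing tasks stop at the same point.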

The goal of this task is to investigate and understand precisely why mongod is hitting the "not master or secondary; cannot currently read from this replSet member" error during startup for these benchmarks, to develop a repro that can be done on a developer's local machine, and to open a new task (or update this task) with these findings.

Below are links to the Evergreen runs for each of the failing benchmarks:

industry_benchmarks: https://spruce.mongodb.com/task/sys_perf_linux_1_node_replSet_industry_benchmarks_patch_73d1a6f368b04161dce7c0afbcea23efb52e2070_5fb8666657e85a0819d36b59_20_11_21_01_00_30/

industry_benchmarks_wmajority: https://spruce.mongodb.com/task/sys_perf_linux_1_node_replSet_industry_benchmarks_wmajority_patch_73d1a6f368b04161dce7c0afbcea23efb52e2070_5fb8666657e85a0819d36b59_20_11_21_01_00_30/

linkbench: https://spruce.mongodb.com/task/sys_perf_linux_1_node_replSet_linkbench_patch_73d1a6f368b04161dce7c0afbcea23efb52e2070_5fb8666657e85a0819d36b59_20_11_21_01_00_30/

ycsb_60GB: https://spruce.mongodb.com/task/sys_perf_linux_1_node_replSet_ycsb_60GB_patch_73d1a6f368b04161dce7c0afbcea23efb52e2070_5fb8666657e85a0819d36b59_20_11_21_01_00_30/

 

