[DRIVERS-2666] Standardize performance testing infrastructure Created: 30/Jun/23  Updated: 07/Dec/23

Status: Implementing
Project: Drivers
Component/s: Performance Benchmarking
Fix Version/s: None

Type: Improvement Priority: Unknown
Reporter: Daria Pardue Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Node-splunk-perf-alert-slack-config.png     PNG File Node-splunk-perf-alert.png     PNG File Node-splunk-slack-alert-example.png     PNG File plugins_menu.PNG    
Issue Links:
Issue split
split to GODRIVER-2898 Standardize performance testing infra... Backlog
split to RUBY-3290 Standardize performance testing infra... Backlog
split to CXX-2710 Standardize performance testing infra... In Code Review
split to CDRIVER-4676 Standardize performance testing infra... Closed
split to CSHARP-4713 Standardize performance testing infra... Closed
split to JAVA-5065 Standardize performance testing infra... Closed
split to MOTOR-1149 Standardize performance testing infra... Closed
split to NODE-5440 Standardize performance testing infra... Closed
split to PHPLIB-1187 Standardize performance testing infra... Closed
split to PYTHON-3823 Standardize performance testing infra... Closed
split to RUST-1698 Standardize performance testing infra... Closed
Related
related to CSHARP-4670 Implement Drivers Performance Benchma... Closed
related to DRIVERS-2557 Integrate with the Server Performance... Closed
is related to DRIVERS-2779 Standardize performance benchmark rep... Backlog
Driver Changes: Needed - No Spec Changes
Quarter: FY24Q3
Downstream Changes Summary:

PTAL at the description in the DRIVERS ticket. The Node impl. is also available for reference.

Start date:
Driver Compliance:
Key            Status/Resolution  FixVersion
CDRIVER-4676   Fixed              1.25.0
CXX-2710       In Code Review
CSHARP-4713    Done               2.24.0
GODRIVER-2898  Backlog
JAVA-5065      Fixed              4.11.0
NODE-5440      Duplicate
MOTOR-1149     Won't Do
PYTHON-3823    Fixed              4.7
PHPLIB-1187    Done
RUBY-3290      Backlog
RUST-1698      Fixed              2.8.0

 Description   

See the note on DRIVERS-2557.

Summary

Drivers should ensure that any performance testing, including but not limited to the driver spec performance benchmarks:

  • uses a dedicated distro in Evergreen (to avoid fluctuations in performance due to distro changes)
  • uses a patch-pinned server version for integration performance tests (to avoid fluctuations in performance due to changes in the server's performance profile)
  • utilizes the performance analytics backend for change point detection via the perf.send command (the tooling uses sophisticated algorithms to detect real points of change in performance and minimize noise)
  • sets up actionable alerting based on performance results

Motivation

Who is the affected end user?

Driver teams. This should allow them to get ahead of performance regressions.

How does this affect the end user?

N/A

How likely is it that this problem or use case will occur?

N/A

If the problem does occur, what are the consequences and how severe are they?

N/A

Is this issue urgent?

The sooner the drivers implement the standardized architecture, the sooner they can start building up a history of reliable performance data.

Is this ticket required by a downstream team?

No

Is this ticket only for tests?

Yes

Acceptance Criteria

Teams can begin implementing this ticket directly. Depending on the team's existing setup, teams may choose to create an epic to address different aspects of the work outlined here.

  • The dedicated performance distro is rhel90-dbx-perf-large (others can be created if needed in coordination with the build team)
  • Drivers Evergreen Tools exposes the patch-pinned v6 server version 6.0.6, which can be referenced via the `v6.0-perf` version alias (the analogous perf-stable v7 version will be added later)
  • To use the performance analytics backend, it is sufficient to invoke the perf.send command in the Evergreen run.
  • Notifications for change point detection can be set up in Kanopy Splunk and sent directly to Slack (there are many additional notification options available)
    • NOTE: Change point detection works retrospectively: as new data flows in, it can detect statistically significant changes in the distribution that did not exist before and create change points in the past. A large, prominent, and sustained change is usually detected within a few days of the commit date. In noisier time series, or for less prominent changes, it can take a while before a change point is detected on a commit. The sample queries below do not limit the date range of the commits considered for change point detection, but such a limit can be added to the query if notifications for commits too far in the past become too noisy. Here's an example query limiting the search range to 60 days:

      message="New change point detected." index="server-tig-prod" | spath output=project path="change_point.time_series_info.project" | search project IN ("mongo-node-driver-next", "node-bson") | spath output=run_date path="change_point.evg_create_date" | eval days_since=(now()-strptime(run_date, "%Y-%m-%dT%H:%M:%S%:z"))/86400 | search days_since < 60
      

    • NOTE #2: All change points can be triaged, linked to Jira tickets, and marked as true or false positives in the build baron UI: https://performance-monitoring-and-analysis.server-tig.prod.corp.mongodb.com/baron (sample filter for the node project). However, this UI is somewhat clunky and, considering the expected volume of change points for a typical driver project, may not be the most efficient way for drivers to act on true positives. Drivers may therefore choose to implement their own process for triaging change points without formally marking each one in the build baron system.
    • NOTE #3: Remember to set appropriate read/write permissions for your alert. Read permissions can safely be set to everyone. However, to restrict edit permissions on your custom alert to just your team, your team's mana group must be mapped to a Kanopy Splunk role; if your team does not appear in the role list, you will need to file an IT ticket to request that it be added.
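
The age filter in the 60-day query from the first NOTE can be sketched in Python to show the arithmetic. This is only an illustration, not part of the alert setup: the function names and the sample timestamps are hypothetical, and the timestamp layout (ISO 8601 with a `+00:00`-style offset) is assumed from the `strptime` format used in that query.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400
# Assumed layout of change_point.evg_create_date, mirroring the query's
# "%Y-%m-%dT%H:%M:%S%:z". Python's %z accepts offsets written as +00:00.
RUN_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def days_since(run_date: str, now: datetime) -> float:
    """Age in days, like (now()-strptime(run_date, ...))/86400 in the query."""
    created = datetime.strptime(run_date, RUN_DATE_FORMAT)
    return (now - created).total_seconds() / SECONDS_PER_DAY


def is_recent(run_date: str, now: datetime, max_days: float = 60) -> bool:
    """True when the change point falls inside the allowed window."""
    return days_since(run_date, now) < max_days


# Hypothetical sample values, for illustration only.
now = datetime(2023, 12, 7, tzinfo=timezone.utc)
print(is_recent("2023-11-01T12:00:00+00:00", now))  # about 35.5 days old
print(is_recent("2023-06-30T12:00:00+00:00", now))  # about 159.5 days old
```

The first sample is inside the 60-day window and the second is not, so only the first would produce a notification under such a filter.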

Sample Splunk query for a single Evergreen project:

message="New change point detected." index="server-tig-prod" | spath "change_point.time_series_info.project" | search "change_point.time_series_info.project"="mongo-node-driver-next"

Sample Splunk query for multiple Evergreen projects:

message="New change point detected." index="server-tig-prod" | spath "change_point.time_series_info.project" | search "change_point.time_series_info.project" IN ("mongo-node-driver-next", "node-bson")
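
To make the selection logic of the two queries above concrete, here is a minimal Python sketch of the same filter applied to already-parsed events. The nested event shape is assumed from the field path used in the queries, and the helper names and sample events are hypothetical; the real filtering happens inside Splunk.

```python
from typing import Any, Dict, Iterable, List, Optional


def project_of(event: Dict[str, Any]) -> Optional[str]:
    """Read change_point.time_series_info.project, or None if it is absent."""
    info = event.get("change_point", {}).get("time_series_info", {})
    return info.get("project")


def filter_by_project(events: Iterable[Dict[str, Any]],
                      projects: Iterable[str]) -> List[Dict[str, Any]]:
    """Keep only events whose project is one of the listed Evergreen projects."""
    wanted = tuple(projects)
    return [event for event in events if project_of(event) in wanted]


# Hypothetical events, for illustration only.
events = [
    {"change_point": {"time_series_info": {"project": "mongo-node-driver-next"}}},
    {"change_point": {"time_series_info": {"project": "some-other-project"}}},
    {"change_point": {}},
]
matches = filter_by_project(events, ["mongo-node-driver-next", "node-bson"])
print(len(matches))
```

Passing a single project name in the list corresponds to the single-project query, and passing several corresponds to the `IN (...)` form.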

Sample notification message:

New change point from `$result.change_point.commit_date$`
*$result.change_point.message$* (<https://spruce.mongodb.com/task/$result.change_point.task_id$/trend-charts|CI Link>)
 
```
Project: $result.change_point.time_series_info.project$
Variant: $result.change_point.time_series_info.variant$
Task: $result.change_point.time_series_info.task$
Test: $result.change_point.time_series_info.test$
Measurement: $result.change_point.time_series_info.measurement$
Percent change: $result.change_point.percent_change$
```
 
Included fields: change_point.repo_full_name,change_point.branch
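
As a rough illustration of how the `$result.…$` tokens in the sample message map onto fields of a change point event, the following Python sketch substitutes each token with the value found at that dotted path. It is an approximation written for this ticket, not Splunk's implementation: the `render` helper, the regular expression, the sample data, and the choice to leave unresolved tokens untouched are all assumptions.

```python
import re
from typing import Any, Dict

# Matches tokens such as $result.change_point.commit_date$ (assumed syntax).
TOKEN = re.compile(r"\$result\.([A-Za-z0-9_.]+)\$")


def render(template: str, result: Dict[str, Any]) -> str:
    """Replace each $result.<path>$ token with the value at that dotted path."""

    def substitute(match: "re.Match[str]") -> str:
        value: Any = result
        for key in match.group(1).split("."):
            if not isinstance(value, dict) or key not in value:
                # Sketch-only choice: leave unknown tokens as they are.
                return match.group(0)
            value = value[key]
        return str(value)

    return TOKEN.sub(substitute, template)


# Hypothetical event data, for illustration only.
template = (
    "New change point from `$result.change_point.commit_date$`\n"
    "Percent change: $result.change_point.percent_change$"
)
result = {"change_point": {"commit_date": "2023-11-01", "percent_change": -12.5}}
print(render(template, result))
```

Fields that the message references must be present in the search result for the tokens to resolve, which is presumably why `change_point.repo_full_name` and `change_point.branch` are listed under included fields.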

 


Generated at Thu Feb 08 08:26:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.