[SERVER-82662] Reduce sharding_jscore_passthrough task runtime Created: 01/Nov/23  Updated: 21/Nov/23  Resolved: 21/Nov/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Zack Winter Assignee: Max Hirschhorn
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screenshot 2023-11-21 at 8.28.47 AM.png    
Issue Links:
Problem/Incident
Assigned Teams:
Sharding NYC
Participants:
Linked BF Score: 41

 Description   

The sharding_jscore_passthrough test suite was hitting its 1hr 20min timeout, causing BF-30612. Please take a look at the suite and see if there's any way to reduce the task runtime without significantly decreasing the value of the test suite.



 Comments   
Comment by Max Hirschhorn [ 21/Nov/23 ]

The sharding_jscore_passthrough.yml test suite runs the tests from the jstests/core/ directory against a 1-shard sharded cluster. The tests in the jstests/core/ directory are largely for ensuring the API contracts of CRUD and DDL operations are upheld. There is definitely overlap among the tests in the basic operations they perform. However, without an investigation like DEVPROD-278 or a different project aimed at test minimization, I wouldn't feel comfortable removing a swath of them from the test suite.

Looking at the task timing, we can see that the sharding_jscore_passthrough task was taking ~40 minutes back in late 2022 and then had multiple step-function changes where it started to take ~65 minutes and eventually crept up to 80 minutes, hitting the exec timeout. I briefly looked through the commit range for one of the step-function increases, but none of the changes made it obvious to me where the extra time is coming from.

git log 85f9d6ae24d6dc491a5f7ea84054931a83f2781e~..61cfb094c61b3d23566718c0c130074f711ea61d --stat -- jstests/core/ buildscripts/resmokeconfig/suites/
git log 85f9d6ae24d6dc491a5f7ea84054931a83f2781e~..61cfb094c61b3d23566718c0c130074f711ea61d

I think the path forward is to convert all Evergreen tasks which run resmoke test suites to generated tasks and to add separate alerting for when test runtime or combined test suite runtime exceeds the desired spend. Ideally we would implement signal processing on the test runtime and combined test suite runtime so that we detect a slowdown when it happens, rather than reacting only after it crosses a constant threshold. CC alex.neben@mongodb.com
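
As a rough illustration of what detecting a step-function change (rather than a fixed threshold crossing) could look like, here is a minimal Python sketch. It is not an existing Evergreen or resmoke feature. The function name detect_step_changes, its parameters, and the sample runtimes are all hypothetical; the sample only mimics the ~40, ~65 and ~80 minute plateaus described above. The rule used is deliberately crude: flag a run when every one of the next few runs is slower than the median of the preceding runs by a given ratio. A real implementation would likely use a proper change-point detection method.

```python
from statistics import median


def detect_step_changes(runtimes_min, window=5, min_ratio=1.2):
    """Return indices where a sustained runtime increase begins.

    Hypothetical helper: index i is flagged when every one of the next
    `window` runs is at least `min_ratio` times the median of the
    previous `window` runs. Each step is reported once.
    """
    changes = []
    i = window
    while i + window <= len(runtimes_min):
        baseline = median(runtimes_min[i - window:i])
        slowest_allowed = min(runtimes_min[i:i + window])
        if baseline > 0 and slowest_allowed / baseline >= min_ratio:
            changes.append(i)
            i += window  # jump past this step so it is not reported again
        else:
            i += 1
    return changes


# Illustrative data only: per-run task runtimes in minutes, oldest first.
sample = [40] * 8 + [65] * 8 + [80] * 8
print(detect_step_changes(sample))  # -> [8, 16]
```

An alert keyed on the returned indices would fire at the first ~65 minute plateau, well before the task reaches the 80 minute exec timeout, and the index narrows the commit range to inspect.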

Generated at Thu Feb 08 06:49:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.