[SERVER-44294] Cap runtime of generated tasks Created: 29/Oct/19  Updated: 29/Oct/23  Resolved: 02/Dec/19

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: 4.3.1
Fix Version/s: 4.3.3

Type: Improvement Priority: Major - P3
Reporter: Robert Guo (Inactive) Assignee: David Bradford (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: DAG 2019-12-16
Participants:
Story Points: 2

 Description   

When an engineer tries to repro a test failure, they sometimes add a large resmoke_repeat_suites value to evergreen.yml. This causes the generated tasks to compute a large Evergreen timeout, potentially leaving a host running for a long time.

We should cap the runtime of generated tasks and either error out and inform the user of the maximum repeat number they can use, or internally reduce the repeat count to a smaller number.

Almost always, if an issue fails to repro after 48 hours, it is unlikely to repro at all. This can indicate a bug in the way the repro is set up, or a problem with the machine the original failure occurred on.

AC:

  • Fail tasks that we expect to run over the specified time limit.
  • Provide a message to the user explaining why the task was failed and what they can do if they want to work around it (see the sketch below).
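
A minimal sketch of such a check, assuming the generator knows an average per-iteration runtime for the suite. The function and variable names are illustrative, not the actual buildscripts code, and the 48-hour cap is taken from the discussion in the comments:

    # Illustrative only: the names and the 48-hour cap are assumptions, not the
    # actual mongodb/mongo buildscripts implementation.
    from math import floor

    MAX_GENERATED_TASK_RUNTIME_SECS = 48 * 60 * 60


    def check_repeat_cap(avg_iteration_runtime_secs: float, repeat_num: int) -> None:
        """Fail task generation if the projected runtime exceeds the cap."""
        projected_secs = avg_iteration_runtime_secs * repeat_num
        if projected_secs > MAX_GENERATED_TASK_RUNTIME_SECS:
            max_repeats = floor(MAX_GENERATED_TASK_RUNTIME_SECS / avg_iteration_runtime_secs)
            raise ValueError(
                f"resmoke_repeat_suites={repeat_num} is projected to run "
                f"~{projected_secs / 3600:.1f} hours, which exceeds the "
                f"{MAX_GENERATED_TASK_RUNTIME_SECS // 3600}-hour cap for generated tasks. "
                f"Reduce resmoke_repeat_suites to at most {max_repeats}.")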


 Comments   
Comment by Githook User [ 02/Dec/19 ]

Author:

{'email': 'david.bradford@mongodb.com', 'name': 'David Bradford', 'username': 'dbradf'}

Message: SERVER-44294: Cap runtime of generated tasks
Branch: master
https://github.com/mongodb/mongo/commit/4042edbdbea9ef9572e96adfc75b3e3f08b1bdde

Comment by Robert Guo (Inactive) [ 18/Nov/19 ]

My intention was really just to prevent a single task from taking up all the Evergreen hosts for a week, which hopefully would be a rather quick fix / sanity check. Here's my 2c in response to David's questions; hopefully it'll clarify what I wanted to say:

Do we want to limit by runtime, limit by number of iterations, or something else?

We want to limit by runtime. Limiting by iteration is arguably more reasonable but many tests (e.g. concurrency) iterate sub-tests themselves, so resmoke doesn't know what the "real" iteration is.

If the user is going to hit the limit, what should we do?

Any solution works. We can run up to the limit and tell the user loudly at the end that the limit has been reduced. Like I said earlier, no tests repro after 48 hours. So if we set the limit to 48 hours and nothing repros, we can assume the test won't repro ever.
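
For the "run up to the limit and warn" option, a rough sketch (again with made-up names, not actual resmoke or buildscripts code) could clamp the repeat count instead of failing:

    # Illustrative alternative: silently clamp the repeat count and warn the user.
    import logging

    LOGGER = logging.getLogger(__name__)
    MAX_GENERATED_TASK_RUNTIME_SECS = 48 * 60 * 60


    def clamp_repeat_num(avg_iteration_runtime_secs: float, requested_repeats: int) -> int:
        """Reduce the repeat count so the projected runtime stays under the cap."""
        max_repeats = int(MAX_GENERATED_TASK_RUNTIME_SECS // avg_iteration_runtime_secs)
        if requested_repeats > max_repeats:
            LOGGER.warning(
                "Reducing resmoke_repeat_suites from %d to %d to stay under the %d-hour cap.",
                requested_repeats, max_repeats, MAX_GENERATED_TASK_RUNTIME_SECS // 3600)
            return max_repeats
        return requested_repeats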

We can also error out early when we generate the tasks and tell users to reduce the number of iterations with a recommended number that will be below the 48 hour limit.

Should we allow users some way to override the limit in certain cases?

No. If someone complains, feel free to direct them to STM. There's something wrong if a repro takes more than 48 hours.

We should also chat with Evergreen to see if they have any input as to what the limit should be

There are projects that are not mongodb-mongo-* that legitimately run things for more than 48 hours, so I'm not sure if a limit should be enforced system-wide in Evergreen.

Comment by David Bradford (Inactive) [ 18/Nov/19 ]

I think there should be some investigation of what we actually want to do here. Do we want to limit by runtime, limit by number of iterations, or something else? If the user is going to hit the limit, what should we do: fail the task, run the task up to the limit and succeed, or something else? Should we allow users some way to override the limit in certain cases? We should also chat with Evergreen to see if they have any input as to what the limit should be.

Comment by Robert Guo (Inactive) [ 01/Nov/19 ]

This one: https://evergreen.mongodb.com/version/5dae11857742ae6afd907006

Comment by David Bradford (Inactive) [ 01/Nov/19 ]

robert.guo Do you have an example of a patch where this happened you could link to?
