[SERVER-44294] Cap runtime of generated tasks Created: 29/Oct/19 Updated: 29/Oct/23 Resolved: 02/Dec/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | 4.3.1 |
| Fix Version/s: | 4.3.3 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Robert Guo (Inactive) | Assignee: | David Bradford (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible |
| Sprint: | DAG 2019-12-16 |
| Participants: | |
| Story Points: | 2 |
| Description |
|
When an engineer tries to repro a test failure, they sometimes add a large resmoke_repeat_suites number to evergreen.yml. This causes generated tasks to compute a large Evergreen timeout, potentially leaving a host running for a long time. We should cap the runtime of generated tasks and either error out and inform the user of the max repeat number they can use, or internally reduce the repeat count to a smaller number. Almost always, if an issue fails to repro after 48 hours, the repro is unlikely to happen at all. This can indicate a bug in the way the repro is set up, or something wrong with the machine the original failure occurred on. AC:
|
| Comments |
| Comment by Githook User [ 02/Dec/19 ] |
|
Author: {'email': 'david.bradford@mongodb.com', 'name': 'David Bradford', 'username': 'dbradf'}Message: |
| Comment by Robert Guo (Inactive) [ 18/Nov/19 ] |
|
My intention was really just to prevent a single task from taking up all the Evergreen hosts for a week, which hopefully would be a rather quick fix / sanity check. Here's my 2c in response to David's questions; hopefully it'll clarify what I wanted to say:
We want to limit by runtime. Limiting by iteration is arguably more reasonable but many tests (e.g. concurrency) iterate sub-tests themselves, so resmoke doesn't know what the "real" iteration is.
Any solution works. We can run up to the limit and tell the user loudly at the end that the limit has been reduced. Like I said earlier, no tests repro after 48 hours. So if we set the limit to 48 hours and nothing repros, we can assume the test won't repro ever. We can also error out early when we generate the tasks and tell users to reduce the number of iterations with a recommended number that will be below the 48 hour limit.
No. Until someone complains, then feel free to direct them to STM. There's something wrong if a repro takes more than 48 hours.
There are projects that are not mongodb-mongo-* that legitimately run things for more than 48 hours, so I'm not sure if a limit should be enforced system-wide in Evergreen. |
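For illustration only, the two options discussed above (erroring out at generation time with a recommended repeat count, or reducing the count and warning loudly) could be sketched roughly as follows. This is not the committed fix: the function names, the `strict` flag, and the idea of estimating from an average per-iteration runtime are all assumptions made for this sketch, and the 48 hour cap is taken from the discussion in this ticket.

```python
import sys
from datetime import timedelta

# Assumed cap, taken from the 48 hour figure discussed in this ticket.
MAX_RUNTIME = timedelta(hours=48)


class RepeatLimitExceeded(ValueError):
    """Raised when the requested repeat count would exceed the runtime cap."""


def max_repeats(avg_iteration_runtime, max_runtime=MAX_RUNTIME):
    """Return the largest repeat count expected to fit within max_runtime.

    avg_iteration_runtime is a hypothetical estimate of how long one pass
    of the suite takes (for example, derived from historic task runtimes).
    """
    if avg_iteration_runtime <= timedelta(0):
        raise ValueError("avg_iteration_runtime must be positive")
    # timedelta // timedelta yields an int; always allow at least one run.
    return max(1, max_runtime // avg_iteration_runtime)


def cap_repeats(requested, avg_iteration_runtime, strict=False):
    """Apply the runtime cap to a requested repeat count.

    strict=True  -> error out early and recommend a smaller number.
    strict=False -> reduce the count and warn on stderr.
    """
    limit = max_repeats(avg_iteration_runtime)
    if requested <= limit:
        return requested
    message = (
        f"Requested {requested} repeats would exceed the {MAX_RUNTIME} "
        f"runtime cap; the maximum recommended repeat count is {limit}."
    )
    if strict:
        raise RepeatLimitExceeded(message)
    print(f"WARNING: {message} Reducing to {limit}.", file=sys.stderr)
    return limit
```

With an estimated 30 minutes per iteration, `cap_repeats(1000, timedelta(minutes=30))` would reduce the count to 96, while passing `strict=True` would raise instead and report 96 as the recommended maximum.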
| Comment by David Bradford (Inactive) [ 18/Nov/19 ] |
|
I think there should be some investigation of what we actually want to do here. Do we want to limit by runtime, limit by number of iterations, or something else? If the user is going to hit the limit, what should we do: fail the task, run the task up to the limit and succeed, or something else? Should we allow users some way to override the limit in certain cases? We should also chat with the Evergreen team to see if they have any input as to what the limitation should be. |
| Comment by Robert Guo (Inactive) [ 01/Nov/19 ] |
|
This one: https://evergreen.mongodb.com/version/5dae11857742ae6afd907006 |
| Comment by David Bradford (Inactive) [ 01/Nov/19 ] |
|
robert.guo Do you have an example of a patch where this happened you could link to? |