[SERVER-48698] burn_in_tests can exceed time budget and cause task timeout Created: 10/Jun/20  Updated: 06/Dec/22  Resolved: 29/Mar/21

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Charlie Swanson Assignee: [DO NOT ASSIGN] Backlog - Decision Automation Group (DAG) (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-53058 Better account of CleanEveryN runtime... Closed
Assigned Teams:
Decision Automation Group
Operating System: ALL
Participants:

 Description   

In a patch build we observed a task timeout despite everything seemingly completing successfully. In the task logs we can see first that burn_in_tests decides to run the test one more time:

[2020/06/10 03:51:27.320] [executor:multi_stmt_txn_passthrough:job0] 2020-06-10T03:51:27.320+0000 Requeueing test jstests/core/command_let_variables.js 60 of (2/1000/600.00 min/max/time), cumulative time elapsed 588.15 

You can see the previous run took only ~2 seconds (588 - 586):

[2020/06/10 03:51:25.222] [executor:multi_stmt_txn_passthrough:job0] 2020-06-10T03:51:25.222+0000 Requeueing test jstests/core/command_let_variables.js 59 of (2/1000/600.00 min/max/time), cumulative time elapsed 586.06 

Then the latest run hits "CleanEveryN" and decides to restart the cluster. This unfortunately takes a very long time:

[2020/06/10 03:55:12.948] [executor:multi_stmt_txn_passthrough:job0] 2020-06-10T03:55:12.948+0000 command_let_variables:CleanEveryN ran in 223.59 seconds: no failures detected.
 [2020/06/10 03:55:12.952] [executor:multi_stmt_txn_passthrough:job0] 2020-06-10T03:55:12.952+0000 Running job0_fixture_teardown...
 [2020/06/10 03:55:12.990] [executor:multi_stmt_txn_passthrough:job0] 2020-06-10T03:55:12.989+0000 Writing output of job0_fixture_teardown to https://logkeeper.mongodb.org/build/d317c125e63d30f0f698745c7d9acd82/test/5ee059a154f2480b9575c692.
 [2020/06/10 03:56:31.414] Command stopped early: context canceled 

I believe that last line is where evergreen decided it was a timeout and runs the hang analyzer. CleanEveryN ran in 223.59 seconds or approximately 4 minutes, which brings it to approximately 3.5 minutes over the budget of 600s that burn_in_tests allocated. The task timeout appears to be 17 minutes, so I suspect that we need to either increase the task timeout or have burn_in_tests abort the last iteration of the test if it exceeds the time budget.



 Comments   
Comment by David Bradford (Inactive) [ 29/Mar/21 ]

I believe we ended up fixing this in SERVER-53058. If anyone is still seeing this issue in patch builds, please reopen this ticket.

Comment by Vlad Rachev (Inactive) [ 18/Jun/20 ]

I ran into a burn_in_test timeout as well. Patch build

Looking at the task logs it appears that the timeout begins ~3 minutes after job0_fixture_teardown begins. Teardown finishes successfully shortly after. In another build variant in the same patch build the burn_in_tests succeed without timing out, I suspect due to to teardown finishing slightly faster, before the timeout would be hit.

Is there a way for burn_in_test generation to consider the execution time of teardown tasks when deciding how many tasks to schedule? If not, I think a bigger buffer should be given after tests complete before timing out to allow teardown to finish. 

Comment by David Bradford (Inactive) [ 11/Jun/20 ]

Looking at the logs for this execution, CleanEveryN is setting n to 20, which means it will run every 20th test execution. It gets executed 3 times and each time it runs for over 200 seconds (over 3 minutes). Which means we are spending a lot of time on CleanEveryN. We should probably investigate if there are better ways of handling it.

Generated at Thu Feb 08 05:17:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.