Core Server / SERVER-48698

burn_in_tests can exceed time budget and cause task timeout


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Backlog
    • Component/s: Testing Infrastructure
    • Labels:
      None
    • Operating System:
      ALL

      Description

      In a patch build, we observed a task timeout even though everything appeared to complete successfully. The task logs first show burn_in_tests deciding to run the test one more time:

      [2020/06/10 03:51:27.320] [executor:multi_stmt_txn_passthrough:job0] 2020-06-10T03:51:27.320+0000 Requeueing test jstests/core/command_let_variables.js 60 of (2/1000/600.00 min/max/time), cumulative time elapsed 588.15 

      The previous run took only about 2 seconds (588.15 - 586.06 = 2.09):

      [2020/06/10 03:51:25.222] [executor:multi_stmt_txn_passthrough:job0] 2020-06-10T03:51:25.222+0000 Requeueing test jstests/core/command_let_variables.js 59 of (2/1000/600.00 min/max/time), cumulative time elapsed 586.06 
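      To make the requeue decision concrete, here is a minimal sketch of how the "(2/1000/600.00 min/max/time)" triple might be interpreted: at least 2 runs, at most 1000 runs, and a 600-second cumulative budget. This is an assumption about the logic, not the actual resmoke/burn_in_tests code; RepeatBudget and should_requeue are hypothetical names. The key property it illustrates is that the check only looks at time already spent, so a run that starts at 588 seconds is allowed even if it (or a hook it triggers) runs for minutes.

```python
from dataclasses import dataclass


@dataclass
class RepeatBudget:
    """Hypothetical model of the (min/max/time) triple printed in the log."""
    min_runs: int      # minimum number of executions
    max_runs: int      # maximum number of executions
    max_secs: float    # cumulative time budget in seconds


def should_requeue(budget: RepeatBudget, runs_done: int, elapsed_secs: float) -> bool:
    """Return True if another execution of the test should be queued (assumed logic)."""
    # The minimum run count is honored regardless of elapsed time.
    if runs_done < budget.min_runs:
        return True
    # The maximum run count is a hard cap.
    if runs_done >= budget.max_runs:
        return False
    # Otherwise, requeue only while cumulative time is still under budget.
    # Nothing here estimates how long the *next* run, or a hook it
    # triggers, will take.
    return elapsed_secs < budget.max_secs


budget = RepeatBudget(min_runs=2, max_runs=1000, max_secs=600.0)
print(should_requeue(budget, 60, 588.15))   # True: 588.15 s < 600 s
print(should_requeue(budget, 61, 811.74))   # False: budget already exceeded
```

      Under this model, the 60th requeue at 588.15 seconds is legitimate; the problem is only what that final run costs.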

      The final run then triggers the CleanEveryN hook, which restarts the cluster. Unfortunately, this takes a very long time:

      [2020/06/10 03:55:12.948] [executor:multi_stmt_txn_passthrough:job0] 2020-06-10T03:55:12.948+0000 command_let_variables:CleanEveryN ran in 223.59 seconds: no failures detected.
      [2020/06/10 03:55:12.952] [executor:multi_stmt_txn_passthrough:job0] 2020-06-10T03:55:12.952+0000 Running job0_fixture_teardown...
      [2020/06/10 03:55:12.990] [executor:multi_stmt_txn_passthrough:job0] 2020-06-10T03:55:12.989+0000 Writing output of job0_fixture_teardown to https://logkeeper.mongodb.org/build/d317c125e63d30f0f698745c7d9acd82/test/5ee059a154f2480b9575c692.
      [2020/06/10 03:56:31.414] Command stopped early: context canceled 

      I believe the last line is where Evergreen declared the timeout and ran the hang analyzer. CleanEveryN ran for 223.59 seconds (about 4 minutes), which brings the cumulative time to roughly 812 seconds, about 3.5 minutes over the 600-second budget that burn_in_tests allocated. The task timeout appears to be 17 minutes, so I suspect we need to either increase the task timeout or have burn_in_tests abort the last iteration of the test if it exceeds the time budget.
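      The overrun arithmetic, and a variant of the second option, can be sketched as follows. Rather than aborting an iteration that is already running, this hypothetical check declines to start one whose pessimistic cost (last test duration plus the slowest hook seen so far) would overshoot the budget. projected_overrun and should_start_next_run are illustrative names, not existing burn_in_tests functions, and the cost model is an assumption.

```python
def projected_overrun(elapsed_secs: float, next_run_secs: float, budget_secs: float) -> float:
    """Seconds by which finishing the next run would overshoot the budget (0.0 if it fits)."""
    return max(0.0, elapsed_secs + next_run_secs - budget_secs)


def should_start_next_run(elapsed_secs: float, last_test_secs: float,
                          worst_hook_secs: float, budget_secs: float) -> bool:
    """Hypothetical check: start another run only if its pessimistic cost still fits."""
    # Assume the next run could trigger the slowest hook observed so far.
    estimate = last_test_secs + worst_hook_secs
    return projected_overrun(elapsed_secs, estimate, budget_secs) == 0.0


# Figures taken from the log excerpts above.
overrun = projected_overrun(588.15, 223.59, 600.0)
print(f"{overrun:.2f} s ({overrun / 60:.1f} min) over budget")  # 211.74 s (3.5 min) over budget
print(should_start_next_run(588.15, 2.09, 223.59, 600.0))       # False
```

      One caveat with this approach: a hook's duration is only known after it has fired at least once, so the first slow CleanEveryN restart could not be anticipated this way without some fixed allowance, which would itself be a tuning assumption.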


            People

            Assignee:
            backlog-server-dag Backlog - Decision Automation Group (DAG)
            Reporter:
            charlie.swanson Charlie Swanson
            Participants:
            Votes:
            0
            Watchers:
            7

              Dates

              Created:
              Updated: