[SERVER-65493] Ensure the commit queue does not get stuck Created: 12/Apr/22  Updated: 27/Oct/23  Resolved: 13/Mar/23

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Robert Guo (Inactive) Assignee: Robert Guo (Inactive)
Resolution: Gone away Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Sprint: DAG 2022-05-16
Participants:

 Description   

The commit queue should not get stuck, regardless of the state of the tests or any host issues.

Generate a list of pain points and paths forward for each one as part of this ticket.

Notes from Slack

  • OOM unittests caused the agent to be killed, which caused the task to restart
  • 2 commit queue versions running concurrently, not 3
  • lint running on rhel80-small, not the dedicated CQ variant

Solution spaces:

  • More timely reverts
  • Monitor runtime (see the sketch after this list)
  • Bump min host count for the commit queue variant
  • More validation pre-commit queue
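
As one possible shape for the "monitor runtime" idea, below is a minimal sketch that polls for in-flight commit queue versions and flags any that have exceeded a runtime threshold. The host, project id, endpoint path, and response fields (queue, start_time, issue) are illustrative assumptions for this sketch, not the actual Evergreen REST API.

```python
# Hypothetical sketch of the "monitor runtime" idea: flag commit queue
# versions that have been in flight longer than a threshold.
# ASSUMPTIONS: the host, endpoint path, project id, and response fields
# below are illustrative placeholders, not the real Evergreen REST API.
from datetime import datetime, timedelta, timezone

import requests

EVERGREEN_API = "https://evergreen.example.com/rest/v2"  # placeholder host
PROJECT_ID = "mongodb-mongo-master"                       # example project
MAX_RUNTIME = timedelta(hours=3)                          # alert threshold


def find_stuck_entries():
    """Return commit queue entries that have exceeded MAX_RUNTIME."""
    # Assumed endpoint returning the in-flight commit queue entries.
    resp = requests.get(f"{EVERGREEN_API}/commit_queue/{PROJECT_ID}", timeout=30)
    resp.raise_for_status()
    now = datetime.now(timezone.utc)
    stuck = []
    for entry in resp.json().get("queue", []):
        # start_time is assumed to be ISO-8601 with a UTC offset.
        started = datetime.fromisoformat(entry["start_time"])
        if now - started > MAX_RUNTIME:
            stuck.append((entry["issue"], now - started))
    return stuck


if __name__ == "__main__":
    for issue, age in find_stuck_entries():
        print(f"commit queue entry {issue} has been running for {age}; "
              "consider dequeuing or restarting it")
```

A job like this could run on a schedule and page (or auto-dequeue) whenever an entry crosses the threshold, which would cover the "stuck indefinitely" cases described below.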


 Comments   
Comment by Jeffrey Zambory [ 13/Mar/23 ]

Closing this as the issue seems to have gone away - please comment or reach out to me if you think this is still relevant!

Comment by Jeffrey Zambory [ 01/Feb/23 ]

I just synced with annie.black@mongodb.com on this ticket. The Evergreen team has worked on the commit queue recently and has made several improvements to the experience of using it. The issues described in this ticket should likely be addressed by those improvements, so I'm planning on closing it.

Does anyone feel otherwise, i.e. that there is still value in keeping this ticket open? Are there still deliverables around the commit queue experience that we want to look into?

Comment by Robert Guo (Inactive) [ 15/Sep/22 ]

Giving this a bump, as there was another occurrence today where high memory usage caused the Evergreen agent to be killed.

[2022/09/15 16:54:26.208] /data/mci/832be73c4e174ba13e85839811cf5e32/toolchain-builder/tmp/build-gdb.sh-bKV/src/gdb-8.3.1/gdb/utils.c:724: internal-error: virtual memory exhausted: can't allocate 33554523 bytes.
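
One mitigation for this OOM path (a hypothetical sketch, not something the ticket prescribes): cap the address space of the test process itself so a runaway allocation fails that process instead of exhausting the host and getting the agent OOM-killed. A minimal Linux sketch, assuming the tests are launched through a Python wrapper; the binary path and the 8 GiB limit are placeholders.

```python
# Sketch: run a unittest binary under an address-space cap so a runaway
# allocation fails the test process instead of exhausting the host and
# getting the Evergreen agent OOM-killed.
# ASSUMPTIONS: Linux; the test command and the limit below are placeholders.
import resource
import subprocess
import sys

MEM_LIMIT_BYTES = 8 * 1024 ** 3  # 8 GiB cap per test process (illustrative)


def limit_memory():
    """Runs in the forked child before exec; caps its virtual address space."""
    resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT_BYTES, MEM_LIMIT_BYTES))


def run_capped(cmd):
    # preexec_fn executes in the child only, so the cap applies to the test
    # process and not to the agent that launched it.
    return subprocess.run(cmd, preexec_fn=limit_memory).returncode


if __name__ == "__main__":
    sys.exit(run_capped(["build/install/bin/run_unittests"]))  # placeholder path
```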

Comment by Varun Ravichandran [ 12/Apr/22 ]

To add some context around this:

2 commits were running the run_unittests task simultaneously, having passed all other tasks thus far. The top commit on the stack had a bug that caused it to fail this task, while the second commit was bug-free. For some reason (presumably the OOM unittests), the buggy commit did not fail but instead kept retrying the task repeatedly. Meanwhile, the second commit ended up hanging indefinitely and eventually timed out on the task, but was not kicked off the commit queue. Since the commit queue was only running 2 versions simultaneously, all remaining commits on the queue remained blocked until both commits were manually removed from the queue.

I'm curious whether the OOM was responsible for all of this. Did the task restarts prevent the timeout from firing? Why would the failed agent prevent the second commit from being kicked off the queue once that commit failed?

We should ensure that, even when commits are run in parallel, the status of one commit does not affect the others.

I'm also curious whether the high volume of commits yesterday contributed to this and, if so, how we can mitigate those types of situations.
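
To make the blocking behavior concrete, here is a toy model (not Evergreen code) of a queue that merges at most two entries at a time: any in-flight entry that reaches a terminal state but is never dequeued keeps its slot occupied, so everything behind it stalls until the entry is removed by hand.

```python
# Toy model (not Evergreen code) of a commit queue with 2 concurrent slots.
# It illustrates the failure mode above: entries that finish (failed or
# timed out) but are never removed keep their slots occupied, so the rest
# of the queue cannot advance.
from collections import deque

MAX_CONCURRENT = 2


def advance(queue, in_flight):
    """Fill free slots from the queue; return the entries newly started."""
    started = []
    while len(in_flight) < MAX_CONCURRENT and queue:
        entry = queue.popleft()
        in_flight.append(entry)
        started.append(entry)
    return started


queue = deque(["commit-A (buggy)", "commit-B", "commit-C", "commit-D"])
in_flight = []

print("started:", advance(queue, in_flight))  # A and B take both slots

# commit-A keeps retrying and commit-B times out, but neither is dequeued,
# so in_flight never shrinks and no further entries can start.
print("started:", advance(queue, in_flight))  # [] -- the queue is stuck

# Only once the stuck entries are explicitly removed does the queue move again.
in_flight.clear()
print("started:", advance(queue, in_flight))  # C and D finally start
```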
