[SERVER-75531] EVG tasks wrongly marked as system unresponsive
| Created: | 31/Mar/23 | Updated: | 19/May/23 | Resolved: | 19/Apr/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Tommaso Tocci | Assignee: | Daniel Moody |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Recently I have seen an increased number of tasks marked as SYSTEM UNRESPONSIVE. It seems that in some cases those tasks were executing tests that actually failed, but the task is wrongly labeled and the real error is swallowed. For instance, the serverless suite has recently kept failing consistently with SYSTEM UNRESPONSIVE. I analyzed this task and noticed that job3, which was executing change_collection_server_stats.js (job logs), actually failed with:
However, this error is reported neither in the task logs nor in the Evergreen UI. |
| Comments |
| Comment by Daniel Moody [ 19/Apr/23 ] | |||||||
|
Closing this as the underlying issues have been spun off into different tickets/projects. For the general case of this issue we have created EVG-19607 for Evergreen to ensure the host does not become unresponsive because of bad-actor tasks.
For the specific issue that was causing the system unresponsive state in this case (the hang analyzer), we have implemented a workaround in https://jira.mongodb.org/browse/SERVER-75860 and created a project to develop a more flexible pipeline for analyzing core dumps in Evergreen: https://jira.mongodb.org/browse/PM-3334 |
| Comment by Daniel Moody [ 05/Apr/23 ] | |||||||
|
max.hirschhorn@mongodb.com I think you're right; my changes also incidentally disabled the live process analyzer completely, so that may be the sole culprit for getting OOM killed. I'll make sure to check whether just disabling the live process analyzer, while keeping the existing core dump command, still passes. |
| Comment by Max Hirschhorn [ 05/Apr/23 ] | |||||||
|
daniel.moody@mongodb.com, alexander.neben@mongodb.com, what experiment was run to confirm that the high memory usage occurs while taking core dumps? The changes to buildscripts/resmokelib/hang_analyzer/dumper.py which Daniel shared in this earlier comment would affect all gdb invocations, not only the one running the gcore command. I had added the --readnever option to the gdb invocation when only taking core dumps, to match the behavior of the gcore script in
Reusing the same gdb process across multiple mongod processes is intentional, to avoid reloading the debug symbols multiple times. I'm not sure I understand the rationale here for changing that behavior. |
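For concreteness, a minimal sketch of the kind of core-dump-only gdb invocation described here, assuming a gdb recent enough to support --readnever; the function name and output path are invented for illustration and this is not the actual dumper.py code:

```python
import subprocess

def dump_core_only(pid, output_dir="/data/debug"):
    """Attach gdb to `pid` just long enough to write a core file."""
    cmd = [
        "gdb",
        "--batch",       # run the -ex commands, then exit (no interactive prompt)
        "--readnever",   # do not read symbol files; acceptable when only gcore is run
        "-p", str(pid),  # attach to the running process
        "-ex", f"gcore {output_dir}/dump_{pid}.core",
        "-ex", "detach",
    ]
    subprocess.run(cmd, check=True)
```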
| Comment by Daniel Moody [ 04/Apr/23 ] | |||||||
|
alexander.neben@mongodb.com FYI, action 2 from above could be done by us. I believe there are ways to launch Python subprocesses that disconnect (for the most part) from the parent processes, like new process group flags or similar. Maybe it's best to figure out how to do this at the point where resmoke is launched in the evergreen YAML (some bash script, probably?). |
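A minimal sketch of the "new process group" idea using Python's standard subprocess module; the helper name and the resmoke arguments are placeholders. Note that starting a new session mainly changes signal and process-group handling; whether it changes which process the OOM killer picks is a separate question.

```python
import subprocess

def launch_detached(argv):
    """Start a child in its own session/process group, detached from the caller's."""
    return subprocess.Popen(
        argv,
        start_new_session=True,  # runs setsid() in the child: new session + process group
    )

# Hypothetical usage at the point resmoke is launched:
# launch_detached(["python3", "buildscripts/resmoke.py", "run", "--suites=..."])
```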
| Comment by Daniel Moody [ 04/Apr/23 ] | |||||||
|
Using gdb to generate the core dump is probably more expensive in terms of memory; I could do some testing to compare other options. The thing I am not sure about is whether there is anything about the way gdb generates the core dump that we specifically want, for example in the core dump command we use:
|
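The actual command is not reproduced above; purely as an illustration of the pattern being discussed (one gdb session that attaches to, dumps, and detaches from each process), a sketch could look like the following, with invented paths and helper names:

```python
import subprocess

def dump_all_in_one_gdb(pids, output_dir="/data/debug"):
    """One gdb session that attaches, gcores, and detaches for each pid in turn."""
    commands = []
    for pid in pids:
        commands += [
            "-ex", f"attach {pid}",
            "-ex", f"gcore {output_dir}/dump_{pid}.core",
            "-ex", "detach",
        ]
    # Debug symbols and any per-process state stay resident in this single
    # gdb process for the duration of the whole loop.
    subprocess.run(["gdb", "--batch"] + commands, check=True)
```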
| Comment by Alex Neben [ 04/Apr/23 ] | |||||||
|
I would love to avoid running the gdb dumps in serial. I am wondering if we can use something like https://man7.org/linux/man-pages/man1/gcore.1.html to consume less memory? zakhar.kleyman@mongodb.com, who on the Evergreen side is responsible for the agent? Could either of points 1/2 be reasonable? |
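For reference, the standalone gcore utility linked above can be driven like this (a sketch with an invented prefix path; gcore is itself a wrapper that invokes gdb, so any memory savings would need to be measured):

```python
import subprocess

def dump_with_gcore(pids, prefix="/data/debug/dump"):
    """gcore writes <prefix>.<pid> for each pid, without scripting a gdb session by hand."""
    subprocess.run(["gcore", "-o", prefix] + [str(p) for p in pids], check=True)
```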
| Comment by Daniel Moody [ 04/Apr/23 ] | |||||||
|
max.hirschhorn@mongodb.com alexander.neben@mongodb.com So just to recap the actions we can take here: 1. The OOM killer should not kill the evergreen agent, or at least a new agent should be respawned beforehand in that case, so logging and other normal operations can continue. |
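One possible mechanism for point 1 (an assumption for illustration, not something decided in this ticket) is to make the agent a less attractive OOM victim via /proc/&lt;pid&gt;/oom_score_adj; lowering the score requires suitable privileges:

```python
import os

def protect_from_oom_killer(pid, score=-1000):
    """Lower the OOM score of `pid`; -1000 exempts the process from OOM killing entirely."""
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(score))

# Hypothetical usage early in the agent process:
# protect_from_oom_killer(os.getpid())
```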
| Comment by Daniel Moody [ 04/Apr/23 ] | |||||||
|
So I did a test where I split each core dump into a separate gdb process per mongo process, and it looks like it didn't OOM or go system unresponsive with this change. NOTE: I haven't validated the core files yet. Here's the patch: https://spruce.mongodb.com/version/642b40b70ae606eb62568816/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC I suspect this might be a little slower because there are multiple gdb processes to be launched and loaded. |
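The linked patch has the real change; a rough sketch of its shape, with invented names, is simply launching one gdb per pid so each gdb exits and releases its memory before the next one starts:

```python
import subprocess

def dump_one_gdb_per_process(pids, output_dir="/data/debug"):
    """Fresh gdb per pid: slower (symbols reload each time) but bounded memory per dump."""
    for pid in pids:
        subprocess.run(
            ["gdb", "--batch",
             "-ex", f"attach {pid}",
             "-ex", f"gcore {output_dir}/dump_{pid}.core",
             "-ex", "detach"],
            check=True,
        )
```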
| Comment by Daniel Moody [ 03/Apr/23 ] | |||||||
|
Disabling the gdb calls in the hang analyzer gets rid of the system unresponsive failures, so it seems for sure that they are the culprit, and every task I have looked at showed high memory usage (>80%) before the last process list was logged, so most likely it is an OOM situation. I don't understand why the OOM killer handles other processes fine, but this gdb process causes the agent to get killed? Do we need to use the special process flags to ensure it's not related to the agent process? In any case, we should attempt to fix the dumper in the hang analyzer, because I think obtaining core dumps is pretty critical and just OOM-killing gdb is not a great solution. Looking at the current dump command, it's a single gdb command which attaches, dumps, and detaches for each process. Could doing this in a single gdb command cause the memory of each process to be left over? Are there some extra commands we need to clear the detached processes' memory? |
| Comment by Daniel Moody [ 03/Apr/23 ] | |||||||
|
It seems the test is green again; someone must have fixed the hang in the test. I will continue investigating, assuming a hang will still cause a system unresponsive failure, and try injecting a hang situation or building one of the commits that was hanging. |
| Comment by Max Hirschhorn [ 31/Mar/23 ] | |||||||
Since EVG-5419, the evergreen agent checks the dmesg ring buffer to surface OOM killer events. However, if the evergreen agent is also dying, then this message doesn't get persisted anywhere, because after losing contact the Evergreen app server will have the EC2 host terminated. Perhaps the Evergreen app server could instead contact jasper to start up another evergreen agent, or run a command through jasper to check the dmesg ring buffer contents before terminating the EC2 machine, so we'd have better diagnostics for this scenario?
The output from the OOM killer would appear in the agent logs (for example); however, the evergreen agent only monitors for OOM killer events at specific points around process cleanup. It isn't doing so continuously, so there's a fair chance the dmesg output never makes it to Cedar logging (assuming an OOM killer event is what is actually happening here). |
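For concreteness, checking the ring buffer for OOM-killer activity amounts to something like the following sketch (the real agent-side check lives in evergreen/jasper, not here; reading dmesg may require elevated privileges depending on kernel settings):

```python
import subprocess

def find_oom_kills():
    """Return dmesg lines that look like OOM-killer activity."""
    out = subprocess.run(["dmesg", "-T"], capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines()
            if "Out of memory" in line or "oom-kill" in line]
```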
| Comment by Daniel Moody [ 31/Mar/23 ] | |||||||
|
If it was the OOM killer, shouldn't we see some logs? Unless the OOM killer only logs after the kill? Should the OOM killer be made to log warnings and intents for processes about to be killed some time beforehand? |
| Comment by Alex Neben [ 31/Mar/23 ] | |||||||
|
Right before the crash we see gdb using a significant amount of memory: "80.4" means that gdb is using 80.4% of the system memory.
If we trace up the ppid list to find the spawning parent, we find the parent is the evergreen agent.
So maybe the OOM killer is going all the way up the chain here and is killing the evergreen agent? Or maybe it cannot allocate memory and is aborting, as Max suggested. Either way, both suggestions (i) and (ii) are great, with (i) being the higher priority. |
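A sketch of the "trace up the ppid list" step on Linux, assuming /proc is available; the helper name is invented:

```python
def ancestry(pid):
    """Walk PPid entries in /proc/<pid>/status up to init, collecting (pid, cmdline) pairs."""
    chain = []
    while pid > 1:
        with open(f"/proc/{pid}/status") as f:
            fields = dict(line.split(":", 1) for line in f if ":" in line)
        with open(f"/proc/{pid}/cmdline") as f:
            cmdline = f.read().replace("\0", " ").strip()
        chain.append((pid, cmdline))
        pid = int(fields["PPid"])  # int() tolerates the surrounding whitespace
    return chain
```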
| Comment by Max Hirschhorn [ 31/Mar/23 ] | |||||||
|
I'm marking this as P2 because I think it is a problem for System Unresponsive tasks to go unmonitored. A System Unresponsive task means the Evergreen app server lost contact with the evergreen agent running on the machine. Historically this was an indication that the EC2 instance was taken away. Evergreen improved reporting in such situations by adding a HOST_TERMINATED_EXTERNALLY event to the host's recent events page. Moreover, we are no longer using spot instances, where this would really happen. I investigated two independent slews of System Unresponsive tasks which were recently reported via Slack and noticed the hang analyzer was running at the time in all of the tasks. I suspect gdb is allocating a large amount of memory, and either the OOM killer is choosing the evergreen agent as its victim or the evergreen agent is aborting due to being unable to allocate memory. I would like to see (i) repeated System Unresponsive tasks start to be monitored so we don't miss ongoing test/code failures in the system, and (ii) some investigation into why the hang analyzer / gdb is triggering this condition. |