[SERVER-75862] find root cause of system unresponsive live process hang_analyzer Created: 07/Apr/23 Updated: 01/Nov/23 Resolved: 19/Apr/23
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Daniel Moody | Assignee: | Daniel Moody |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Description |
The live process dumper was disabled because it caused the system to become unresponsive and caused the loss of the Evergreen agent. At the time of writing, the prime suspect is that GDB is creating an OOM situation that affects the Evergreen agent, most likely related to the OOM killer picking the Evergreen agent for some reason.

We need hard evidence that this is an OOM situation and that the OOM killer is what is making the system unresponsive, or we need to find the real culprit.
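The sort of hard evidence called for here would normally come from the kernel log (OOM-killer messages) plus live memory sampling. Below is a minimal evidence-gathering sketch, assuming a Linux host where the kernel ring buffer is readable via `dmesg` (this may require root); the function names are illustrative and not part of the hang analyzer:

```python
#!/usr/bin/env python3
"""Sketch: look for OOM-killer activity and sample current memory usage."""
import re
import subprocess


def kernel_oom_events():
    """Return kernel log lines that mention the OOM killer."""
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    pattern = re.compile(r"out of memory|oom-killer|oom_reaper", re.IGNORECASE)
    return [line for line in out.splitlines() if pattern.search(line)]


def memory_usage_percent():
    """Compute used-memory percentage from /proc/meminfo (no external deps)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are reported in kB
    used = info["MemTotal"] - info["MemAvailable"]
    return 100.0 * used / info["MemTotal"]


if __name__ == "__main__":
    print(f"memory usage: {memory_usage_percent():.1f}%")
    for event in kernel_oom_events():
        print(event)
```

An actual OOM kill leaves a "Killed process ..." line in the kernel log, so the absence of such a line would be consistent with the later finding that no process was actually reaped.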
| Comments |
| Comment by Daniel Moody [ 19/Apr/23 ] |
Closing this because the root cause was verified to be OOM. A workaround is in place, and the solution will be implemented in the project this ticket has been added to.
| Comment by Daniel Moody [ 13/Apr/23 ] |
Some other notes:

1. I did not see the GDB or agent processes actually get killed, so I don't think the OOM killer was actually doing anything here. I believe the system became unresponsive from Evergreen's perspective because the system was out of memory and the agent was unable to execute in a timely manner to respond or heartbeat to Evergreen.
2. On a restart of the system-unresponsive task, I then saw the same GDB process from the previous run still taking up a considerable amount of memory.
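Note 2 could be confirmed mechanically by listing surviving gdb processes with their start times and resident memory. A minimal sketch, assuming a Linux host with a procps-style `ps`; the helper name is hypothetical:

```python
#!/usr/bin/env python3
"""Sketch: find gdb processes left over from a previous task run."""
import subprocess


def running_gdb_processes():
    """List gdb processes with start time and resident set size."""
    out = subprocess.run(
        ["ps", "-eo", "pid,lstart,rss,comm", "--no-headers"],
        capture_output=True, text=True, check=True).stdout
    procs = []
    for line in out.splitlines():
        fields = line.split()
        # lstart expands to five tokens, e.g. "Thu Apr 13 10:00:00 2023"
        pid, started, rss = fields[0], " ".join(fields[1:6]), fields[6]
        comm = " ".join(fields[7:])
        if comm == "gdb":
            procs.append((int(pid), started, int(rss) / 1024))  # rss is in kB
    return procs


if __name__ == "__main__":
    for pid, started, rss_mib in running_gdb_processes():
        print(f"pid {pid} started {started} rss {rss_mib:.0f} MiB")
```

A gdb process whose start time predates the current task would be the smoking gun for note 2.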
| Comment by Daniel Moody [ 11/Apr/23 ] |
After a lot of testing attempting to capture evidence in the Evergreen logs showing that the system was indeed operating at 100 percent memory usage, I switched to ssh'ing into the host during the task and monitoring the output of top via "top -c -b". In the last moments before the agent was lost, I found that two gdb live-analysis processes existed. The combination of the two is enough to quickly put the system at 100% memory usage, and several seconds later the host is lost.
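For reference, the manual monitoring described above could be scripted roughly as follows; this is a sketch only, with a placeholder host name and log path rather than the actual Evergreen host:

```python
#!/usr/bin/env python3
"""Sketch: stream batch-mode top output from the task host over ssh."""
import subprocess

HOST = "ec2-user@task-host.example.com"  # placeholder, not the real host
LOGFILE = "top-samples.log"              # placeholder local capture file


def monitor(samples=120, interval=1):
    """Run `top -c -b` remotely, log every line, echo the memory summary."""
    cmd = ["ssh", HOST, f"top -c -b -d {interval} -n {samples}"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(LOGFILE, "w") as log:
        for line in proc.stdout:
            log.write(line)
            # top's summary line looks like "MiB Mem :  15886.2 total, ..."
            if line.startswith(("MiB Mem", "KiB Mem")):
                print(line.rstrip())


if __name__ == "__main__":
    monitor()
```

Because the agent is lost once memory hits 100%, capturing the samples to a local log over ssh preserves the evidence even after the host drops.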