[SERVER-75862] find root cause of system unresponsive live process hang_analyzer Created: 07/Apr/23  Updated: 01/Nov/23  Resolved: 19/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Daniel Moody Assignee: Daniel Moody
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-75531 EVG tasks wrongly marked as system un... Closed
related to SERVER-75860 temporarily disable live process hang... Closed
related to SERVER-76312 Complete TODO listed in SERVER-75862 Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

The live process dumper was disabled because it caused the system to become unresponsive and the loss of the Evergreen agent. At the time of writing, the prime suspect is that GDB is causing an OOM situation which is affecting the Evergreen agent, most likely related to the OOM killer picking the Evergreen agent for some reason.


We need hard evidence that this is an OOM situation and that the OOM killer is what is making the system unresponsive, or we need to find the real culprit.
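
To gather that evidence, a minimal monitoring helper along the following lines could be run on the host during the task. This is an illustrative sketch only, not part of the hang analyzer; it assumes a Linux host where dmesg is readable by the current user. It records any kernel OOM-killer messages plus a snapshot of memory pressure so the task log would show whether the box really hit an OOM condition.

#!/usr/bin/env python3
"""Illustrative helper: dump kernel OOM-killer messages and current memory
pressure so the task log shows whether the host actually hit OOM."""

import subprocess


def kernel_oom_events():
    """Return kernel log lines that look like OOM-killer activity.

    Assumes dmesg is readable by the current user; on restricted hosts this
    would need elevated privileges or journalctl -k instead.
    """
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    markers = ("oom-killer", "out of memory", "killed process")
    return [line for line in result.stdout.splitlines()
            if any(marker in line.lower() for marker in markers)]


def memory_snapshot():
    """Read a few headline fields (values in kB) from /proc/meminfo."""
    wanted = {"MemTotal", "MemFree", "MemAvailable", "SwapFree"}
    snapshot = {}
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            key, value = line.split(":", 1)
            if key in wanted:
                snapshot[key] = int(value.split()[0])
    return snapshot


if __name__ == "__main__":
    print("memory (kB):", memory_snapshot())
    events = kernel_oom_events()
    print(f"{len(events)} OOM-killer related kernel log lines found")
    for event in events:
        print(event)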


 Comments   
Comment by Daniel Moody [ 19/Apr/23 ]

Closing this because the root cause was verified to be OOM. A workaround is in place and the solution will be implemented in the project this ticket has been added to.

Comment by Daniel Moody [ 13/Apr/23 ]

Some other notes:

1. I did not see the GDB or agent processes actually get killed, so I don't think the OOM killer was actually doing anything here. I believe the system became unresponsive from Evergreen's perspective because the system was out of memory and the agent was not able to execute in a timely manner to respond or heartbeat to Evergreen.

2. I then saw that, on a restart of the system-unresponsive task, the same GDB process from the previous run was still there, taking up a considerable amount of memory.

Comment by Daniel Moody [ 11/Apr/23 ]

After a lot of testing while attempting to capture evidence in the Evergreen logs showing that the system was indeed operating at 100 percent memory usage, I switched to SSHing into the host during the task and monitoring the output of top via "top -c -b".

I found that in the last moments before the agent was lost, there were two gdb live analysis processes. The combination of the two is enough to quickly put the system at 100% memory usage, and then several seconds later the host is lost.
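
For reference, a rough stand-in for that manual ssh-plus-top loop (an illustrative sketch only; the actual observation was made with "top -c -b" over ssh) could poll overall memory usage and the RSS of each gdb process from /proc once a second, so the run-up to 100% memory usage and the presence of two gdb processes would be captured in a log:

#!/usr/bin/env python3
"""Illustrative monitor: print overall memory usage and per-process gdb RSS
once a second, approximating what watching "top -c -b" over ssh showed."""

import os
import time


def mem_used_percent():
    """Percentage of memory in use, computed from MemTotal and MemAvailable."""
    fields = {}
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])
    total = fields["MemTotal"]
    available = fields.get("MemAvailable", fields["MemFree"])
    return 100.0 * (total - available) / total


def gdb_rss_kb():
    """Map pid -> resident set size (kB) for every process whose comm is gdb."""
    rss = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as comm_file:
                if comm_file.read().strip() != "gdb":
                    continue
            with open(f"/proc/{entry}/status") as status_file:
                for line in status_file:
                    if line.startswith("VmRSS:"):
                        rss[int(entry)] = int(line.split()[1])
                        break
        except OSError:
            continue  # process exited between listing and reading it
    return rss


if __name__ == "__main__":
    while True:
        print(f"mem used: {mem_used_percent():5.1f}%  gdb rss (kB): {gdb_rss_kb()}",
              flush=True)
        time.sleep(1)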
