[SERVER-56528] Hang analyzer diagnostics not being collected for C++ unit tests Created: 30/Apr/21  Updated: 29/Oct/23  Resolved: 24/May/21

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 5.0.0-rc0, 5.1.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Max Hirschhorn Assignee: Richard Samuels (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-57637 Hang analyzer diagnostics continuing ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Sprint: STM 2021-05-31
Participants:
Linked BF Score: 161
Story Points: 0

 Description   

I'd expect the gdb output to be present in the task logs, as well as core dumps, unit test binaries, and debug symbols to be uploaded to S3. Note that the latter three are present for sigaltstack_location_test, but that wasn't the C++ test running at the time of the timeout.

Some example cases:



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Max Hirschhorn [ 11/Jun/21 ]

richard.samuels, SERVER-56528 cannot be reopened due to 5.0.0-rc0 already having been released. I filed SERVER-57637 as a continuation.

Comment by Richard Samuels (Inactive) [ 24/May/21 ]

With the two changes + backports, we believe we've fixed this issue in master/v5.0. If it reoccurs, please reopen with examples from after May 21st.

The underlying cause was a bug in the debug symbol fetcher that was causing the hang analyzer to hang.

Comment by Githook User [ 21/May/21 ]

Author:

{'name': 'Robert Guo', 'email': 'robert.guo@mongodb.com'}

Message: SERVER-56528 fix debug symbol downloader
Branch: v5.0
https://github.com/mongodb/mongo/commit/432ddd739359a4cb2ce639d9b0df4fbdb4f13623

Comment by Githook User [ 21/May/21 ]

Author:

{'name': 'Robert Guo', 'email': 'robert.guo@mongodb.com'}

Message: SERVER-56528 fix debug symbol downloader
Branch: master
https://github.com/mongodb/mongo/commit/e500e9d4fcdf014acbab2a5fc9bd81df002c28ae

Comment by Githook User [ 20/May/21 ]

Author:

{'name': 'Richard Samuels', 'email': 'richard.l.samuels@gmail.com', 'username': 'richardsamuels'}

Message: SERVER-56528 hang analyzer should always run when supplied a pid

(cherry picked from commit b2802257c7cd2cf253847d67da5ddcc780a5b85f)
Branch: v5.0
https://github.com/mongodb/mongo/commit/dd8727187e605733f9bce7f4de70a675a7dc9648

Comment by Githook User [ 20/May/21 ]

Author:

{'name': 'Richard Samuels', 'email': 'richard.l.samuels@gmail.com', 'username': 'richardsamuels'}

Message: SERVER-56528 hang analyzer should always run when supplied a pid
Branch: master
https://github.com/mongodb/mongo/commit/b2802257c7cd2cf253847d67da5ddcc780a5b85f

Comment by Richard Samuels (Inactive) [ 20/May/21 ]

TLDR: not fixed yet, but we found another bug in the hang analyzer. We'll merge in a fix for that, and keep looking at this.

 

We don't have a reliable replicator for this, so I've been reading code, running some toy experiments, and trying to figure out what happened. Here are my thoughts:

Not the OOM Killer: We had a suspicion that a process got OOM killed and prevented core dumps, but this wasn't the case.

  1. There were no references to the OOM killer in the system logs.
  2. I tried an experiment where I had a C program trigger the OOM killer, and a Python script launch that C program via subprocess.Popen (a simulation of what resmoke does). I was able to trigger the hang analyzer against that program.

As a part of that experiment, I deleted the "interesting processes" check from process_list.py to get it to run against my program. Turns out there is a bug here: Historically we would run the hang analyzer against any pid it was supplied with. In the changes made to process_list.py in 9fcca8, get_processes began to also compare the process names of the supplied pids with the list of interesting processes, and exclude any processes that didn't match. This means that some processes can be excluded, which would result in missing core dumps.

While this is absolutely a bug, it's not the issue affecting the tasks everyone has cited above. The affected processes were named *_test or were mongod/mongos instances, all of which match the interesting processes list. The fix is a net positive, so we'll merge it in.
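
To illustrate the intended behavior, here is a minimal sketch and not the actual process_list.py code. The function name select_processes, its parameters, the substring-based name match, and the sample data are all hypothetical; only the rule itself (explicitly supplied pids are always kept, and the interesting-processes filter applies only when no pids are supplied) comes from the description above.

```python
from typing import Iterable, List, Optional, Tuple

# (pid, process name) pairs; a stand-in for whatever the real process listing returns.
ProcessInfo = Tuple[int, str]


def select_processes(
    all_processes: Iterable[ProcessInfo],
    interesting_names: Iterable[str],
    supplied_pids: Optional[Iterable[int]] = None,
) -> List[ProcessInfo]:
    """Hypothetical helper: choose which processes the analyzer should inspect."""
    wanted = set(supplied_pids or ())
    if wanted:
        # Explicit pids win: never drop them because of their process name.
        return [(pid, name) for pid, name in all_processes if pid in wanted]

    names = list(interesting_names)
    # No pids given: fall back to matching on interesting process names.
    # Substring matching is an assumption made for this sketch.
    return [
        (pid, name)
        for pid, name in all_processes
        if any(candidate in name for candidate in names)
    ]
```

With this shape, a process such as a custom test program is still analyzed when its pid is passed in, even though its name is not on the interesting list.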

 

There is one other unexplained problem: in the tasks linked here, the hang analyzer appears to hang or quit at some point. We observe the root level hang analyzer being triggered, which sends SIGUSR1 to the resmoke processes. This in turn triggers the hang analyzer on the resmoke child processes. All correct so far.
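
The signal chain described above can be sketched roughly as follows. This is an illustration only, not resmoke's implementation: install_sigusr1_handler and the analyze callback are hypothetical names, and the sketch assumes a Unix-like system where SIGUSR1 exists and that the handler is installed from the main thread.

```python
import signal
from typing import Callable, List


def install_sigusr1_handler(
    child_pids: List[int],
    analyze: Callable[[List[int]], None],
) -> None:
    """Hypothetical: on SIGUSR1, hand the tracked child pids to `analyze`."""

    def _handler(signum, frame):
        # Pass a copy so later changes to child_pids do not affect this run.
        analyze(list(child_pids))

    # Must be called from the main thread of the process.
    signal.signal(signal.SIGUSR1, _handler)
```

In this picture, the root level hang analyzer plays the role of the sender (for example via os.kill(resmoke_pid, signal.SIGUSR1)), and analyze stands in for the "inner" hang analyzer run against the child processes.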

I would expect the "inner" runs of the hang analyzer (i.e. those initiated by resmoke in response to SIGUSR1) to print out "Found [n] interesting processes" (even if n is 0), but I never see that being printed, aside from the root level invocation of the hang analyzer. The last thing printed by the resmoke-initiated hang analyzer instances is "Cannot determine Unix Current Login". This indicates that those instances of the hang analyzer get far enough to call _log_system_info but do not finish get_processes. The cause of this remains unknown, and I'll continue looking at this.

CR to fix the bug mentioned above: https://mongodbcr.appspot.com/789050001/

Comment by Richard Samuels (Inactive) [ 03/May/21 ]

I suspect that this issue is affecting more tasks than just run_unittests, but it is especially visible on this task because unittests are only collected if the core dump exists, and are thus missing from the tarball.

Per some discussion about this issue earlier today, we know the hang analyzer is being called, but for some reason it isn't dumping cores. I've observed this happen on both Windows and Linux distributions (x86_64), and on at least one Linux ARM variant.

Generated at Thu Feb 08 05:39:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.