[SERVER-56528] Hang analyzer diagnostics not being collected for C++ unit tests Created: 30/Apr/21 Updated: 29/Oct/23 Resolved: 24/May/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 5.0.0-rc0, 5.1.0-rc0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Max Hirschhorn | Assignee: | Richard Samuels (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Backport Requested: |
v5.0
|
||||||||||||
| Sprint: | STM 2021-05-31 | ||||||||||||
| Participants: | |||||||||||||
| Linked BF Score: | 161 | ||||||||||||
| Story Points: | 0 | ||||||||||||
| Description |
|
I'd expect the gdb output to be present in the task logs, and the core dumps, unit test binaries, and debug symbols to be uploaded to S3. Note that the latter three are present for sigaltstack_location_test, but that wasn't the C++ test running at the time of the timeout. Some example cases:
|
| Comments |
| Comment by Vivian Ge (Inactive) [ 06/Oct/21 ] |
|
Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you! |
| Comment by Max Hirschhorn [ 11/Jun/21 ] |
|
richard.samuels, |
| Comment by Richard Samuels (Inactive) [ 24/May/21 ] |
|
With the two changes + backports, we believe we've fixed this issue in master/v5.0. If it reoccurs, please reopen with examples from after May 21st. The underlying cause was a bug in the debug symbol fetcher that caused the hang analyzer itself to hang. |
| Comment by Githook User [ 21/May/21 ] |
|
Author: {'name': 'Robert Guo', 'email': 'robert.guo@mongodb.com'}Message: |
| Comment by Githook User [ 21/May/21 ] |
|
Author: {'name': 'Robert Guo', 'email': 'robert.guo@mongodb.com'}Message: |
| Comment by Githook User [ 20/May/21 ] |
|
Author: {'name': 'Richard Samuels', 'email': 'richard.l.samuels@gmail.com', 'username': 'richardsamuels'}Message: (cherry picked from commit b2802257c7cd2cf253847d67da5ddcc780a5b85f) |
| Comment by Githook User [ 20/May/21 ] |
|
Author: {'name': 'Richard Samuels', 'email': 'richard.l.samuels@gmail.com', 'username': 'richardsamuels'}Message: |
| Comment by Richard Samuels (Inactive) [ 20/May/21 ] |
|
TLDR: not fixed yet, but we found another bug in the hang analyzer. We'll merge in a fix for that, and keep looking at this.
We don't have a reliable replicator for this, so I've been reading code, running some toy experiments, and trying to figure out what happened. Here are my thoughts: Not the OOM Killer: We had a suspicion that a process got OOM killed, which prevented core dumps, but this wasn't the case.
As a part of that experiment, I deleted the "interesting processes" check from process_list.py to get it to run against my program. Turns out there is a bug here: Historically we would run the hang analyzer against any pid it was supplied with. In the changes made to process_list.py in 9fcca8, get_processes began to also compare the process names of the supplied pids with the list of interesting processes, and exclude any processes that didn't match. This means that explicitly supplied processes can be excluded, which would result in missing core dumps. While this is absolutely a bug, it's not the issue affecting the tasks everyone has cited above: the affected processes were named *_test, or were mongod/mongos instances, all of which match the interesting processes list. The fix is a net positive, so we'll merge it in.
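To make the filtering bug concrete, here is a minimal sketch of the two behaviors. It is not the real process_list.py code: select_processes, is_interesting, the filter_supplied flag, the contents of INTERESTING_PROCESSES, and the substring matching rule are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical stand-in for the hang analyzer's list of interesting process names.
INTERESTING_PROCESSES = ["mongod", "mongos", "_test"]


def is_interesting(name: str) -> bool:
    """Assumed matching rule: any listed pattern appears in the process name."""
    return any(pattern in name for pattern in INTERESTING_PROCESSES)


def select_processes(
    running: Dict[int, str],
    supplied_pids: Optional[Iterable[int]] = None,
    filter_supplied: bool = False,
) -> List[Tuple[int, str]]:
    """Pick the (pid, name) pairs to analyze from a pid -> name table."""
    if supplied_pids is None:
        # No pids given: discover processes by name.
        return sorted((pid, name) for pid, name in running.items() if is_interesting(name))

    selected = [(pid, running[pid]) for pid in supplied_pids if pid in running]
    if filter_supplied:
        # The behavior described above as the bug: explicitly supplied pids
        # are dropped when their names are not on the interesting list.
        selected = [(pid, name) for pid, name in selected if is_interesting(name)]
    return selected


running = {101: "mongod", 102: "db_unittest_test", 103: "my_toy_program"}
buggy = select_processes(running, [102, 103], filter_supplied=True)
fixed = select_processes(running, [102, 103], filter_supplied=False)
```

In this sketch, buggy silently loses pid 103 while fixed keeps every supplied pid, which is the historical behavior. It also shows why the bug doesn't explain the cited tasks: a name like db_unittest_test passes the filter either way.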
There is one other unexplained problem: in the tasks linked here, the hang analyzer appears to hang or quit at some point. We observe the root-level hang analyzer being triggered, which sends SIGUSR1 to the resmoke processes. This in turn triggers the hang analyzer on the resmoke child processes. All correct so far. I would expect the "inner" runs of the hang analyzer (i.e. those initiated by resmoke in response to SIGUSR1) to print out "Found [n] interesting processes" (even if n is 0), but I never see that being printed, aside from the root-level invocation of the hang analyzer. The last thing printed by the resmoke-initiated hang analyzer instances is "Cannot determine Unix Current Login". This indicates that those instances get far enough to call _log_system_info but do not finish get_processes. The cause of this remains unknown, and I'll continue looking at this. CR to fix the bug mentioned above: https://mongodbcr.appspot.com/789050001/ |
| Comment by Richard Samuels (Inactive) [ 03/May/21 ] |
|
I suspect that this issue is affecting more tasks than just run_unittests, but it is especially visible on this task because unit test binaries are only collected if the core dump exists, and are thus missing from the tarball. Per some discussion about this issue earlier today, we know the hang analyzer is being called, but for some reason it isn't dumping cores. I've observed this happen on both Windows and Linux distributions (x86_64), and on at least one Linux ARM variant. |