[SERVER-76739] Archive data missing on evergreen task timeouts Created: 02/May/23 Updated: 29/Oct/23 Resolved: 16/May/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0, 6.0.7, 5.0.19, 7.0.0-rc2 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Alex Neben | Assignee: | Tausif Rahman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: |
v7.0, v6.0, v5.0
|
||||||||
| Participants: | |||||||||
| Description |
|
Possibly related to Example 1: Example 2: |
| Comments |
| Comment by Tausif Rahman (Inactive) [ 25/May/23 ] |
|
Adding some slack messages here b/c some of the above comments were not 100% right & so people can understand this: Following up here after stand-up:
If (2) & (3) or more specifically 2.1. & 3.1. take more than 15 minutes, we never hit the archival block (here).
maxh In the first example https://parsley.mongodb.com/evergreen/mongodb_mongo_v7.0_enterprise_rhel_80_64_bit_con[…]12_23_04_24_16_15_19/0/task?bookmarks=0,1710&shareLine=1480 the hang analyzer had finished killed the processes and thus had returned from _hang_analyzer.execute() already. I suspect the symbolizer service logic could be where this time is getting spent. My thought would be when the sighandler.py code is called we flip a global setting which disables the symbolizer service from running because the 15-minute clock is tickingThere are definitely cases where the hang is induced because a process crashed and so the symbolizer service logic symbolizing the C++ backtrace for ease of searchability and triage is beneficial. However we'll have a core dump for the crash and so the information is knowable from there and running the symbolizer service logic requires downloading and extracting the debug symbols which like (3a) is saying can be slow |
| Comment by Githook User [ 22/May/23 ] |
|
Author: {'name': 'Tausif Rahman', 'email': 'tausif.rahman@mongodb.com', 'username': 'trahman1318'}Message: (cherry picked from commit 0249b20d05c58ceb47375413e9eb47e028859489) |
| Comment by Githook User [ 22/May/23 ] |
|
Author: {'name': 'Tausif Rahman', 'email': 'tausif.rahman@mongodb.com', 'username': 'trahman1318'}Message: (cherry picked from commit 0249b20d05c58ceb47375413e9eb47e028859489) |
| Comment by Githook User [ 18/May/23 ] |
|
Author: {'name': 'Tausif Rahman', 'email': 'tausif.rahman@mongodb.com', 'username': 'trahman1318'}Message: (cherry picked from commit 0249b20d05c58ceb47375413e9eb47e028859489) |
| Comment by Githook User [ 16/May/23 ] |
|
Author: {'name': 'Tausif Rahman', 'email': 'tausif.rahman@mongodb.com', 'username': 'trahman1318'}Message: |
| Comment by Daniel Moody [ 02/May/23 ] |
|
tausif.rahman@mongodb.com sounds good thanks for clarifying. |
| Comment by Alex Neben [ 02/May/23 ] |
I read this as this was our plan and we don't need to do an atexit? Was that wrong? |
| Comment by Tausif Rahman (Inactive) [ 02/May/23 ] |
|
Sorry daniel.moody@mongodb.com should have been more clear. At first I was considering using the atexit for this, but now I'm scratching the atexit stuff in favor of the more robust SIGUSR1 sighandler which will make SERVER-55756 easier too. |
| Comment by Daniel Moody [ 02/May/23 ] |
|
tausif.rahman@mongodb.com Are you saying use atexit for archiving the data files in this ticket, and then there is a follow on ticket to move archiving process into the SIGUSR1 sighandler? |
| Comment by Alex Neben [ 02/May/23 ] |
|
This sounds great to me! |
| Comment by Tausif Rahman (Inactive) [ 02/May/23 ] |
|
daniel.moody@mongodb.com - Some follow up: Looks like only non-python processes resume. Also, I looked into having the hang-analyzer perform the archive but archive relies on some resmoke test infrastructure (ie: whether the test passed or failed – as well as other more complex logic) which is not currently available via the hang-analyzer. Timeline of events: resmoke.py process: * Starts test suite
Next Steps:
|
| Comment by Daniel Moody [ 02/May/23 ] |
|
Currently the testing itself is the one that saves the test data files, but is that required? Does saving the data files require some test specific knowledge or can hang analyzer perform it? |
| Comment by Alex Neben [ 02/May/23 ] |
|
We can't get every single thing we need but from a high level I think the correct order of importance should be data files (especially since I am totally guessing they should be quicker), core dumps, then stack traces. |
| Comment by Daniel Moody [ 02/May/23 ] |
|
The hang analyzer will resume the python process after its done, unfortunately it ran out of time, specifically during Symbolization of stacktrace files. I don't know if atexit will help because the python processes are suspended from sigusr1, and then when the hang analyzer runs out of time, I am not sure how evergreen shuts down the current running tasks/processes? I assume its a hard kill so at exit would not run? |
| Comment by Tausif Rahman (Inactive) [ 02/May/23 ] |
|
I see that we archive data files for failling tests/hooks in this finally block. When the hang analyzer runs, it sends a SIGUSR1 to the python process & it looks like the archive in the finally block is never getting run. The core dumps still get uploaded because that is performed as part of the hang analyzer process. A quick fix would be to add an atexit method which uploads these on resmoke process exit. |