[SERVER-76739] Archive data missing on evergreen task timeouts Created: 02/May/23  Updated: 29/Oct/23  Resolved: 16/May/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 6.0.7, 5.0.19, 7.0.0-rc2

Type: Bug Priority: Critical - P2
Reporter: Alex Neben Assignee: Tausif Rahman (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0, v6.0, v5.0
Participants:

 Description   

Possibly related to SERVER-75860

Example 1:

https://parsley.mongodb.com/evergreen/mongodb_mongo_v7.0_enterprise_rhel_80_64_bit_con[…]12_23_04_24_16_15_19/0/task?bookmarks=0,1710&shareLine=1480

https://spruce.mongodb.com/task/mongodb_mongo_v7.0_enterprise_rhel_80_64_bit_concurren[…]3_04_24_16_15_19/logs?execution=0&sortBy=STATUS&sortDir=ASC

Example 2:

https://parsley.mongodb.com/evergreen/mongodb_mongo_v7.0_enterprise_amazon2_arm64_conc[…]12_23_04_24_16_15_19/0/task?bookmarks=0,1763&shareLine=1533



 Comments   
Comment by Tausif Rahman (Inactive) [ 25/May/23 ]

Adding some slack messages here b/c some of the above comments were not 100% right & so people can understand this:
tausif.rahman 

Following up here after stand-up:

  1. call hang-analyzer on live resmoke.py process
    1. -> hang-analyzer sends SIGUSR1 to resmoke.py (here)
  2. resmoke.py sigusr1 handler triggered
    1. -> internally starts another hang-analyzer process on all mongo{d,s} (and other processes) that were spawned by this resmoke.py process. (here)
      1. hang-analyer uses the -c and -k flags which indicates to take the core_dump and kill the process
  3. After the “resmoke spawned” hang-analyzer process finishes, resmoke.py execution continues as normal
    1. Upon completion of the currently running test, resmoke does symbolization which can take a lot of time (here) — the script downloads debug symbols
    2. We also run some hooks — but I think those might be fast (here)

If (2) & (3) or more specifically 2.1. & 3.1. take more than 15 minutes, we never hit the archival block (here).

  1. We can add a --timeout flag to the hang-analyzer where after the timeout, any pids passed with the -k flag will be killed and the process will finish. In 2.1., we will invoke the hang-analyzer with --timeout 12 so that 3 minutes will remain for resmoke to finish up in (3).
  2. We will also not run the symbolizer in 3.1. if SIGUSR1 has been triggered. OR we could also try to archive before symbolization, but I’m not sure how easy that is with the way things are structured. Either way, the archive should always run successfully with the remaining 3 mins.
    **

maxh
Minor clarification for (2.1) the hang analyzer is running in the same resmoke python process. _hang_analyzer.execute() is calling more python code inline and using the subcommand as its entrypoint

In the first example https://parsley.mongodb.com/evergreen/mongodb_mongo_v7.0_enterprise_rhel_80_64_bit_con[…]12_23_04_24_16_15_19/0/task?bookmarks=0,1710&shareLine=1480 the hang analyzer had finished killed the processes and thus had returned from _hang_analyzer.execute() already. I suspect the symbolizer service logic could be where this time is getting spent. My thought would be when the sighandler.py code is called we flip a global setting which disables the symbolizer service from running because the 15-minute clock is tickingThere are definitely cases where the hang is induced because a process crashed and so the symbolizer service logic symbolizing the C++ backtrace for ease of searchability and triage is beneficial. However we'll have a core dump for the crash and so the information is knowable from there and running the symbolizer service logic requires downloading and extracting the debug symbols which like (3a) is saying can be slow

Comment by Githook User [ 22/May/23 ]

Author:

{'name': 'Tausif Rahman', 'email': 'tausif.rahman@mongodb.com', 'username': 'trahman1318'}

Message: SERVER-76739 Archive data missing on evergreen task timeouts

(cherry picked from commit 0249b20d05c58ceb47375413e9eb47e028859489)
Branch: v6.0
https://github.com/mongodb/mongo/commit/98b7678e4fb28d68834c73a28aa107c370292839

Comment by Githook User [ 22/May/23 ]

Author:

{'name': 'Tausif Rahman', 'email': 'tausif.rahman@mongodb.com', 'username': 'trahman1318'}

Message: SERVER-76739 Archive data missing on evergreen task timeouts

(cherry picked from commit 0249b20d05c58ceb47375413e9eb47e028859489)
Branch: v5.0
https://github.com/mongodb/mongo/commit/0e8ad101a75798c5237bb284985cbd6523eb2b6c

Comment by Githook User [ 18/May/23 ]

Author:

{'name': 'Tausif Rahman', 'email': 'tausif.rahman@mongodb.com', 'username': 'trahman1318'}

Message: SERVER-76739 Archive data missing on evergreen task timeouts

(cherry picked from commit 0249b20d05c58ceb47375413e9eb47e028859489)
Branch: v7.0
https://github.com/mongodb/mongo/commit/457a802d9e1d03d8ac9e305a30ff210942b7edd9

Comment by Githook User [ 16/May/23 ]

Author:

{'name': 'Tausif Rahman', 'email': 'tausif.rahman@mongodb.com', 'username': 'trahman1318'}

Message: SERVER-76739 Archive data missing on evergreen task timeouts
Branch: master
https://github.com/mongodb/mongo/commit/0249b20d05c58ceb47375413e9eb47e028859489

Comment by Daniel Moody [ 02/May/23 ]

tausif.rahman@mongodb.com sounds good thanks for clarifying.

Comment by Alex Neben [ 02/May/23 ]
  • Have an archive map which adds an "ArchiveItem" on test creation. Upon test completion, it processes it's corresponding "ArchiveItem". In the SIGUSR1 handler, we archive any items that are still in the archive map. <- (I believe this will make handling SERVER-55756 easier too)

I read this as this was our plan and we don't need to do an atexit? Was that wrong?

Comment by Tausif Rahman (Inactive) [ 02/May/23 ]

Sorry daniel.moody@mongodb.com should have been more clear. At first I was considering using the atexit for this, but now I'm scratching the atexit stuff in favor of the more robust SIGUSR1 sighandler which will make SERVER-55756 easier too.

Comment by Daniel Moody [ 02/May/23 ]

tausif.rahman@mongodb.com Are you saying use atexit for archiving the data files in this ticket, and then there is a follow on ticket to move archiving process into the SIGUSR1 sighandler?

Comment by Alex Neben [ 02/May/23 ]

This sounds great to me!

Comment by Tausif Rahman (Inactive) [ 02/May/23 ]

daniel.moody@mongodb.com - Some follow up: Looks like only non-python processes resume. Also, I looked into having the hang-analyzer perform the archive but archive relies on some resmoke test infrastructure (ie: whether the test passed or failed – as well as other more complex logic) which is not currently available via the hang-analyzer.
------------------
When we send SIGUSR1 from the hang analyzer, the SIGUSR1 handler logic runs on the resmoke.py process & the process is terminated. I did some experimentation & atexit does run successfully right after the SIGUSR1 handler logic & should complete long before the 15 min timeout.

Timeline of events:

resmoke.py process: * Starts test suite

 
…2 Hours later Evergreen Timeout procedure starts…
 
hang-analyzer.py process: * Send SIGUSR1 to resmoke.py process

    • Evg Timeout procedure times out after 15 mins

Next Steps:

  • Have an archive map which adds an "ArchiveItem" on test creation. Upon test completion, it processes it's corresponding "ArchiveItem". In the SIGUSR1 handler, we archive any items that are still in the archive map. <- (I believe this will make handling SERVER-55756 easier too)
Comment by Daniel Moody [ 02/May/23 ]

Currently the testing itself is the one that saves the test data files, but is that required? Does saving the data files require some test specific knowledge or can hang analyzer perform it?

Comment by Alex Neben [ 02/May/23 ]

We can't get every single thing we need but from a high level I think the correct order of importance should be data files (especially since I am totally guessing they should be quicker), core dumps, then stack traces.

Comment by Daniel Moody [ 02/May/23 ]

The hang analyzer will resume the python process after its done, unfortunately it ran out of time, specifically during Symbolization of stacktrace files. I don't know if atexit will help because the python processes are suspended from sigusr1, and then when the hang analyzer runs out of time, I am not sure how evergreen shuts down the current running tasks/processes? I assume its a hard kill so at exit would not run?

Comment by Tausif Rahman (Inactive) [ 02/May/23 ]

I see that we archive data files for failling tests/hooks in this finally block. When the hang analyzer runs, it sends a SIGUSR1 to the python process & it looks like the archive in the finally block is never getting run. The core dumps still get uploaded because that is performed as part of the hang analyzer process. A quick fix would be to add an atexit method which uploads these on resmoke process exit.

Generated at Thu Feb 08 06:33:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.