[SERVER-70802] Mongod data directory and FTDC files not uploaded as part of timeout Created: 24/Oct/22  Updated: 29/Oct/23  Resolved: 21/Dec/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.3.0-rc0

Type: Bug Priority: Major - P3
Reporter: Alex Neben Assignee: Juan Gu
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt Dependency
Related
related to SERVER-72613 Speed up taking core dumps with the h... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Linked BF Score: 14

 Description   

https://www.mongodb.com/docs/manual/administration/analyzing-mongodb-performance/#full-time-diagnostic-data-capture

It appears that FTDC data is not being collected when a task fails with a timeout.

An earlier project, which included the work done in SERVER-46687, made it possible to retrieve FTDC data after a timeout failure. Perhaps this stopped working recently? This was noticed while trying to solve a BF.



 Comments   
Comment by Githook User [ 21/Dec/22 ]

Author:

{'name': 'Juan Gu', 'email': 'juan.gu@mongodb.com', 'username': 'juangugit'}

Message: SERVER-70802 Ensure data files are uploaded on Evergreen timeout
Branch: master
https://github.com/mongodb/mongo/commit/6cd0e5c322cf8cc1b24722543a0f7e5604f85ed8

Comment by Max Hirschhorn [ 17/Nov/22 ]

Juan, Tausif, and I walked through how PM-1569 arranged for the hang analyzer to cause data files to be uploaded.

  1. The evergreen agent triggers the timeout: phase of the Evergreen project configuration.
  2. python buildscripts/resmoke.py hang-analyzer is called to send a SIGUSR1 to the resmoke process running tests.
  3. The SIGUSR1 handler in resmoke invokes python buildscripts/resmoke.py hang-analyzer to attach the debugger to all of the processes transitively spawned through resmoke (children, grandchildren, etc.).
  4. The hang analyzer invokes gdb to capture core dumps of all of the MongoDB processes. See also SERVER-56167.
  5. The hang analyzer invokes gdb to capture additional diagnostic information from each of the live MongoDB processes.
  6. The hang analyzer sends a SIGKILL to all of the analyzed processes.
  7. The resmoke job thread detects the test has now completed and runs archival, if enabled.
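
The signal-driven part of this sequence (steps 2, 3, 6, and 7) can be sketched roughly as follows. This is not resmoke's actual code: the names spawned, analyze_and_kill, and on_sigusr1 are hypothetical, the gdb work of steps 4 and 5 is left as a comment, and the process signals itself only to keep the example self-contained. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys
import threading

# Hypothetical stand-in for the test runner's bookkeeping of spawned processes.
spawned = []
analysis_done = threading.Event()


def analyze_and_kill(procs):
    """Placeholder for steps 4-6: dump, inspect, then SIGKILL each process."""
    for proc in procs:
        # A real hang analyzer would attach gdb here; this sketch skips that.
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()  # reap the process so its exit status is recorded


def on_sigusr1(signum, frame):
    """Step 3: the SIGUSR1 handler starts analysis of everything we spawned."""
    analyze_and_kill(spawned)
    analysis_done.set()


signal.signal(signal.SIGUSR1, on_sigusr1)

# Stand-in for a hung MongoDB process.
spawned.append(
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(600)"])
)

# Step 2: SIGUSR1 is sent to the test runner (here, the process signals itself).
os.kill(os.getpid(), signal.SIGUSR1)
analysis_done.wait(timeout=30)

# Step 7: the runner can now see that the process exited and run archival.
exit_codes = [proc.returncode for proc in spawned]
```

The point of the sketch is the ordering: nothing after the handler's kill loop can happen until the analysis before it has finished.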

The fundamental issue is that resmoke archival only runs once the hang analyzer has killed the processes. However, step (5) is known to take long enough that the evergreen agent abandons the work in the timeout: phase of the Evergreen project configuration after 15 minutes, which means steps (6) and (7) never run.

My proposal for SERVER-70802 is to spawn an additional thread inside the hang analyzer that monitors the gdb process in step (5) and kills it if needed, so that the remaining steps complete within the 15 minutes allotted by the evergreen agent. With this strategy we would always get core dumps and always get data files.
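
A minimal sketch of that watchdog idea, assuming the gdb invocation is an ordinary child process. The helper name run_with_deadline and its parameters are hypothetical and this is not the code from the eventual commit. It uses threading.Timer, which is itself a thread, to kill the child if it is still running at the deadline.

```python
import subprocess
import threading


def run_with_deadline(cmd, deadline_secs):
    """Run cmd and kill it if it is still running after deadline_secs.

    Returns (returncode, was_killed). Hypothetical helper for illustration.
    """
    proc = subprocess.Popen(cmd)
    killed = threading.Event()

    def kill_if_still_running():
        # Runs on the timer thread once the deadline passes.
        if proc.poll() is None:
            killed.set()
            proc.kill()

    watchdog = threading.Timer(deadline_secs, kill_if_still_running)
    watchdog.daemon = True
    watchdog.start()
    try:
        returncode = proc.wait()
    finally:
        # Harmless if the timer has already fired.
        watchdog.cancel()
    return returncode, killed.is_set()
```

For example, run_with_deadline(["gdb", ...], 300) would cap a gdb invocation at five minutes. Both the argument list and the 300-second figure are placeholders, and the real deadline would have to leave room for the remaining steps inside the 15 minutes.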

The longer-term outlook I have for collecting diagnostics for hangs is that we rely more on post-processing to get the same information we can get from the live process. This has very much been my motivation for creating https://github.com/visemet/gdb-mongodb-server to supplant the mongodb-dump-locks gdb command.
