[SERVER-81306] core-analyzer bug fixes Created: 21/Sep/23  Updated: 03/Nov/23  Resolved: 27/Sep/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.2.0-rc0

Type: Task Priority: Major - P3
Reporter: Trevor Guidry Assignee: Trevor Guidry
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Participants:
Linked BF Score: 61

 Description   

This fixes several bugs in the core analyzer.

  1. Previously, the core analyzer looked at the name of the core dump to determine which binary it was generated from. This turns out to be incorrect: a core dump can be named something other than the binary it came from, so the analyzer now uses gdb to determine the correct binary.
  2. Validates the core dumps on the failed tasks to ensure we know how to process at least one of them before generating a task. This prevents us from generating a task which ends up doing nothing because it does not know how to process the core dumps.
  3. Makes core dump downloading/uploading less error prone. I ran into issues where a very small and inconsistent number of core dumps were corrupted or not valid gzipped files. I am assuming this is caused by pigz, so I removed it and now use the standard gzip library. I removed the timeout in the fast_archive function because Evergreen increased the default timeout in the post section to 30 minutes and I don't think we will get close to that limit. Downloads now retry per core dump instead of retrying the download of all cores at once, so a failure to download one core dump doesn't ruin the whole task.
  4. Reduces the number of workers when running gdb. Rarely, when analyzing the core dumps, Evergreen will terminate the host as "system unresponsive" because it failed to return a heartbeat. I am guessing this is because we are clobbering every possible thread on the machine, so hopefully lowering the number of concurrent workers will help.
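
As an illustration of item 3, here is a minimal Python sketch of the two ideas: validating a downloaded archive with the standard gzip library, and retrying each core dump on its own. It is not the code from the commit. The names is_valid_gzip, download_each_with_retry, and download_one are hypothetical, and the retry count and delay are assumed values.

```python
import gzip
import time
import zlib
from typing import Callable, Dict, Iterable


def is_valid_gzip(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` decompresses cleanly from start to end."""
    try:
        with gzip.open(path, "rb") as archive:
            # Read in chunks so a multi-gigabyte core dump is never held in memory.
            while archive.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile; EOFError covers truncated archives.
        return False


def download_each_with_retry(
    names: Iterable[str],
    download_one: Callable[[str], None],  # hypothetical per-core download callback
    attempts: int = 3,                    # assumed value, not taken from the ticket
    delay_secs: float = 0.0,
) -> Dict[str, bool]:
    """Retry each core dump independently so one failure does not affect the rest."""
    results: Dict[str, bool] = {}
    for name in names:
        succeeded = False
        for attempt in range(attempts):
            try:
                download_one(name)
                succeeded = True
                break
            except Exception:
                # Only wait if another attempt is still coming.
                if attempt + 1 < attempts:
                    time.sleep(delay_secs)
        results[name] = succeeded
    return results
```

A caller could pass a download_one that fetches the file and raises when is_valid_gzip returns False, so that a corrupted archive is retried like any other failed download.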

 

 

Old description below:

Currently the task is generated if any core dumps are found on the task. Sometimes we upload core dumps from processes that are not mongo binaries. This can lead to no analysis being done if the only core dumps present are from non-mongo processes.

We need to be smarter about when we generate the tasks and check whether at least one of the core dumps is from a known binary.
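
The check described here could look roughly like the Python sketch below. It assumes the binary behind each core dump has already been identified by some other means. KNOWN_BINARIES and should_generate_task are hypothetical names, and the set contents are illustrative rather than the analyzer's actual list.

```python
from typing import Collection, Mapping, Optional

# Illustrative set only; the real list of supported binaries is an assumption here.
KNOWN_BINARIES = frozenset({"mongod", "mongos"})


def should_generate_task(
    core_to_binary: Mapping[str, Optional[str]],
    known: Collection[str] = KNOWN_BINARIES,
) -> bool:
    """Return True if at least one core dump came from a binary we can analyze.

    `core_to_binary` maps a core dump file name to the name of the binary that
    produced it, or to None when the binary could not be determined.
    """
    return any(binary in known for binary in core_to_binary.values())
```

With this check, a task whose only core dumps come from non-mongo processes would not get an analysis task generated.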

 

An example failure caused by this issue is here https://spruce.mongodb.com/task/mongodb_mongo_master_enterprise_rhel_80_64_bit_dynamic_all_feature_flags_display_replica_sets_abe6f7a64d785277fb223958957252c6f8f89027_23_09_21_11_09_00



 Comments   
Comment by Githook User [ 27/Sep/23 ]

Author:

{'name': 'Trevor Guidry', 'email': 'trevor.guidry@mongodb.com', 'username': ''}

Message: SERVER-81306 core-analyzer bug fixes
Branch: master
https://github.com/mongodb/mongo/commit/3dc142f226ee91d0cf18e251c03808f2b45dd19d

Comment by Trevor Guidry [ 21/Sep/23 ]

max.hirschhorn@mongodb.com Thanks for commenting, I was going to naively use the file name. I will have to think about this more now.

Comment by Max Hirschhorn [ 21/Sep/23 ]

trevor.guidry@mongodb.com, would you please clarify how you intend to detect whether a core dump was generated from a known MongoDB binary? An approach based on the filename won't be possible. This is because %e in kernel.core_pattern=dump_%e.%p.core is substituted with the thread name when the process crashes rather than the process name.

In the linked Evergreen task, BackgroundSync refers to the name of a thread related to the replication subsystem in mongod. It is a core dump which must be successfully analyzed.

[2023/09/21 13:29:21.372] Downloading core dump: dump_BackgroundSync.8073.core.gz

https://parsley.mongodb.com/evergreen/mongodb_mongo_master_enterprise_rhel_80_64_bit_dynamic_all_feature_flags_core_analysis_replica_sets_4_linux_enterprise_VH0PM_abe6f7a64d785277fb223958957252c6f8f89027_23_09_21_11_09_00/0/task?bookmarks=0,388,2472&shareLine=388
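
To make the naming behavior concrete, the Python sketch below expands the two specifiers used in the pattern above. It is a simplified model for illustration only: expand_core_pattern is a hypothetical helper, it handles only %e, %p, and %%, and it leaves every other specifier untouched.

```python
import re


def expand_core_pattern(pattern: str, thread_name: str, pid: int) -> str:
    """Expand %e (name of the crashing thread), %p (pid), and %% in a core pattern."""
    substitutions = {"e": thread_name, "p": str(pid), "%": "%"}

    def replace(match: "re.Match[str]") -> str:
        spec = match.group(1)
        # Specifiers this simplified model does not know are left as written.
        return substitutions.get(spec, match.group(0))

    return re.sub(r"%(.)", replace, pattern)


# The thread name, not "mongod", ends up in the file name.
print(expand_core_pattern("dump_%e.%p.core", "BackgroundSync", 8073))
```

The result matches the file name in the log line above once the .gz suffix is removed, which is why the file name alone cannot identify the binary.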

The core dump contains the contents of /proc/<pid>/exe and this information is (i) displayed in a "Core was generated by" message by gdb and (ii) accessible programmatically within gdb from the info proc exe command.
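
For approach (i), a minimal Python sketch of extracting the binary name from captured gdb output is shown below. It assumes the message wraps the command line in a backtick and a closing single quote, and that the executable path contains no spaces. binary_name_from_gdb_output is a hypothetical helper, and how gdb is invoked to produce the output is left out.

```python
import os
import re
from typing import Optional

# Assumed message shape: Core was generated by `<command line>'.
_GENERATED_BY = re.compile(r"Core was generated by `(?P<cmdline>[^']*)'")


def binary_name_from_gdb_output(gdb_output: str) -> Optional[str]:
    """Return the base name of the executable named in gdb's message, if present."""
    match = _GENERATED_BY.search(gdb_output)
    if match is None:
        return None
    cmdline = match.group("cmdline").strip()
    if not cmdline:
        return None
    # The first whitespace-separated token is taken to be the executable path.
    executable = cmdline.split()[0]
    return os.path.basename(executable)
```

For a core dump such as dump_BackgroundSync.8073.core, this would yield the name of the executable recorded in the core rather than the thread name used in the file name.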

You may be interested in looking at this logic in failed_unittests_gather.sh as an approach for doing (i).

Generated at Thu Feb 08 06:46:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.