[SERVER-46693] Parallelize debugger processes in hang-analyzer Created: 06/Mar/20  Updated: 16/Sep/20  Resolved: 22/Apr/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Vlad Rachev (Inactive) Assignee: Raiden Worley (Inactive)
Resolution: Won't Fix Votes: 0
Labels: tig-hanganalyzer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt Dependency
has to be done before SERVER-46682 Reuse debugger process for processes ... Closed
Related
Sprint: STM 2020-05-04
Participants:

 Description   

Investigate parallelizing the debugging of all of the processes. This is more of a nice-to-have, so we will timebox it to 2 days [16 hours].

The risk in this ticket is that parallelizing debugger processes could use too much memory (by trying to load debug symbols for all process types). Therefore we should only do this if the ticket above does not give us enough performance gain.

An additional risk is that the overhead of threads could negate the performance increase from parallelization.
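As a rough illustration of the idea only (not the hang analyzer's actual code; the function names and gdb command line below are hypothetical), debugging each pid could be farmed out to a thread pool, with one debugger per process:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def dump_process(pid, gdb_cmd="gdb"):
    # Hypothetical batch-mode invocation; the real hang analyzer builds a
    # much richer command line (symbol paths, custom dump commands, etc.).
    cmd = [gdb_cmd, "--batch", "-p", str(pid),
           "-ex", "thread apply all bt"]
    return subprocess.run(cmd, capture_output=True, text=True)

def dump_all(pids, gdb_cmd="gdb", max_workers=4):
    # One debugger per process, all running concurrently. Each gdb loads
    # its own copy of the debug symbols, which is exactly the memory risk
    # described above.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(partial(dump_process, gdb_cmd=gdb_cmd), pids))
```

Since the work is dominated by the spawned gdb processes rather than Python-side computation, a thread pool (rather than a process pool) should be enough here, which keeps the thread-overhead risk small.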



 Comments   
Comment by Max Hirschhorn [ 23/Apr/20 ]

Chatted with Raiden over Slack and he said he'd continue looking into using SIGSTOP to pause processes before attaching the debugger.

(gdb) handle SIGSTOP ignore
(gdb) handle SIGSTOP noprint

made it so that the mongodb-dump-locks command no longer hit the following error from gdb for me locally:

(gdb) mongodb-dump-locks
Running Hang Analyzer Supplement - MongoDBDumpLocks
 
Thread 1 "mongod" received signal SIGSTOP, Stopped (signal).
mongo::LockManager::dump (this=0x7ffb53f974d0 <mongo::(anonymous namespace)::globalLockManager>) at src/mongo/db/concurrency/lock_manager.cpp:845
845    void LockManager::dump() const {
Ignoring error 'The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(mongo::LockManager::dump() const) will be abandoned.
When the function is done executing, GDB will silently stop.' in dump_mongod_locks

Comment by Raiden Worley (Inactive) [ 22/Apr/20 ]

As a sidenote, I tried sending SIGSTOP to all mongods before attaching to any of them, as mentioned above, and the output seemed the same as without sending the signal. (I got "Not generating the digraph, since the lock graph is empty" from trying to show the waitfor graph.) I didn't get any output from dumping locks in gdb, but the log output from the lock manager dump did appear in the server logs. It also didn't freeze or anything like I would have expected from the comments above. I confirmed that the SIGSTOP was received by trying to connect to the mongod (with the debugger execution commented out).

I wonder if there were any changes to how the server responds to SIGSTOP, or to how GDB interacts with stopped processes since TIG-768 was first investigated. Might be worth looking into TIG-768 again separately from this.

Comment by Raiden Worley (Inactive) [ 22/Apr/20 ]

Completely parallelizing debugger execution didn't seem to work. The mongod.debug file takes about 3.1 GB of disk space, and running the hang analyzer on a single mongod shows an increase in memory usage of about 3 GB. This scales with the number of debuggers spawned, so running two debuggers at once uses about 6 GB of memory according to free. It looks like the whole file gets loaded into memory.

Interestingly, the memory usage didn't change that much while actually loading symbols, but shot up to 3 GB when dumping threads. I wonder if the "loading symbols" phase is more about indexing the symbols file, while the symbols are only actually loaded into memory when used?

Most of this memory usage is in the buffers/cached line of free, so I was hoping that gdb didn't actually need the whole file in memory and might be smart enough to still work, albeit more slowly, once memory started filling up. I spawned 14 standalone mongods (the -large variant has 30 GB of memory) and ran the parallelized hang analyzer, but memory usage filled up and, right after dumping threads, the output froze and the SSH pipe was broken. Running grep -i kill /var/log/messages* showed a line with gdb invoked oom-killer.

Since the -small RHEL variant only has 7 GB of memory, I think it'll struggle to even run two debuggers in parallel, and definitely crash when running for a whole 3-node replset or more. I think re-using loaded symbols in SERVER-46682 is a more promising way to improve hang analyzer performance, so we'll close this one as "won't fix".
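For reference, the buffers/cached numbers mentioned here can be pulled programmatically from /proc/meminfo (Linux-only; this helper is mine, not part of the hang analyzer), which makes it easy to log memory before and after each debugger run instead of eyeballing free:

```python
def meminfo_kib(*fields):
    # Return the requested /proc/meminfo fields as a {name: KiB} dict,
    # e.g. meminfo_kib("MemAvailable", "Buffers", "Cached").
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, rest = line.split(":", 1)
            if name in fields:
                values[name] = int(rest.split()[0])
    return values
```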

Comment by Vlad Rachev (Inactive) [ 12/Mar/20 ]

We'll investigate this first, before SERVER-46682; if this works out, the other ticket won't be needed.

Comment by Vlad Rachev (Inactive) [ 10/Mar/20 ]

In the design max.hirschhorn brought up a good point:

"I will say I'm in favor of experimenting with this idea (e.g. trying to load symbols after attaching) because freezing the state of the processes at approximately the same moments helps to avoid the cluster state being perturbed with heartbeats not being sent, etc. while the debugger is attached."

Given that potential benefit, we will investigate this ticket. As part of the investigation, use large replica sets to test the memory usage.

Generated at Thu Feb 08 05:12:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.