[SERVER-39969] Fix dumping of SessionCatalog in hang analyzer
Created: 05/Mar/19  Updated: 29/Oct/23  Resolved: 09/Apr/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.10

Type: Task
Priority: Major - P3
Reporter: William Schultz (Inactive)
Assignee: William Schultz (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File gdb_session_dump_bt.txt    
Issue Links:
Related
related to SERVER-38045 Dump session catalog using GDB scripting Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-03-25, Repl 2019-04-08, Repl 2019-04-22
Participants:
Linked BF Score: 22

 Description   

The changes from SERVER-38810 broke the mongod-dump-sessions GDB command. We should disable this command from running in the hang analyzer until we have a solution that is compatible with the new changes, or fix it if not too difficult.



 Comments   
Comment by Githook User [ 09/Apr/19 ]

Author: William Schultz (will62794) <william.schultz@mongodb.com>

Message: SERVER-39969 Update the 'mongod-dump-sessions' GDB command to be compatible with the new internal format of TransactionParticipant
Branch: master
https://github.com/mongodb/mongo/commit/a76bf86f90358af04ba1184afc5f011295afba04

Comment by William Schultz (Inactive) [ 08/Apr/19 ]

I dug into the segfault issue a bit more and was able to reproduce it and capture a GDB stack trace (attached as gdb_session_dump_bt.txt). I searched for some of the frames referenced in that stack trace, and it appears that this bug report may be related. This is the command, run on a RHEL 6.2 host, that eventually segfaults when trying to attach to a hung mongod (running on the same machine) and dump the sessions:

/opt/mongodbtoolchain/gdb/bin/gdb --quiet --nx -ex "file mongod" -ex "attach $(pgrep -o mongod)" -ex "echo \nWriting raw stacks to debugger_mongod_XXXXX_raw_stacks.log.\n"  -ex "source /data/mci/buildscripts/gdb/mongo.py" -ex "source /data/mci/buildscripts/gdb/mongo_printers.py" -ex "source /data/mci/buildscripts/gdb/mongo_lock.py" -ex "set scheduler-locking on"  -ex "set pagination off" -ex "mongod-dump-session" -ex "set confirm off" -ex "quit"

This comment on the bug report referenced above claims that running "set print static-members off" avoids the issue. This indeed appears to work for this case. That is, the following command does not crash when dumping the sessions:

/opt/mongodbtoolchain/gdb/bin/gdb --quiet --nx -ex "file mongod" -ex "attach $(pgrep -o mongod)" -ex "echo \nWriting raw stacks to debugger_mongod_42054_raw_stacks.log.\n"  -ex "source /data/mci/buildscripts/gdb/mongo.py" -ex "source /data/mci/buildscripts/gdb/mongo_printers.py" -ex "source /data/mci/buildscripts/gdb/mongo_lock.py" -ex "set scheduler-locking on"  -ex "set pagination off" -ex "set print static-members off" -ex "mongod-dump-session" -ex "set confirm off" -ex "quit"

There appears to be a patch that fixes the bug, but it is dated 2019-03-25 (~2 weeks prior to the writing of this comment), so it likely isn't merged yet.
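
For illustration, the workaround above could be applied when the GDB invocation is assembled programmatically. The following is only a minimal sketch: the helper name build_gdb_args and its parameters are hypothetical and are not taken from the hang analyzer; only the GDB flags and "-ex" commands are copied from the command lines above.

```python
def build_gdb_args(gdb_path, binary, pid, scripts, dump_command,
                   disable_static_members=True):
    """Build a gdb argv list mirroring the command lines quoted above.

    Hypothetical helper for illustration only; it is not part of the
    hang analyzer.
    """
    args = [gdb_path, "--quiet", "--nx",
            "-ex", f"file {binary}",
            "-ex", f"attach {pid}"]
    # Load the GDB Python helper scripts before running any custom command.
    for script in scripts:
        args += ["-ex", f"source {script}"]
    args += ["-ex", "set scheduler-locking on",
             "-ex", "set pagination off"]
    if disable_static_members:
        # Workaround: must be set before the dump command runs.
        args += ["-ex", "set print static-members off"]
    args += ["-ex", dump_command,
             "-ex", "set confirm off",
             "-ex", "quit"]
    return args


if __name__ == "__main__":
    argv = build_gdb_args(
        "/opt/mongodbtoolchain/gdb/bin/gdb", "mongod", 42054,
        ["/data/mci/buildscripts/gdb/mongo.py"], "mongod-dump-sessions")
    print(" ".join(argv))
```

Keeping the setting behind a flag makes it easy to drop the workaround once a fixed GDB is available in the toolchain.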

Comment by William Schultz (Inactive) [ 08/Apr/19 ]

When testing out the fix proposed in my previous comment I ran into a new issue. On at least one platform (RHEL 6.2 Santiago), GDB hits a segmentation fault when it tries to print the value of the full _sessionId field. I believe this manifests as the "Bad exit code -11" error seen in the patch build log. From some debugging on a spawned RHEL host, the issue appears to be related to printing the _uid field of the LogicalSessionId type. I am not yet sure what the underlying cause is, but I am running another patch build that disables printing of the raw _sessionId variable to see if that fixes the problem.
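
As a side note on reading that error: when a child process is launched through Python's subprocess module on POSIX, a negative return code -N means the child was terminated by signal N. The sketch below shows how -11 maps to a signal name; it assumes the "Bad exit code" value comes from such a return code, and the function describe_exit_code is a hypothetical helper, not part of the hang analyzer.

```python
import signal


def describe_exit_code(code):
    """Translate a subprocess-style return code into a readable string."""
    if code < 0:
        # A negative code means "terminated by signal -code" on POSIX.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(-11))
```

On Linux, signal 11 is SIGSEGV, which is consistent with GDB itself segfaulting.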

Comment by William Schultz (Inactive) [ 05/Apr/19 ]

Ok, it looks like it won't be too hard to fix this. We just need to account for the fact that fields on the TransactionParticipant object like _txnState have now been pushed inside either the ObservableState type (the _o field), or the PrivateState type (the _p field). So, for example, when extracting fields from the TransactionParticipant, we just extract fields from txnPart['_o'] or txnPart['_p'] instead of txnPart.
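
A minimal sketch of that lookup follows. It uses plain dicts as stand-ins for gdb.Value objects so that it runs outside GDB; inside GDB, a missing field raises gdb.error rather than KeyError. The helper name get_txn_field, the field name _activeTxnNumber, and the placement of each field under _o or _p are assumptions for illustration only.

```python
def get_txn_field(txn_part, field_name):
    """Look up a field on a TransactionParticipant-like value.

    Tries the ObservableState ('_o') and PrivateState ('_p') containers
    first, then falls back to the object itself (the older layout).
    """
    for container in ("_o", "_p"):
        try:
            return txn_part[container][field_name]
        except KeyError:
            # Container or field is missing; try the next location.
            continue
    return txn_part[field_name]


if __name__ == "__main__":
    # Stand-in for the new layout; which fields live in '_o' versus '_p'
    # is assumed here, not taken from the server source.
    new_layout = {"_o": {"_txnState": "kInProgress"},
                  "_p": {"_activeTxnNumber": 5}}
    # Stand-in for the old layout, with fields directly on the object.
    old_layout = {"_txnState": "kNone"}
    print(get_txn_field(new_layout, "_txnState"))
    print(get_txn_field(old_layout, "_txnState"))
```

Falling back to the top-level object keeps the same script usable against binaries built before the refactoring, at the cost of hiding which container a field actually lives in.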

Comment by William Schultz (Inactive) [ 02/Apr/19 ]

Running a patch build to see if this is broken or not.

Comment by Judah Schvimer [ 11/Mar/19 ]

We will also consider fixing it since it seems worthwhile.

Comment by William Schultz (Inactive) [ 11/Mar/19 ]

I think the specific issue in BF-12368 may have actually been fixed by SERVER-39972, i.e. it was due to the change in the unique_ptr implementation in GCC 7. We might still want to verify that the mongod-dump-sessions GDB command works as expected after the most recent refactoring, though.
