[SERVER-46360] Hang analyzer json logging improvement Created: 24/Feb/20  Updated: 29/Oct/23  Resolved: 10/Mar/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.4.0-rc4, 4.7.0

Type: Improvement Priority: Major - P3
Reporter: Andy Schwerin Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-46634 LockInfo results missing a layer of h... Closed
Documented
is documented by DOCS-11516 document lockInfo command Closed
Related
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Sprint: Dev Tools 2020-02-24, Dev Tools 2020-03-09, Dev Tools 2020-03-23
Participants:

 Description   

Log messages from the hang analyzer under structured logging should be structured, not just text embedded in a string in a json attribute.



 Comments   
Comment by Githook User [ 04/May/20 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-46360 LockManager logging improvement (and lockInfo)

(cherry picked from commit e2d5e628f4c65a66eb5ca39f43aa7f71b2fd1d6b)
Branch: v4.4
https://github.com/mongodb/mongo/commit/13df3646789570a6910eea75239859fcac03d7f8

Comment by Billy Donahue [ 10/Mar/20 ]

changes the undocumented lockInfo command

Comment by Githook User [ 10/Mar/20 ]

Author:

{'username': 'BillyDonahue', 'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com'}

Message: SERVER-46360 LockManager logging improvement (and lockInfo)
Branch: master
https://github.com/mongodb/mongo/commit/e2d5e628f4c65a66eb5ca39f43aa7f71b2fd1d6b

Comment by Billy Donahue [ 05/Mar/20 ]

Code Review:
https://mongodbcr.appspot.com/578250001/

Also addresses SERVER-46634

Comment by Billy Donahue [ 02/Mar/20 ]

I think I didn't hit the right target in my first commit here.
With an interactive gdb, I was able to cause the hang-analyzer's mongo_locks.py functions to emit something, and it's still ugly.

(gdb) mongodb-show-locks
Mutex at 0x7fffe7731ea8 held by "hangTest0" (Thread 0x7fffd1cd5700 (LWP 24937)) waited on by "hangTest3" (Thread 0x7fffd04d2700 (LWP 24940))
Mutex at 0x7fffe7731ea8 held by "hangTest0" (Thread 0x7fffd1cd5700 (LWP 24937)) waited on by "hangTest2" (Thread 0x7fffd0cd3700 (LWP 24939))
Mutex at 0x7fffe7731ea8 held by "hangTest0" (Thread 0x7fffd1cd5700 (LWP 24937)) waited on by "hangTest1" (Thread 0x7fffd14d4700 (LWP 24938))
(gdb) mongodb-dump-locks
Running Hang Analyzer Supplement - MongoDBDumpLocks
{"t":{"$date":"2020-03-02T15:11:27.255-0500"},"s":"I", "c":"-",       "id":20521,"ctx":"initandlisten","msg":"Dumping LockManager @ {lock_manager}","attr":{"lock_manager":140737295262928}}
{"t":{"$date":"2020-03-02T15:11:27.255-0500"},"s":"I", "c":"-",       "id":20522,"ctx":"initandlisten","msg":"{sb_str}","attr":{"sb_str":"Lock @ 0x7FFFDE7E6BE0: {6917529027641081857: Global, 1}\nGRANTED:\n\tLockRequest 180 @ 0x7FFFDF0C4A20: Mode = IS; Thread = 140736797210368 | 0x7fffd6cdf700; ConvertMode = NONE; EnqueueAtFront = 0; CompatibleFirst = 0; DebugInfo = { serverStatus: 1, tcMalloc: true, sharding: false, timing: false, defaultRWConcern: false, $db: \"\" }; ClientInfo = { desc: \"ftdc\", opid: 180 }\nPENDING:\n-----------------------------------------------------------\n"}}
{"t":{"$date":"2020-03-02T15:11:27.255-0500"},"s":"I", "c":"-",       "id":20522,"ctx":"initandlisten","msg":"{sb_str}","attr":{"sb_str":"Lock @ 0x7FFFDF4EB860: {4611686018427387905: ReplicationStateTransition, 1}\nGRANTED:\n\tLockRequest 180 @ 0x7FFFDF0C4A20: Mode = IX; Thread = 140736797210368 | 0x7fffd6cdf700; ConvertMode = NONE; EnqueueAtFront = 0; CompatibleFirst = 0; DebugInfo = { serverStatus: 1, tcMalloc: true, sharding: false, timing: false, defaultRWConcern: false, $db: \"\" }; ClientInfo = { desc: \"ftdc\", opid: 180 }\nPENDING:\n-----------------------------------------------------------\n"}}
{"t":{"$date":"2020-03-02T15:11:27.255-0500"},"s":"I", "c":"-",       "id":20522,"ctx":"initandlisten","msg":"{sb_str}","attr":{"sb_str":"Lock @ 0x7FFFDF4EC3A0: {16140901064495857666: Mutex, 2, checkpointCursorMutex}\nGRANTED:\n\tLockRequest 83 @ 0x7FFFE02F6D20: Mode = X; Thread = 140736830781184 | 0x7fffd8ce3700; ConvertMode = NONE; EnqueueAtFront = 0; CompatibleFirst = 0; DebugInfo = ; ClientInfo = { desc: \"WTCheckpointThread\", opid: 83 }\nPENDING:\n-----------------------------------------------------------\n"}}

The real target here should be message IDs related to 20521 and 20522.

Comment by Githook User [ 28/Feb/20 ]

Author:

{'name': 'Billy Donahue', 'username': 'BillyDonahue', 'email': 'billy.donahue@mongodb.com'}

Message: SERVER-46360 LockManager dump improvements
Branch: master
https://github.com/mongodb/mongo/commit/793ae32c597f197b6445750aa9bfdaabc206132d

Comment by Eric Milkie [ 27/Feb/20 ]

Your analysis sounds correct to me. I didn't actually see a hang analyzer graph creation fail; only the output of the lock dump. So this ticket can be to just clean up the lock dump output.

Comment by Billy Donahue [ 26/Feb/20 ]

milkie, upon closer inspection, the hang analyzer seems to be doing all of its reconnaissance through gdb inspection of C++ symbols and function calls. This is all in `buildscripts/gdb/mongo_lock.py`. The construction of the lock graph is done by find_mutex_holder. In there we just backtrace all the threads and examine their extant call to std::mutex::lock, if any. We extract the waited-on mutex object's __owner member, and generate a graph of waiting threads to mutex-holding threads, connected by edges representing mutex objects that connect the threads. It all looks to be on the up-and-up, and I don't see any structured logging influence here.

The only thing structured logging affects is the stdout of the hang_analyzer. Is that the only problem?
That is, it looks like the generation of the "waitsfor_graph" is fine? Please let me know if I've got this right and if you have a method to test this. Just running a mongod and attaching gdb, I don't see any locks so the hang analyzer is a little boring.

Comment by Billy Donahue [ 26/Feb/20 ]

Summary of problem and requirements as I understand them.

The hang_analyzer.py attaches gdb to a running mongod. gdb (via the buildscripts/gdb/mongo.py command extensions) makes a target function call to the C++ function

"call ('mongo::(anonymous namespace)::globalLockManager').dump()"

(updated to correct function name)

So we need to upgrade the LOGV2 message from its current StringBuilder into a structured object, and change the lines that parse the gdb output.

Comment by Eric Milkie [ 26/Feb/20 ]

I think we need to update the lock-dump output, and then change the hang analyzer helper script that generates the waits-for graph to be able to parse this new format.

Comment by Andy Schwerin [ 24/Feb/20 ]

Not sure exactly the required improvement, here.

Generated at Thu Feb 08 05:11:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.