[SERVER-45133] Dump condition variable state in hang analyzer Created: 13/Dec/19  Updated: 08/Jan/24  Resolved: 10/Apr/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.7.0

Type: New Feature Priority: Major - P3
Reporter: Judah Schvimer Assignee: Spencer Brody (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Backwards Compatibility: Fully Compatible
Sprint: Service Arch 2020-03-23, Service Arch 2020-04-06, Service arch 2020-04-20
Participants:

 Description   

The Diagnostics for Latches project gave us more information on condition variables. It would be great to include these in the hang analyzer.



 Comments   
Comment by Githook User [ 14/Apr/20 ]

Author:

{'name': 'Spencer T Brody', 'email': 'spencer@mongodb.com', 'username': 'stbrody'}

Message: SERVER-45133 Fix hang analyzer
Branch: master
https://github.com/mongodb/mongo/commit/4cd113b09ee3878798b590a43b64e056de17e303

Comment by Githook User [ 09/Apr/20 ]

Author:

{'name': 'Spencer T Brody', 'email': 'spencer@10gen.com', 'username': 'stbrody'}

Message: SERVER-45133 Dump mutex info in hang analyzer

This reverts commit cbf705e93a73052045e7157380748198b427bcd5.
Branch: master
https://github.com/mongodb/mongo/commit/90e56683e8d9c24f155ffbb64bb13b4169c924e8

Comment by Spencer Brody (Inactive) [ 09/Apr/20 ]

Well, I re-ran the mongodb-dump-mutexes GDB command against a core dump from the failure that Billy linked, but it passed without issue. So now I have no idea what the problem was.

I'm just going to go ahead and re-push this and see if it happens again.
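
For reference, re-running that GDB command against a core dump can be scripted roughly like this (a sketch only; "core.98754" is a hypothetical core-file name, while the gdb path, mongod binary path, script path, and command name all appear elsewhere in this ticket's logs):

import subprocess

# Sketch: run the custom GDB command from buildscripts/gdb/mongo.py against a core dump.
# "core.98754" is a hypothetical core-file name; the other paths and the command name
# come from the logs on this ticket.
subprocess.run([
    "/opt/mongodbtoolchain/gdb/bin/gdb",
    "-batch",
    "-ex", "source buildscripts/gdb/mongo.py",  # registers the mongodb-* commands
    "-ex", "mongodb-dump-mutexes",
    "dist-test/bin/mongod",                     # the binary that produced the core
    "core.98754",
], check=True)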

Comment by Benjamin Caimano (Inactive) [ 09/Apr/20 ]

That is possible. The actual registration for the decoration happens here, which is pretty early (before we fork), but that doesn't mean it isn't too early.

Comment by Spencer Brody (Inactive) [ 09/Apr/20 ]

I no longer believe this to be an issue of a mismatch in stdlib versions.

The failure that Billy linked actually has other calls to the hang_analyzer against other mongods in it, and those all work successfully.  The one that fails is run against the mongod running on port 22021, which seems to be the source of the hang that triggered the hang analyzer in the first place.

This mongod is restarted:

2020-04-08T05:17:26.961+00:00 I  STORAGE  4615611 [initandlisten] "MongoDB starting","attr":{"pid":98754,"port":22021,"dbPath":"/data/db/job8/mongorunner/backupRestore/mongod-22021","architecture":"64-bit","host":"ip-10-122-19-24"} 

starts WiredTiger recovery:

2020-04-08T05:17:27.816+00:00 I  STORAGE  22430   [initandlisten] "WiredTiger message","attr":{"message":"[1586323047:816974][98754:0x7fcb7ba63b40], txn-recover: Set global recovery timestamp: (1586323041, 142)"} 

and then we don't hear from it again until the hang analyzer starts running, almost 20 minutes later:

sh101874| Debugger /opt/mongodbtoolchain/gdb/bin/gdb, analyzing mongod process with PID 98754 

 

According to this node's stack trace, it never got very far in the startup process at all; it is still stuck in storage engine startup:

[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| Thread 1: "initandlisten" (Thread 0x7fcb7ba63b40 (LWP 98754))
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #0  0x00007fcb78bfca5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #1  0x00007fcb7cd92bc9 in __wt_cond_wait_signal (session=session@entry=0x7fcb81256108, cond=0x7fcb811dae00, usecs=usecs@entry=200, run_func=run_func@entry=0x0, signalled=signalled@entry=0x7fffbb3b8c37) at src/third_party/wiredtiger/src/os_posix/os_mtx_cond.c:115
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #2  0x00007fcb7cd737fa in __wt_cond_wait (run_func=0x0, usecs=200, cond=<optimized out>, session=0x7fcb81256108) at src/third_party/wiredtiger/src/include/misc.i:19
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #3  __log_wait_for_earlier_slot (session=0x7fcb81256108, slot=0x7fcb83afd290) at src/third_party/wiredtiger/src/log/log.c:171
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #4  0x00007fcb7cd7837b in __wt_log_release (session=0x7fcb81256108, slot=0x7fcb83afd290, freep=<optimized out>) at src/third_party/wiredtiger/src/log/log.c:1909
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #5  0x00007fcb7cd7a7ef in __log_write_internal (flags=8, lsnp=0x0, record=<optimized out>, session=0x7fcb81256108) at src/third_party/wiredtiger/src/log/log.c:2668
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #6  __wt_log_write (session=session@entry=0x7fcb81256108, record=<optimized out>, lsnp=lsnp@entry=0x0, flags=8) at src/third_party/wiredtiger/src/log/log.c:2551
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #7  0x00007fcb7cdd82ed in __wt_txn_checkpoint_log (session=session@entry=0x7fcb81256108, full=full@entry=true, flags=<optimized out>, lsnp=0x0) at src/third_party/wiredtiger/src/txn/txn_log.c:543
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #8  0x00007fcb7cdd5897 in __txn_checkpoint (cfg=<optimized out>, session=0x7fcb81256108) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1037
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #9  __txn_checkpoint_wrapper (session=0x7fcb81256108, cfg=<optimized out>) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1076
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #10 0x00007fcb7cdd6659 in __wt_txn_checkpoint (session=session@entry=0x7fcb81256108, cfg=cfg@entry=0x7fffbb3b9040, waiting=waiting@entry=true) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1133
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #11 0x00007fcb7c705dd5 in __session_checkpoint (wt_session=0x7fcb81256108, config=<optimized out>) at src/third_party/wiredtiger/src/session/session_api.c:1979
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #12 0x00007fcb7cde1ae0 in __wt_rollback_to_stable (session=<optimized out>, cfg=cfg@entry=0x0, no_ckpt=no_ckpt@entry=false) at src/third_party/wiredtiger/src/txn/txn_rollback_to_stable.c:1127
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #13 0x00007fcb7cdda529 in __wt_txn_recover (session=<optimized out>, session@entry=0x7fcb81254000) at src/third_party/wiredtiger/src/txn/txn_recover.c:758
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.393+0000 sh101874| #14 0x00007fcb7cd4add7 in __wt_connection_workers (session=session@entry=0x7fcb81254000, cfg=cfg@entry=0x7fffbb3b9470) at src/third_party/wiredtiger/src/conn/conn_open.c:211
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.394+0000 sh101874| #15 0x00007fcb7cd46a7d in wiredtiger_open (home=<optimized out>, event_handler=event_handler@entry=0x7fcb811d7488, config=<optimized out>, connectionp=connectionp@entry=0x7fcb811d7478) at src/third_party/wiredtiger/src/conn/conn_api.c:2677
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.394+0000 sh101874| #16 0x00007fcb7cd008f9 in mongo::WiredTigerKVEngine::_openWiredTiger (this=0x7fcb811d7400, path="/data/db/job8/mongorunner/backupRestore/mongod-22021", wtOpenConfig="create,cache_size=1024M,cache_overflow=(file_max=0M),session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor"...) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/basic_string.h:2281
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.394+0000 sh101874| #17 0x00007fcb7cd06c68 in mongo::WiredTigerKVEngine::WiredTigerKVEngine (this=0x7fcb811d7400, canonicalName=..., path="/data/db/job8/mongorunner/backupRestore/mongod-22021", cs=<optimized out>, extraOpenOptions=..., cacheSizeMB=1024, maxCacheOverflowFileSizeMB=0, durable=true, ephemeral=false, repair=false, readOnly=false) at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:762
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.394+0000 sh101874| #18 0x00007fcb7ccce118 in mongo::(anonymous namespace)::WiredTigerFactory::create (this=<optimized out>, params=..., lockFile=<optimized out>) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/unique_ptr.h:342
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.394+0000 sh101874| #19 0x00007fcb7d3f69f5 in mongo::initializeStorageEngine (service=0x7fcb81055800, initFlags=<optimized out>) at src/third_party/boost-1.70.0/boost/optional/optional.hpp:1448
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.394+0000 sh101874| #20 0x00007fcb7ca77a0f in mongo::(anonymous namespace)::_initAndListen (serviceContext=0x7fcb81055800, listenPort=<optimized out>) at src/mongo/db/db.cpp:359
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.394+0000 sh101874| #21 0x00007fcb7ca7be99 in mongo::(anonymous namespace)::initAndListen (listenPort=<optimized out>, service=0x7fcb81055800) at src/mongo/db/db.cpp:1325
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.394+0000 sh101874| #22 mongo::(anonymous namespace)::mongoDbMain (argc=33, argv=0x7fffbb3bc558, envp=<optimized out>) at src/mongo/db/db.cpp:1325
[js_test:backup_restore_stop_start] 2020-04-08T05:35:27.394+0000 sh101874| #23 0x00007fcb7c9dfc69 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at src/mongo/db/db.cpp:1346 

So maybe the issue is that the DiagnosticInfo decoration never got initialized?

Comment by Githook User [ 08/Apr/20 ]

Author:

{'name': 'Spencer T Brody', 'email': 'spencer@10gen.com', 'username': 'stbrody'}

Message: Revert "SERVER-45133 Dump mutex info in hang analyzer"

This reverts commit 3fc737f00d47fd2c59683579119c2083541ea5a2.
Branch: master
https://github.com/mongodb/mongo/commit/b0cbfa49aa813dfa2c130d1ae02ea081b470a1c2

Comment by Spencer Brody (Inactive) [ 08/Apr/20 ]

Hmm... I don't think the issue is the DiagnosticInfoHandle being missing, since we get past the line where we dereference the DiagnosticInfoHandle to get the underlying list of DiagnosticInfo objects. That list being empty shouldn't be a problem either, since we iterate over the list and simply won't enter the loop if it's empty.

The main error from the stack trace appears to be:

Cannot find type class std::forward_list<mongo::DiagnosticInfo, std::allocator<mongo::DiagnosticInfo> >::_Node

This seems to be coming out of the stdlib pretty printers. Given that it actually knows the type of 'diagnostic_info_list', the list must exist. The issue seems to be that the pretty printer expects the std::forward_list type to have a nested '_Node' type.

Is it possible there's a mismatch between the implementation of the standard library being used and the pretty printer we are using?

This is the line where we import the pretty printer.
Is RHEL6.2 not using the std library from /opt/mongodbtoolchain/v3?
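
For context, registering the toolchain's libstdc++ pretty printers from a GDB Python script typically looks something like the sketch below (illustrative only, not the exact import in buildscripts/gdb/mongo.py; the path is the one in the traceback further down). If the standard library the binary actually uses doesn't match the layout these printers expect, type lookups like '_Node' can fail in exactly this way.

import sys

# Illustrative sketch of how the toolchain's libstdc++ pretty printers are typically
# registered from a GDB Python script; the path is the one in the traceback below.
PRINTERS_PATH = "/opt/mongodbtoolchain/v3/share/gcc-8.2.0/python"
if PRINTERS_PATH not in sys.path:
    sys.path.insert(0, PRINTERS_PATH)

from libstdcxx.v6 import register_libstdcxx_printers
register_libstdcxx_printers(None)  # None registers the printers globally in GDB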

Comment by Billy Donahue [ 08/Apr/20 ]

Reopening, as we might need a small tweak to add an exception catch to handle the case with no DiagnosticInfo?
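
A minimal sketch of what that tweak could look like (the variable names follow the traceback below; this is illustrative, not the actual patch):

import gdb

def dump_diagnostic_info(diagnostic_info_list):
    """Dump each DiagnosticInfo in a client's list, tolerating pretty-printer failures."""
    printer = gdb.default_visualizer(diagnostic_info_list)
    if printer is None:
        print("No pretty printer registered for the DiagnosticInfo list; skipping")
        return
    try:
        for _, diag_info in printer.children():
            print(diag_info)
    except (gdb.error, ValueError) as err:
        # e.g. "Cannot find type ...::_Node" raised by the libstdc++ printers
        print("Ignoring error '{}' while dumping DiagnosticInfo".format(err))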

Comment by Billy Donahue [ 08/Apr/20 ]

https://logkeeper.mongodb.org/lobster/build/30469f88f08428fb4f1724ed1bc7907e/test/5e8d5e41be07c4639747b484#bookmarks=0%2C37103%2C37266&l=1

[The /sh101874/ lines in the log are the hang analyzer, included below] 

The mutex dumper in the hang analyzer is crapping out on RHEL 6.2 in Evergreen.
Maybe this is a build with no DiagnosticInfo to dump?

It would be good if we handled that case gracefully.

 

Not generating the digraph, since the lock graph is empty
Running Print JavaScript Stack Supplement
Ignoring GDB error 'No type "mozjs" within class or namespace "mongo".' in javascript_stack
Ignoring GDB error 'No type "mozjs" within class or namespace "mongo".' in javascript_stack
Ignoring GDB error 'No type "mozjs" within class or namespace "mongo".' in javascript_stack
Ignoring GDB error 'No type "mozjs" within class or namespace "mongo".' in javascript_stack
Ignoring GDB error 'No type "mozjs" within class or namespace "mongo".' in javascript_stack
Dumping 0 Session objects from the SessionCatalog
Dumping mutex info for all Clients
Traceback (most recent call last):
  File "/data/mci/449f55c2e723de4855da099604d011c4/src/buildscripts/gdb/mongo.py", line 385, in invoke
    for _, diag_info in printer.children():
  File "/opt/mongodbtoolchain/v3/share/gcc-8.2.0/python/libstdcxx/v6/printers.py", line 976, in children
    nodetype = find_type(self.val.type, '_Node')
  File "/opt/mongodbtoolchain/v3/share/gcc-8.2.0/python/libstdcxx/v6/printers.py", line 99, in find_type
    raise ValueError("Cannot find type %s::%s" % (str(orig), name))
ValueError: Cannot find type class std::forward_list<mongo::DiagnosticInfo, std::allocator<mongo::DiagnosticInfo> >::_Node
Error occurred in Python command: Cannot find type class std::forward_list<mongo::DiagnosticInfo, std::allocator<mongo::DiagnosticInfo> >::_Node
Detaching from program: /data/mci/449f55c2e723de4855da099604d011c4/src/dist-test/bin/mongod, process 98754
[Inferior 1 (process 98754) detached]
Done analyzing mongod process with PID 98754
Done analyzing all processes for hangs
./buildscripts/hang_analyzer.py:651: DeprecationWarning: dist() and linux_distribution() functions are deprecated in Python 3.5
  distro = platform.linux_distribution()

 
 

Comment by Githook User [ 07/Apr/20 ]

Author:

{'name': 'Spencer T Brody', 'email': 'spencer@10gen.com', 'username': 'stbrody'}

Message: SERVER-45133 Dump mutex info in hang analyzer
Branch: master
https://github.com/mongodb/mongo/commit/3fc737f00d47fd2c59683579119c2083541ea5a2

Comment by Benjamin Caimano (Inactive) [ 03/Apr/20 ]

Yeah, that is correct. We have SERVER-45059 to be able to distinguish between lock and waitForConditionOrInterrupt.

Comment by Misha Tyulenev [ 03/Apr/20 ]

spencer, that sounds correct. CC ben.caimano

Comment by Spencer Brody (Inactive) [ 02/Apr/20 ]

Actually, it looks like if you are sleeping on a condition variable, you will still show up as waiting on the mutex that you released in order to sleep on the condvar. There appears to be no way to distinguish a thread that is blocked waiting to acquire a mutex from one that has successfully acquired the mutex and is now waiting on a condition variable (which implicitly unlocks the mutex).

misha.tyulenev, is that expected?
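
For reference, the underlying semantics, sketched below with Python's threading primitives purely as an illustration (this is not MongoDB code): waiting on a condition variable releases the associated mutex while the thread sleeps, so a sleeping waiter and a thread blocked trying to acquire the mutex can look the same to instrumentation that only tracks the mutex.

import threading

mu = threading.Lock()
cv = threading.Condition(mu)
ready = False

def waiter():
    with cv:               # acquires mu
        while not ready:
            cv.wait()      # releases mu while sleeping, re-acquires it on wakeup

def blocked_acquirer():
    with mu:               # blocks until mu is free
        pass

# Both threads look "associated with" mu, but only blocked_acquirer is actually trying
# to acquire it; waiter has released it and is asleep on the condition variable.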

Comment by Spencer Brody (Inactive) [ 01/Apr/20 ]

The title of this ticket asks about dumping condvar state, but it's worth noting that the "Diagnostics for Latches" project gave us info on latches (mutexes), not condition variables. If a thread is trying to wake up from a condition variable sleep but can't because it can't acquire the mutex it needs, that will show up here; but if a thread is sleeping on a condition variable because it's waiting for some arbitrary condition to become true, that won't be captured here.

Comment by Spencer Brody (Inactive) [ 25/Mar/20 ]

Given that this information is already captured in currentOp, I wonder if we would be better served by turning this ticket into "dump currentOp() in hang analyzer"? mira.carey@mongodb.com

Comment by Ian Whalen (Inactive) [ 06/Jan/20 ]

Assigning to Service Arch. Given that Service Arch's project was to help developers with using latches, we believe this work makes a lot more sense living with the Service Arch team (they also will generally have a lot more context on how to do this).
