[SERVER-27874] Display locks and generate digraph for threads using LockManager locks and/or pthread_mutexes
Created: 31/Jan/17  Updated: 07/Sep/17  Resolved: 16/Mar/17

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.4.5, 3.5.5

Type: Improvement Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Jonathan Abrahams
Resolution: Done Votes: 0
Labels: bkp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-28234 GDB frame information not available o... Closed
depends on SERVER-28373 GDB thread-local variables not availa... Closed
is depended on by SERVER-28348 Detect single-process deadlocks invol... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.4
Sprint: TIG 2017-03-06, TIG 2017-03-27
Participants:

 Description   

This just needs to be done for mongod in GDB.

Integrate siyuan.zhou's code and verify it works with all combinations of --dbg=off/--dbg=on and --opt=off/--opt=on builds.



 Comments   
Comment by Githook User [ 09/May/17 ]

Author: Jonathan Abrahams (hptabster) <jonathan@mongodb.com>

Message: SERVER-27874 - Fix quoting for test_flags
Branch: master
https://github.com/mongodb/mongo/commit/3448ddb2ec7b3e210dfc97c0296dd7496f5c7813

Comment by Githook User [ 28/Apr/17 ]

Author: Jonathan Abrahams (hptabster) <jonathan@mongodb.com>

Message: SERVER-27874 - Hang analysis thread backtrace and mongo locks

  • Run unique thread on Solaris
  • Add a legend to graph file
  • Do not generate digraph file, if graph is empty

(cherry picked from commit 19abe0c2dacef784aad78b89d6c6111109fbca88)
Branch: v3.4
https://github.com/mongodb/mongo/commit/6417b938fa92ba778f2d6944b7070bfe30b0115f

Comment by Githook User [ 28/Apr/17 ]

Author: Jonathan Abrahams (hptabster) <jonathan@mongodb.com>

Message: SERVER-27874 Display locks and generate digraph for threads using LockManager locks and/or pthread_mutexes

(cherry picked from commit 5fe822f53e4bb28e15af2541c0ca931fa05a0e20)
Branch: v3.4
https://github.com/mongodb/mongo/commit/e900ab362a6d40bd24fa1499fb0dbb5e9908f75d

Comment by Githook User [ 21/Mar/17 ]

Author: Jonathan Abrahams (hptabster) <jonathan@mongodb.com>

Message: SERVER-27874 - Hang analysis thread backtrace and mongo locks

Comment by Githook User [ 16/Mar/17 ]

Author: Jonathan Abrahams (hptabster) <jonathan@mongodb.com>

Message: SERVER-27874 Display locks and generate digraph for threads using LockManager locks and/or pthread_mutexes
Branch: master
https://github.com/mongodb/mongo/commit/5fe822f53e4bb28e15af2541c0ca931fa05a0e20

Comment by Jonathan Abrahams [ 28/Feb/17 ]

Per eddie.louie's comment in BF-4651:

1. The LockManager dump produced by mongodb-analyze does not include mutexes, i.e. the mutexes that threads are waiting on. It is unclear whether this can be remedied. If it can, the deadlock cycle could potentially be found from this output alone.
2. The mongodb-deadlock-detect call identifies threads in two different ways, which can be misleading as to whether a deadlock exists. The digraph it outputs uses both the full '68 Thread 0x7f4b2bfdd700 (LWP 27095)' label and the bare '27095' id, so drawing the graph in GraphvizFiddle may not show an obvious cycle.
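
To illustrate item 2, here is a minimal sketch of one way the two thread spellings could be collapsed onto a single node id before drawing the graph. It is a post-processing illustration only, not how the script was changed. The helper names `node_key` and `normalize_edges` are hypothetical, and the regular expression assumes the labels follow the '(LWP <id>)' form quoted above.

```python
import re

# Matches the "(LWP 27095)" suffix of a full gdb thread label.
_LWP_RE = re.compile(r"\(LWP (\d+)\)")


def node_key(name):
    """Return the LWP id if the name is a full thread label, else the name itself."""
    match = _LWP_RE.search(name)
    if match:
        return match.group(1)
    return name.strip()


def normalize_edges(edges):
    """Rewrite every edge endpoint so each thread appears under one key."""
    return [(node_key(src), node_key(dst)) for src, dst in edges]
```

With this normalization, an edge that names a thread as '68 Thread 0x7f4b2bfdd700 (LWP 27095)' and another that names it as '27095' end up on the same node, so a cycle through that thread is no longer split in two.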

Comment by Siyuan Zhou [ 15/Feb/17 ]

I ran the visualization tool on the core dump of a deadlock found in my patch build.

Luckily, it only involved mutexes, so the tool worked. I managed to get the dot graph by running the following commands in gdb after downloading mongo_lock.py from my repo.

set pagination off
source ../source-patch-180_mongodb-mongo-master/buildscripts/gdb/mongo.py
source ../mongo_lock.py
mongodb-deadlock-detect

This website can visualize the following graph for you. It's clear that there's a cycle.

digraph "mongod+lock-status" {
    "0x7fcafb035700" -> "0x7fcb27517c08";
    "0x7fcb03846700" -> "0x7fcb27517c08";
    "0x7fcb04848700" -> "0x7fcb27517c08";
    "0x7fcb22481700" -> "0x7fcb27517c08";
    "0x7fcb11475700" -> "0x7fcb27517c08";
    "0x7fcb13c7a700" -> "0x7fcb2a75f058";
    "0x7fcb1547d700" -> "0x7fcb27517c08";
    "0x7fcb15c7e700" -> "0x7fcb27517c08";
    "0x7fcb27517c08" -> "0x7fcb13c7a700";
    "0x7fcb2a75f058" -> "0x7fcb04848700";
    "0x7fcb22482da0" [label="65   Thread 0x7fcb22482da0 (LWP 14599)"]
    "0x7fcafa834700" [label="64   Thread 0x7fcafa834700 (LWP 14665)"]
    "0x7fcafb035700" [label="63   Thread 0x7fcafb035700 (LWP 14664)"]
    "0x7fcafb836700" [label="62   Thread 0x7fcafb836700 (LWP 14663)"]
    "0x7fcafc037700" [label="61   Thread 0x7fcafc037700 (LWP 14662)"]
    "0x7fcafc838700" [label="60   Thread 0x7fcafc838700 (LWP 14661)"]
    "0x7fcafd039700" [label="59   Thread 0x7fcafd039700 (LWP 14660)"]
    "0x7fcafd83a700" [label="58   Thread 0x7fcafd83a700 (LWP 14659)"]
    "0x7fcafe03b700" [label="57   Thread 0x7fcafe03b700 (LWP 14658)"]
    "0x7fcafe83c700" [label="56   Thread 0x7fcafe83c700 (LWP 14657)"]
    "0x7fcaff03d700" [label="55   Thread 0x7fcaff03d700 (LWP 14656)"]
    "0x7fcaff83e700" [label="54   Thread 0x7fcaff83e700 (LWP 14655)"]
    "0x7fcb0003f700" [label="53   Thread 0x7fcb0003f700 (LWP 14654)"]
    "0x7fcb00840700" [label="52   Thread 0x7fcb00840700 (LWP 14653)"]
    "0x7fcb01041700" [label="51   Thread 0x7fcb01041700 (LWP 14652)"]
    "0x7fcb01842700" [label="50   Thread 0x7fcb01842700 (LWP 14651)"]
    "0x7fcb02043700" [label="49   Thread 0x7fcb02043700 (LWP 14650)"]
    "0x7fcb02844700" [label="48   Thread 0x7fcb02844700 (LWP 14649)"]
    "0x7fcb03045700" [label="47   Thread 0x7fcb03045700 (LWP 14648)"]
    "0x7fcb03846700" [label="46   Thread 0x7fcb03846700 (LWP 14647)"]
    "0x7fcb04047700" [label="45   Thread 0x7fcb04047700 (LWP 14646)"]
    "0x7fcb04848700" [label="44   Thread 0x7fcb04848700 (LWP 14645)"]
    "0x7fcb05049700" [label="43   Thread 0x7fcb05049700 (LWP 14644)"]
    "0x7fcb0584a700" [label="42   Thread 0x7fcb0584a700 (LWP 14643)"]
    "0x7fcb0604b700" [label="41   Thread 0x7fcb0604b700 (LWP 14642)"]
    "0x7fcb0684c700" [label="40   Thread 0x7fcb0684c700 (LWP 14641)"]
    "0x7fcb0704d700" [label="39   Thread 0x7fcb0704d700 (LWP 14640)"]
    "0x7fcb0784e700" [label="38   Thread 0x7fcb0784e700 (LWP 14639)"]
    "0x7fcb0804f700" [label="37   Thread 0x7fcb0804f700 (LWP 14638)"]
    "0x7fcb08850700" [label="36   Thread 0x7fcb08850700 (LWP 14637)"]
    "0x7fcb09051700" [label="35   Thread 0x7fcb09051700 (LWP 14636)"]
    "0x7fcb09852700" [label="34   Thread 0x7fcb09852700 (LWP 14635)"]
    "0x7fcb0a053700" [label="33   Thread 0x7fcb0a053700 (LWP 14634)"]
    "0x7fcb0a854700" [label="32   Thread 0x7fcb0a854700 (LWP 14633)"]
    "0x7fcb0b055700" [label="31   Thread 0x7fcb0b055700 (LWP 14632)"]
    "0x7fcb0b856700" [label="30   Thread 0x7fcb0b856700 (LWP 14631)"]
    "0x7fcb0c057700" [label="29   Thread 0x7fcb0c057700 (LWP 14630)"]
    "0x7fcb0c858700" [label="28   Thread 0x7fcb0c858700 (LWP 14629)"]
    "0x7fcb0d059700" [label="27   Thread 0x7fcb0d059700 (LWP 14628)"]
    "0x7fcb0d85a700" [label="26   Thread 0x7fcb0d85a700 (LWP 14627)"]
    "0x7fcb0e05b700" [label="25   Thread 0x7fcb0e05b700 (LWP 14626)"]
    "0x7fcb0e85c700" [label="24   Thread 0x7fcb0e85c700 (LWP 14625)"]
    "0x7fcb0f05d700" [label="23   Thread 0x7fcb0f05d700 (LWP 14624)"]
    "0x7fcb22481700" [label="22   Thread 0x7fcb22481700 (LWP 14623)"]
    "0x7fcb0fc72700" [label="21   Thread 0x7fcb0fc72700 (LWP 14622)"]
    "0x7fcb10473700" [label="20   Thread 0x7fcb10473700 (LWP 14621)"]
    "0x7fcb10c74700" [label="19   Thread 0x7fcb10c74700 (LWP 14620)"]
    "0x7fcb11475700" [label="18   Thread 0x7fcb11475700 (LWP 14619)"]
    "0x7fcb11c76700" [label="17   Thread 0x7fcb11c76700 (LWP 14618)"]
    "0x7fcb12477700" [label="16   Thread 0x7fcb12477700 (LWP 14617)"]
    "0x7fcb12c78700" [label="15   Thread 0x7fcb12c78700 (LWP 14616)"]
    "0x7fcb13479700" [label="14   Thread 0x7fcb13479700 (LWP 14615)"]
    "0x7fcb13c7a700" [label="13   Thread 0x7fcb13c7a700 (LWP 14614)"]
    "0x7fcb1447b700" [label="12   Thread 0x7fcb1447b700 (LWP 14613)"]
    "0x7fcb1547d700" [label="11   Thread 0x7fcb1547d700 (LWP 14612)"]
    "0x7fcb14c7c700" [label="10   Thread 0x7fcb14c7c700 (LWP 14611)"]
    "0x7fcb15c7e700" [label="9    Thread 0x7fcb15c7e700 (LWP 14609)"]
    "0x7fcb1647f700" [label="8    Thread 0x7fcb1647f700 (LWP 14608)"]
    "0x7fcb16c80700" [label="7    Thread 0x7fcb16c80700 (LWP 14607)"]
    "0x7fcb17481700" [label="6    Thread 0x7fcb17481700 (LWP 14606)"]
    "0x7fcb17c82700" [label="5    Thread 0x7fcb17c82700 (LWP 14605)"]
    "0x7fcb18483700" [label="4    Thread 0x7fcb18483700 (LWP 14604)"]
    "0x7fcb18c84700" [label="3    Thread 0x7fcb18c84700 (LWP 14603)"]
    "0x7fcb19485700" [label="2    Thread 0x7fcb19485700 (LWP 14602)"]
    "0x7fcb19c86700" [label="1    Thread 0x7fcb19c86700 (LWP 14601)"]
    "0x7fcb27517c08" [label="Mutex"]
    "0x7fcb2a75f058" [label="Mutex"]
}

It took 30 mins in total from starting a spawn host to finding the deadlock.
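
The cycle can also be found without a visualizer. The sketch below runs a depth-first search over the edges copied from the digraph above. It assumes that a thread -> mutex edge means "waits for" and a mutex -> thread edge means "held by"; the function name `find_cycle` is hypothetical and is not part of mongo_lock.py.

```python
# Edges copied from the digraph above.
EDGES = [
    ("0x7fcafb035700", "0x7fcb27517c08"),
    ("0x7fcb03846700", "0x7fcb27517c08"),
    ("0x7fcb04848700", "0x7fcb27517c08"),
    ("0x7fcb22481700", "0x7fcb27517c08"),
    ("0x7fcb11475700", "0x7fcb27517c08"),
    ("0x7fcb13c7a700", "0x7fcb2a75f058"),
    ("0x7fcb1547d700", "0x7fcb27517c08"),
    ("0x7fcb15c7e700", "0x7fcb27517c08"),
    ("0x7fcb27517c08", "0x7fcb13c7a700"),
    ("0x7fcb2a75f058", "0x7fcb04848700"),
]


def find_cycle(edges):
    """Return one cycle as a list of nodes (first == last), or None if acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    # 0 = unvisited, 1 = on the current DFS path, 2 = fully explored.
    state = {node: 0 for node in graph}

    def visit(node, path):
        state[node] = 1
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == 1:
                # Back edge: the cycle is the path from nxt to here, closed.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == 0:
                found = visit(nxt, path)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in graph:
        if state[node] == 0:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    print(" -> ".join(find_cycle(EDGES)))
```

On these edges the search reports a cycle through mutex 0x7fcb27517c08, thread 13 (0x7fcb13c7a700), mutex 0x7fcb2a75f058 and thread 44 (0x7fcb04848700). The other waiting threads are blocked on the first mutex but are not part of the cycle.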

Comment by Siyuan Zhou [ 13/Feb/17 ]

Here is the latest version of the code, including the change to the hang analyzer. It generates the dependency graph in dot format, which can be visualized here.

This patch on Evergreen shows the result for a deadlock in a unit test.

Comment by Siyuan Zhou [ 13/Feb/17 ]

I used gdb.parse_and_eval(), but that also doesn't work with a core dump.

Comment by Max Hirschhorn [ 12/Feb/17 ]

siyuan.zhou, you had mentioned that you worked on your script again during Skunkworks to avoid calling a function and support performing the automatic deadlock detection on a core dump. Is that code available somewhere?
