[SERVER-71520] Dump all thread stacks on RSTL acquisition timeout Created: 21/Nov/22  Updated: 28/Dec/23  Resolved: 08/Sep/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.2.0-rc0, 7.0.2, 7.1.0-rc2, 6.0.11

Type: Task
Priority: Major - P3
Reporter: Lingzhi Deng
Assignee: Frederic Vitzikam
Resolution: Fixed
Votes: 0
Labels: repl-shortlist
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-76932 Add a way for a thread to know when t... Closed
Related
related to SERVER-56756 Primary cannot stepDown when experien... Closed
related to SERVER-61251 Ensure long running storage engine op... Backlog
is related to SERVER-71521 Improve currentOp to include more pro... Open
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Backport Requested:
v7.1, v7.0, v6.0, v5.0, v4.4
Sprint: Repl 2023-03-06, Repl 2023-03-20, Repl 2023-05-01, Repl 2023-05-15, Repl 2023-05-29, Repl 2023-06-12, Repl 2023-06-26, Repl 2023-07-24, Repl 2023-08-07, Repl 2023-08-21, Repl 2023-09-04, Repl 2023-09-18
Participants:

 Description   

SERVER-56756 added an fassert that crashes the server when it times out acquiring the RSTL during stepUp/stepDown. We currently dump all locks before the fassert, but the lock manager dump is sometimes insufficient for diagnosing the underlying issue. Most of the time, a core dump is needed to understand which operations are currently running and what they are doing. Ideally, we would just dump the stack traces (printAllThreadStacks), but that is not always feasible, especially on production builds. One alternative is to selectively dump currentOp (and maybe the session catalog as well).

SERVER-71521 is an improvement to currentOp that may help with this.

Update: per the discussion in the comments, we decided to dump all thread stacks.



 Comments   
Comment by Githook User [ 20/Sep/23 ]

Author: Frederic Vitzikam <frederic.vitzikam@mongodb.com> (fredvitz)

Message: SERVER-71520 Dump all thread stacks on RSTL acquisition timeout
Branch: v6.0
https://github.com/mongodb/mongo/commit/7f1579c771cffa24a30b5fabc6fb3182dc79351b

Comment by Githook User [ 13/Sep/23 ]

Author: Frederic Vitzikam <frederic.vitzikam@mongodb.com> (fredvitz)

Message: SERVER-71520 Dump all thread stacks on RSTL acquisition timeout
Branch: v7.0
https://github.com/mongodb/mongo/commit/a803a33ecf3ed6ce9d762ea151257d0a6d2e2041

Comment by Githook User [ 12/Sep/23 ]

Author: Frederic Vitzikam <frederic.vitzikam@mongodb.com> (fredvitz)

Message: SERVER-71520 Dump all thread stacks on RSTL acquisition timeout.
Branch: v7.1
https://github.com/mongodb/mongo/commit/f61a333becaa4e7cfed912b446c2ce547c3e46d5

Comment by Frederic Vitzikam [ 08/Sep/23 ]

I checked v7.1 as well because SERVER-71520 ended up requiring two PRs, and the first one (using fassertFailedNoTrace, which we later found out also suppresses the core dump) had already gone into v7.1. Lingzhi and I think it is better to get the second half in too in this case.

Comment by Githook User [ 30/Aug/23 ]

Author: Frederic Vitzikam <frederic.vitzikam@mongodb.com> (fredvitz)

Message: SERVER-71520 Dump all thread stacks on RSTL acquisition timeout.
Branch: master
https://github.com/mongodb/mongo/commit/a27f8aa4245082366e228b3d10e10e2baa4d601e

Comment by Githook User [ 28/Aug/23 ]

Author: Frederic Vitzikam <frederic.vitzikam@mongodb.com> (fredvitz)

Message: SERVER-71520 Dump all thread stacks on RSTL acquisition timeout
Branch: master
https://github.com/mongodb/mongo/commit/1d79a4a974bad7b049db839422b4b79cfd92044e

Comment by Billy Donahue [ 01/May/23 ]

I don't think there's any (or at least not much) special work to do here.

Any thread can make the SignalHandler do its normal stack dumping via:
kill(getpid(), SIGUSR2)

The trick might be figuring out when it's finished.
That might require some new synchronization with the SignalHandler thread.
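
A minimal sketch of what that synchronization could look like, assuming a POSIX system and a dedicated thread that consumes SIGUSR2 via sigwait(). The class StackDumpCoordinator and its methods are hypothetical names for illustration, not the server's actual SignalHandler, and the stack dump itself is only a placeholder comment. The requester raises the signal on its own process and then waits on a condition variable until the handler thread reports a completed dump.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

// Hypothetical helper for illustration; this is not the server's SignalHandler.
class StackDumpCoordinator {
public:
    StackDumpCoordinator() {
        sigemptyset(&_signals);
        sigaddset(&_signals, SIGUSR2);
        // Block SIGUSR2 in the calling thread. Threads created afterwards inherit
        // the mask, so only the sigwait() loop below consumes the signal.
        pthread_sigmask(SIG_BLOCK, &_signals, nullptr);
        _worker = std::thread([this] { _run(); });
    }

    ~StackDumpCoordinator() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _stop = true;
        }
        // Wake the worker out of sigwait() so it can observe _stop and exit.
        pthread_kill(_worker.native_handle(), SIGUSR2);
        _worker.join();
    }

    // Sends SIGUSR2 to this process, then waits until a dump completes.
    // Returns false if no dump completed within `timeout`.
    bool requestDumpAndWait(std::chrono::milliseconds timeout) {
        std::uint64_t before;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            before = _completed;
        }
        kill(getpid(), SIGUSR2);
        std::unique_lock<std::mutex> lk(_mutex);
        return _cv.wait_for(lk, timeout, [&] { return _completed > before; });
    }

    std::uint64_t completedDumps() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _completed;
    }

private:
    void _run() {
        for (;;) {
            int sig = 0;
            if (sigwait(&_signals, &sig) != 0) {
                continue;
            }
            {
                std::lock_guard<std::mutex> lk(_mutex);
                if (_stop) {
                    return;
                }
            }
            // Placeholder: a real handler would dump every thread's stack here.
            {
                std::lock_guard<std::mutex> lk(_mutex);
                ++_completed;
            }
            _cv.notify_all();
        }
    }

    sigset_t _signals;
    std::mutex _mutex;
    std::condition_variable _cv;
    std::uint64_t _completed = 0;
    bool _stop = false;
    std::thread _worker;
};
```

In this sketch the coordinator would be constructed early in main(), before other threads start, and the timeout path would call requestDumpAndWait() with a bounded timeout before proceeding to the fassert. Two caveats: standard signals do not queue, so concurrent requests may be served by a single dump, and a waiter can be woken by a dump that was already in progress when it sent its request.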

Comment by Judah Schvimer [ 21/Nov/22 ]

We should double-check whether there is a way to print stack traces in production when we go to implement this.

Generated at Thu Feb 08 06:19:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.