[SERVER-45133] Dump condition variable state in hang analyzer Created: 13/Dec/19 Updated: 08/Jan/24 Resolved: 10/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible | ||||
| Sprint: | Service Arch 2020-03-23, Service Arch 2020-04-06, Service arch 2020-04-20 | ||||
| Participants: | |||||
| Description |
|
The Diagnostics for Latches project gave us more information on condition variables. It would be great to include these in the hang analyzer. |
| Comments |
| Comment by Githook User [ 14/Apr/20 ] | ||||||||||||||||||||||||||||
|
Author: {'name': 'Spencer T Brody', 'email': 'spencer@mongodb.com', 'username': 'stbrody'} Message: |
| Comment by Githook User [ 09/Apr/20 ] | ||||||||||||||||||||||||||||
|
Author: {'name': 'Spencer T Brody', 'email': 'spencer@10gen.com', 'username': 'stbrody'} Message: This reverts commit cbf705e93a73052045e7157380748198b427bcd5. |
| Comment by Spencer Brody (Inactive) [ 09/Apr/20 ] | ||||||||||||||||||||||||||||
|
Well I re-ran the mongodb-dump-mutexes gdb command against a core dump from the failure that Billy linked, but it passed without issue. So now I have no idea what the problem was. I'm just going to go ahead and re-push this and see if it happens again. | ||||||||||||||||||||||||||||
| Comment by Benjamin Caimano (Inactive) [ 09/Apr/20 ] | ||||||||||||||||||||||||||||
|
That is possible. The actual registration for the decoration happens here. It runs pretty early, before we fork, but that doesn't mean it isn't still too early. |
| Comment by Spencer Brody (Inactive) [ 09/Apr/20 ] | ||||||||||||||||||||||||||||
|
I no longer believe this to be an issue of a mismatch in stdlib versions. The failure that Billy linked actually has other calls to the hang_analyzer against other mongods in it, and those all work successfully. The one that fails is run against the mongod running on port 22021, which seems to be the source of the hang that triggered the hang analyzer in the first place. This mongod is restarted,
starts WiredTiger recovery:
and then we don't hear from it again until the hang analyzer starts running, almost 20 minutes later.
According to this node's stack trace, it never got very far in the startup process at all; it is still stuck in storage engine startup:
So maybe the issue is that the DiagnosticInfo decoration never got initialized? |
| Comment by Githook User [ 08/Apr/20 ] | ||||||||||||||||||||||||||||
|
Author: {'name': 'Spencer T Brody', 'email': 'spencer@10gen.com', 'username': 'stbrody'} Message: Revert " This reverts commit 3fc737f00d47fd2c59683579119c2083541ea5a2. |
| Comment by Spencer Brody (Inactive) [ 08/Apr/20 ] | ||||||||||||||||||||||||||||
|
Hmm... I don't think the issue is the DiagnosticInfoHandle being missing, since we get past the line where we dereference the DiagnosticInfoHandle to get the underlying list of DiagnosticInfo objects. That list being empty shouldn't be a problem either, since we iterate over the list and just won't enter the loop if it is empty. The main error from the stack trace appears to be:
This seems to be coming out of the stdlib pretty printers. Given that it actually knows the type of 'diagnostic_info_list', the list must exist. The issue seems to be that the pretty printer expects the std::forward_list to have an '_Node' member. Is it possible there's a mismatch between the implementation of the standard library being used and the pretty printer we are using? This is the line where we import the pretty printer. |
| Comment by Billy Donahue [ 08/Apr/20 ] | ||||||||||||||||||||||||||||
|
Reopening, as we might need a small tweak to add an exception catch to handle the case with no DiagnosticInfo? |
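For illustration only, a rough sketch of what such a catch might look like in a GDB Python command; the function and value names here are made up for the example and are not the actual buildscripts/gdb code:

```python
# Sketch of a defensive wrapper around the pretty-printer-dependent part of
# the dump: if formatting the DiagnosticInfo list throws (for example, the
# "There is no member named _Node" mismatch seen above), skip that thread
# instead of aborting the whole mongodb-dump-mutexes run. Names are illustrative.
import gdb

def dump_diagnostic_info(thread_desc, diagnostic_info_list):
    """diagnostic_info_list: a gdb.Value holding the thread's std::forward_list."""
    try:
        # str() on a gdb.Value invokes the registered pretty printer; a
        # mismatched libstdc++ printer raises here.
        print("{}: {}".format(thread_desc, str(diagnostic_info_list)))
    except Exception as exc:  # gdb.error, or whatever the printer raises
        print("Skipping DiagnosticInfo for {}: {}".format(thread_desc, exc))
```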
| Comment by Billy Donahue [ 08/Apr/20 ] | ||||||||||||||||||||||||||||
|
[The /sh101874/ lines in the log are the hang analyzer, included below.] The mutex dumper in the hang analyzer is crapping out on RHEL 6.2 in Evergreen. It would be good if we handled that case gracefully. |
| Comment by Githook User [ 07/Apr/20 ] | ||||||||||||||||||||||||||||
|
Author: {'name': 'Spencer T Brody', 'email': 'spencer@10gen.com', 'username': 'stbrody'} Message: |
| Comment by Benjamin Caimano (Inactive) [ 03/Apr/20 ] | ||||||||||||||||||||||||||||
|
Yeah, it is correct. We have SERVER-45059 to be able to distinguish between lock and waitForConditionOrInterrupt. | ||||||||||||||||||||||||||||
| Comment by Misha Tyulenev [ 03/Apr/20 ] | ||||||||||||||||||||||||||||
|
spencer, it sounds correct. CC ben.caimano |
| Comment by Spencer Brody (Inactive) [ 02/Apr/20 ] | ||||||||||||||||||||||||||||
|
Actually, it looks like if you are sleeping on a condition variable, you will still show up as waiting on the mutex that you released to go to sleep on the condvar. It looks like there's no way to distinguish a thread that is blocked waiting to acquire a mutex from one that has successfully acquired a mutex and is now waiting on a condition variable (which implicitly unlocks the mutex). misha.tyulenev, is that expected? |
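For illustration, a small, language-neutral sketch of the two states using Python's threading primitives (not the server's Mutex/condition_variable types): one thread is blocked acquiring the lock, the other has already released that same lock inside a condition-variable wait, yet any bookkeeping keyed only on which mutex a thread is associated with would attribute both threads to `m`:

```python
import threading
import time

m = threading.Lock()
cv = threading.Condition(m)

def condvar_waiter():
    with cv:            # acquires m
        cv.wait()       # releases m while sleeping; re-acquires it on wakeup

def mutex_waiter():
    with m:             # blocks in acquire() until m is free
        pass

t1 = threading.Thread(target=condvar_waiter)
t2 = threading.Thread(target=mutex_waiter)
t1.start()
time.sleep(0.1)         # let t1 reach cv.wait() and release m
t2.start()              # t2 acquires m immediately, because t1 released it
t2.join()
with cv:
    cv.notify()         # wake t1 so the example exits cleanly
t1.join()
```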
| Comment by Spencer Brody (Inactive) [ 01/Apr/20 ] | ||||||||||||||||||||||||||||
|
The title of this ticket is asking about dumping condvar state, but it's worth noting that the "Diagnostics for Latches" project gave us info on latches (mutexes), not condition variables. If a thread is trying to wake up from a condition variable sleep but can't because it can't acquire the mutex it needs, then that will show up here, but if a thread is sleeping on a condition variable because it's waiting for some arbitrary condition to become true, that won't be captured here. |
| Comment by Spencer Brody (Inactive) [ 25/Mar/20 ] | ||||||||||||||||||||||||||||
|
Given that this information is already captured in currentOp, I wonder if we would be better served by turning this ticket into "dump currentOp() in hang analyzer"? mira.carey@mongodb.com | ||||||||||||||||||||||||||||
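If we did go that route, a rough sketch of what the hang analyzer could do is below; the helper is hypothetical, not existing code, and it assumes the hung mongod is still responsive enough to serve an aggregation with $currentOp:

```python
from pymongo import MongoClient

def dump_current_op(port):
    # Connect to the suspect mongod with a short timeout so a truly hung
    # process doesn't stall the rest of the hang analyzer run.
    client = MongoClient("localhost", port, serverSelectionTimeoutMS=5000)
    # $currentOp must be the first stage of an aggregation on the admin database.
    pipeline = [{"$currentOp": {"allUsers": True, "idleConnections": True}}]
    for op in client.admin.aggregate(pipeline):
        print(op)
```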
| Comment by Ian Whalen (Inactive) [ 06/Jan/20 ] | ||||||||||||||||||||||||||||
|
Assigning to Service Arch: given that Service Arch's project was to help developers with using latches, we believe this makes a lot more sense to live with the Service Arch team (also because they will generally have a lot more context around how to do this). |