[SERVER-29980] Built-in hang detection diagnostics and recovery Created: 05/Jul/17 Updated: 08/Jan/24 Resolved: 05/Feb/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Diagnostics |
| Affects Version/s: | None |
| Fix Version/s: | 4.2.0 |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Alyson Cabral (Inactive) |
| Resolution: | Fixed | Votes: | 35 |
| Labels: | SWDI |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
It would be useful to automatically detect hangs (due, for example, to software bugs), produce diagnostics such as complete stack traces for every thread, and possibly (depending on the degree of confidence that there is a hang) forcefully terminate the instance. |
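For illustration only, a minimal sketch (in Python, not the server's C++) of the kind of mechanism being requested: a watchdog thread that expects periodic heartbeats from worker threads, dumps complete stack traces for every thread when progress stalls, and forcefully terminates the process once it is confident enough that the stall is a real hang. All names, thresholds, and the exit value below are hypothetical.

```python
# Hypothetical sketch of in-process hang detection; not MongoDB code.
import faulthandler
import os
import sys
import threading
import time

HEARTBEAT_TIMEOUT = 30   # seconds without progress before dumping stacks (arbitrary)
KILL_AFTER = 3           # consecutive overdue checks before forcing termination (arbitrary)

last_heartbeat = time.monotonic()
heartbeat_lock = threading.Lock()

def heartbeat():
    """Worker threads call this whenever they make progress."""
    global last_heartbeat
    with heartbeat_lock:
        last_heartbeat = time.monotonic()

def watchdog():
    overdue = 0
    while True:
        time.sleep(HEARTBEAT_TIMEOUT)
        with heartbeat_lock:
            stalled = time.monotonic() - last_heartbeat > HEARTBEAT_TIMEOUT
        if not stalled:
            overdue = 0
            continue
        overdue += 1
        # Diagnostics: dump complete stack traces for every thread.
        faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
        if overdue >= KILL_AFTER:
            # High confidence this is a real hang: terminate forcefully.
            os._exit(1)

threading.Thread(target=watchdog, daemon=True).start()
```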
| Comments |
| Comment by Alyson Cabral (Inactive) [ 05/Feb/20 ] |
|
To reflect the improvements made by moving the storage node watchdog to community in 4.2, I'm closing this ticket. Please open specific server tickets about expanding types of failure checks or any additional improvements going forward. |
| Comment by Adrien Jarthon [ 13/Nov/19 ] |
|
|
| Comment by Danny Hatcher (Inactive) [ 13/Nov/19 ] |
|
You're right; I've opened |
| Comment by Adrien Jarthon [ 13/Nov/19 ] |
|
Oh ok great, that should probably be added to https://docs.mongodb.com/manual/reference/exit-codes/ then |
| Comment by Andy Schwerin [ 13/Nov/19 ] |
|
Code 61 is indeed the watchdog. The watchdog does no logging when it terminates the process, lest it get stuck trying to write to the dead disk. It's surprisingly tricky to maybe-log when the reason you might fail to log is a dead disk, so you really have to read the code to see the exact behavior.
For reference, here's a link to the definition of the exit code. |
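Since the watchdog writes nothing to the log when it fires, the exit status is effectively the only signal. Purely as a sketch (the exit-code value 61 comes from the comment above; the paths, options, and alerting mechanism are assumptions), a supervising wrapper could record it like this:

```python
# Illustrative supervisor: start mongod and record its exit status, since the
# storage node watchdog terminates the process without logging. Paths and
# options are examples only.
import subprocess
import sys

WATCHDOG_EXIT_CODE = 61  # value reported in this ticket for a watchdog-initiated shutdown

proc = subprocess.run([
    "mongod",
    "--config", "/etc/mongod.conf",
    "--setParameter", "watchdogPeriodSeconds=60",  # enables the watchdog (per later comments)
])

if proc.returncode == WATCHDOG_EXIT_CODE:
    # Alert through a channel that does not depend on the possibly-dead disk,
    # e.g. stderr captured by the service manager, or a network notification.
    print("mongod terminated by the storage node watchdog (exit 61)", file=sys.stderr)
sys.exit(proc.returncode)
```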
| Comment by Adrien Jarthon [ 13/Nov/19 ] |
|
Ok thanks! I managed to upgrade my primary to 4.0 and my secondary to 4.2 so that I could test the storage watchdog on the server with the faulty SSD (not replaced yet). I can still reproduce the dead-IO situation by adding the faulty SSD back into the RAID array, and I tried that with the watchdog on (60s), but unfortunately I couldn't confirm whether the watchdog detected anything or stopped mongo, because mongo died without logging anything, with exit code 61 (not documented). I suspect this is because the log file is also on the dead-IO disk (/var) and that caused some internal exceptions / timeouts. Is there any way to see metrics about the storage test? Any message to expect in the logs? What would be the exit status in this case? |
| Comment by Danny Hatcher (Inactive) [ 11/Nov/19 ] |
|
The Watchdog's logic (as of 4.2.1) is very simple. There are two threads, a "check" thread and a "monitor" thread. The "check" thread constantly writes to a new file and then reads from said file. If it succeeds, it increments a counter. The "monitor" thread runs every watchdogPeriodSeconds and looks at the counter. If the counter is ever the same across two runs of the "monitor" thread, that means we were unable to write to disk for at least the length of watchdogPeriodSeconds, and we intentionally shut down the server. |
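As a rough sketch of that pattern (the real implementation is C++ inside mongod; the file location, exit code, and synchronization details below are simplified assumptions):

```python
# Simplified sketch of the check/monitor pattern described above; not the
# actual mongod implementation.
import os
import tempfile
import threading
import time

watchdog_period_seconds = 60
check_counter = 0

def check_thread(directory):
    """Continuously write then read a probe file; bump the counter on success."""
    global check_counter
    probe = os.path.join(directory, "watchdog_probe")
    while True:
        with open(probe, "w") as f:
            f.write("probe")
            f.flush()
            os.fsync(f.fileno())
        with open(probe) as f:
            f.read()
        check_counter += 1

def monitor_thread():
    """Every period, verify the counter advanced; if not, I/O has stalled."""
    last_seen = check_counter
    while True:
        time.sleep(watchdog_period_seconds)
        if check_counter == last_seen:
            # No successful write/read for at least one full period:
            # intentionally shut the server down.
            os._exit(61)
        last_seen = check_counter

threading.Thread(target=check_thread, args=(tempfile.gettempdir(),), daemon=True).start()
threading.Thread(target=monitor_thread, daemon=True).start()
```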
| Comment by Adrien Jarthon [ 10/Nov/19 ] |
|
That is great news! I would love to try this (I still have the faulty SSD in the machine currently), though I'm currently on 3.6, so such a big upgrade in a short time span sounds quite dangerous. Is there any more detail about what the watchdog detects as "unresponsive"? If the IO is slow but working, for example, what is the test operation and the time threshold, if any? |
| Comment by Danny Hatcher (Inactive) [ 08/Nov/19 ] |
|
The Storage Node Watchdog is available in the Community version of the database as of 4.2.0. It's off by default, but if it had been enabled by setting the watchdogPeriodSeconds parameter, it might have caught the issue you most recently experienced. However, as Matt said in his last comment, we're still looking into better ways to catch these kinds of issues, so your feedback is much appreciated! |
| Comment by Adrien Jarthon [ 08/Nov/19 ] |
|
For the record, I just had the same issue again (SSD failure, mongo dead but doesn't know it), and thankfully the little script I wrote saved me from hours of downtime: it stopped mongo, which forced a failover and caused only about 10 minutes of downtime, which is long only because I try to stop mongo gracefully and that takes ages. |
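The script itself isn't attached to this ticket; purely as an illustration of what such an external check might look like (the data path, threshold, and use of systemctl are assumptions, not Adrien's actual script):

```python
# Hypothetical external I/O liveness probe: if a small fsync'd write to the
# data volume fails or does not finish in time, stop mongod so the replica set
# fails over to a healthy node.
import os
import subprocess
import threading
import time

DATA_DIR = "/var/lib/mongodb"   # assumed data path
TIMEOUT_SECONDS = 30            # assumed stall threshold

def disk_is_responsive(timeout=TIMEOUT_SECONDS):
    """Return False if the probe write errors out or hangs past the timeout."""
    result = {"ok": False}

    def probe():
        try:
            with open(os.path.join(DATA_DIR, ".io_probe"), "w") as f:
                f.write("x")
                f.flush()
                os.fsync(f.fileno())
            result["ok"] = True
        except OSError:
            pass

    t = threading.Thread(target=probe, daemon=True)
    t.start()
    t.join(timeout)  # a stalled disk simply never finishes the write
    return result["ok"]

if not disk_is_responsive():
    # Forcing a stop here trades a short, controlled failover for a long,
    # silent outage on a half-dead disk.
    subprocess.run(["systemctl", "stop", "mongod"], timeout=300)
```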
| Comment by Matt Lord (Inactive) [ 15/Apr/19 ] |
|
bigbourin@gmail.com, I'm very sorry to hear about the problems that this caused for you. Your post-mortem analysis is very helpful, so thank you for all of the details! Note that we're currently discussing short term methods that could offer some help in these specific types of cases where the I/O subsystem is stalled for long periods of time, while also discussing medium term plans to address the more general issue where a node is unable to perform meaningful work and cannot make progress for whatever reason. So your input here is very timely for those discussions. Thank you again! |
| Comment by Adrien Jarthon [ 15/Apr/19 ] |
|
Hi, to give a bit more details about the case I encountered:
|
| Comment by Adrien Jarthon [ 03/Apr/19 ] |
|
Hi, I just had another 6.5h outage due to an I/O issue (similar to
Do you have any plans to improve this in the community edition, or will this stay a rich-people privilege ( I'm also interested in whether you have a "recommended" way to monitor this from the outside to force a mongo failover. |
| Comment by Max Hirschhorn [ 18/Jul/17 ] |
|
After chatting with pasette about this ticket, I'm moving it over to the Platforms team to triage. |
| Comment by Bruce Lucas (Inactive) [ 08/Jul/17 ] |
|
daniel.hatcher, they are related, but (as written) |