[SERVER-29980] Built-in hang detection diagnostics and recovery Created: 05/Jul/17  Updated: 08/Jan/24  Resolved: 05/Feb/20

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: None
Fix Version/s: 4.2.0

Type: New Feature Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Alyson Cabral (Inactive)
Resolution: Fixed Votes: 35
Labels: SWDI
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-31457 Mongod stop responding, takes 200 loa... Closed
is related to SERVER-14139 Disk failure on one node can (eventua... Closed
is related to DOCS-13223 Add Storage Watchdog Exit Code Closed
Backwards Compatibility: Fully Compatible
Participants:
Case:

 Description   

It would be useful to automatically detect hangs (due, for example, to software bugs), produce diagnostics such as complete stack traces for every thread, and possibly (depending on the degree of confidence that there is a hang) forcefully terminate the instance.



 Comments   
Comment by Alyson Cabral (Inactive) [ 05/Feb/20 ]

To reflect the improvements made by moving the Storage Node Watchdog to Community in 4.2, I'm closing this ticket. Please open specific server tickets for expanding the types of failure checks or for any additional improvements going forward.

Comment by Adrien Jarthon [ 13/Nov/19 ]

Comment by Danny Hatcher (Inactive) [ 13/Nov/19 ]

You're right; I've opened DOCS-13223 to get it added.

Comment by Adrien Jarthon [ 13/Nov/19 ]

Oh OK, great. That should probably be added to https://docs.mongodb.com/manual/reference/exit-codes/ then.
So it did work fine in my case.

Comment by Andy Schwerin [ 13/Nov/19 ]

Code 61 is indeed the watchdog. The watchdog does no logging when it terminates the process, lest it get stuck trying to write to the dead disk. It's surprisingly tricky to attempt logging when the reason you might fail to log is a dead disk, so you really have to read the code to see the behavior.

 

For reference, here's a link to the definition of the exit code.

Comment by Adrien Jarthon [ 13/Nov/19 ]

Ok, thanks! I managed to upgrade my primary to 4.0 and my secondary to 4.2 in order to test the storage watchdog on my server with the faulty SSD (not replaced yet), since I can still reproduce the dead-I/O situation by adding the faulty SSD back into the RAID array. I tried that with the watchdog on (60s), but unfortunately I wasn't able to confirm whether the watchdog detected anything or stopped mongo, because mongod died without logging anything and with an exit code of 61 (not documented). I suspect this is because the log file is also on the dead-I/O disk (/var) and this caused some internal exceptions/timeouts. Is there any way to see metrics about the storage test? Any message to expect in the logs? What would be the exit status in this case?

Comment by Danny Hatcher (Inactive) [ 11/Nov/19 ]

The Watchdog's logic (as of 4.2.1) is very simple. There are two threads, a "check" thread and a "monitor" thread. The "check" thread constantly writes to a new file and then reads from that file; each time it succeeds, it increments a counter. The "monitor" thread runs every watchdogPeriodSeconds and looks at the counter. If the counter is ever the same across two consecutive runs of the "monitor" thread, that means the node was unable to complete a write to disk for at least watchdogPeriodSeconds, and we intentionally shut down the server.
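
A minimal sketch of that check/monitor pattern (in Python purely for illustration; the real implementation is in the C++ server, the probe path and one-second check cadence are assumptions, and exit code 61 comes from the discussion elsewhere in this ticket):

```python
# Illustrative sketch only, not the actual mongod watchdog implementation.
import os
import threading
import time

check_generation = 0  # bumped by the "check" thread after every successful write+read


def check_thread(probe_path: str) -> None:
    """Continuously write a probe file, read it back, and record progress."""
    global check_generation
    while True:
        with open(probe_path, "w") as f:
            f.write("watchdog probe\n")
            f.flush()
            os.fsync(f.fileno())       # force the write to hit the disk
        with open(probe_path) as f:
            f.read()                   # read the probe back
        check_generation += 1          # progress marker the monitor inspects
        time.sleep(1)


def monitor_thread(watchdog_period_seconds: int) -> None:
    """Wake every period; if the counter did not move, I/O is stuck."""
    last_seen = check_generation
    while True:
        time.sleep(watchdog_period_seconds)
        if check_generation == last_seen:
            # No write+read completed for at least one full period. Exit
            # without logging, since the log file may sit on the same dead disk.
            os._exit(61)
        last_seen = check_generation


if __name__ == "__main__":
    threading.Thread(target=check_thread, args=("/tmp/watchdog_probe",), daemon=True).start()
    monitor_thread(60)  # watchdogPeriodSeconds = 60
```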

Comment by Adrien Jarthon [ 10/Nov/19 ]

That is great news! I would love to try this (I still have the faulty SSD in the machine), though I'm currently on 3.6, so such a massive upgrade in a short time span sounds quite dangerous. Is there any more precision about what the watchdog detects as "unresponsive"? If I/O is slow but still working, for example, what is the test operation and the time threshold, if any?

Comment by Danny Hatcher (Inactive) [ 08/Nov/19 ]

The Storage Node Watchdog is available in the Community version of the database as of 4.2.0. It's off by default, but if it had been enabled by setting the watchdogPeriodSeconds parameter, it's possible it would have caught the issue you most recently experienced. However, as Matt said in his last comment, we're still looking into better ways to catch these kinds of issues, so your feedback is much appreciated!
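
For reference, the watchdog is enabled at mongod startup by setting that parameter; a minimal example (60 seconds being, as far as the documentation goes, the smallest allowed period):

```
mongod --setParameter watchdogPeriodSeconds=60
```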

Comment by Adrien Jarthon [ 08/Nov/19 ]

For the record, I just had the same issue again (SSD failure, mongo dead but doesn't know it), and thankfully the little script I wrote saved me from hours of downtime: it stopped mongo, which forced a failover, and only caused about 10 minutes of downtime, which is long only because I try to stop mongo gracefully and that takes ages.

Comment by Matt Lord (Inactive) [ 15/Apr/19 ]

bigbourin@gmail.com, I'm very sorry to hear about the problems that this caused for you. Your post-mortem analysis is very helpful, so thank you for all of the details!

Note that we're currently discussing short-term methods that could offer some help in these specific types of cases where the I/O subsystem is stalled for long periods of time, while also discussing medium-term plans to address the more general issue where a node is unable to perform meaningful work and cannot make progress for whatever reason. So your input here is very timely for those discussions. Thank you again!

Comment by Adrien Jarthon [ 15/Apr/19 ]

Hi, to give a bit more details about the case I encountered:

  • In my case the SSDs were unresponsive to writes only (they were still fast for reads). They were still responding, but more than 100 times slower, which creates delays of sometimes several minutes to execute a simple command (like db.getReplicationInfo). This is amplified by mongo writing its logs, I guess.
  • When this happened I tried to do a stepDown manually but had two issues: once I waited 5 minutes and the command never responded, so I had to Ctrl+C it (probably waiting on disk); another time it kept returning after 10 seconds with an error saying the secondary isn't up to date, whereas it was as up to date as possible with the primary (but the primary was lagging, so maybe this breaks your condition?). I had to force the stepDown to make it happen. Basically the primary and secondary were both at the same point in time, which was a couple of hours in the past.
  • I made a little script to detect these scenarios and try to force a shutdown on my side. I use db.getReplicationInfo() as a command executed periodically, and if it takes more than 2 seconds to answer up to 5 times (10s between attempts), I consider the mongod to be in a bad state and stop it (I don't even try the stepDown as it's not reliable). I've tried this process once with my current setup and it worked; a sketch of this approach follows this list.
  • I still have one faulty SSD (not replaced yet) in my server, so I can try some patches or commands if you want to see how it behaves in this situation. I can toggle the problem on and off by adding/removing the faulty disk to/from the RAID array.
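
A rough sketch of the external health-check approach described in the list above, assuming pymongo and a systemd-managed service named mongod; the probe used here (a ping plus a read of the newest oplog entry) is a stand-in for the comment's db.getReplicationInfo(), and the thresholds mirror the ones given in the comment:

```python
# Hypothetical external health check, not the actual script from the comment.
import subprocess
import time

import pymongo

TIMEOUT_SECS = 2      # a probe slower than this counts as "slow"
MAX_SLOW = 5          # consecutive slow probes before stopping mongod
CHECK_INTERVAL = 10   # seconds between probes


def probe_is_slow(client: pymongo.MongoClient) -> bool:
    """Return True if a cheap probe errors out or exceeds TIMEOUT_SECS."""
    start = time.monotonic()
    try:
        # Stand-in for db.getReplicationInfo(): touch the server and the oplog.
        client.admin.command("ping")
        client.local["oplog.rs"].find_one(sort=[("$natural", -1)])
    except pymongo.errors.PyMongoError:
        return True
    return (time.monotonic() - start) > TIMEOUT_SECS


def main() -> None:
    client = pymongo.MongoClient(
        "localhost", 27017,
        serverSelectionTimeoutMS=TIMEOUT_SECS * 1000,
        socketTimeoutMS=30000,  # bound hangs against a stuck server
    )
    slow = 0
    while True:
        slow = slow + 1 if probe_is_slow(client) else 0
        if slow >= MAX_SLOW:
            # Stop the local mongod to force a replica-set failover.
            subprocess.run(["systemctl", "stop", "mongod"], check=False)
            slow = 0
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()
```
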
Comment by Adrien Jarthon [ 03/Apr/19 ]

Hi, I just had another 6.5h outage due to an I/O issue (similar to SERVER-31457 or SERVER-14139), but this time I was able to SSH into the machine, so I confirmed one disk was failing, which made the RAID array 200 times slower than usual. Mongo stopped responding without failing over to the secondary as expected, unfortunately. The funny thing is that I got alerts from Cloud Manager saying the node was not responding, but the replica-set heartbeat still considered it OK and it stayed primary...

 

Do you have any plans to improve this in the Community edition, or will this stay a rich-people privilege (SERVER-29947)?

I'm also interested in whether you have a "recommended" way to monitor this from the outside to force a mongo failover.

Comment by Max Hirschhorn [ 18/Jul/17 ]

After chatting with pasette about this ticket, I'm moving it over to the Platforms team to triage.

Comment by Bruce Lucas (Inactive) [ 08/Jul/17 ]

daniel.hatcher, they are related, but (as written) SERVER-14139 is narrower.
