[SERVER-31457] Mongod stops responding, takes 200 load and doesn't even switch to secondary Created: 08/Oct/17  Updated: 07/Nov/17  Resolved: 10/Oct/17

Status: Closed
Project: Core Server
Component/s: Stability, WiredTiger
Affects Version/s: 3.4.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Adrien Jarthon Assignee: Bruce Lucas (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: SWNA
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File cpu-time.png     PNG File disk-usage.png     PNG File mongodb-incident-1.png     PNG File mongodb-incident-2.png    
Issue Links:
Related
is related to SERVER-29980 Built-in hang detection diagnostics a... Closed
is related to SERVER-29947 Implement Storage Node Watchdog Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Impossible to know of course

Participants:

 Description   

Hello,

I just suffered a rather bad issue with mongod 3.4.5 (WT). The requests were totally normal, nothing out of the ordinary, and suddenly it started taking up to 200 load on my server, and of course all 8 CPUs:

At this point the server stopped responding to any request, but it seems it kept pinging the secondaries and syncing, as it stayed primary until I manually changed the priority from the secondary (I couldn't even SSH into the primary as the load was killing the machine).

As I have had numerous problems of this kind in the past due to various performance issues in WT, cache eviction, etc. (SERVER-27700), I tried to let it rest to see if it would recover, but after 3 hours I had to hard reboot the server to get it back...

I checked the logs after the reboot and there was not a single line of log during the 3 hours, and the ones before the crash show nothing weird to me. I collected the diagnostic dir; I can give it to you (along with the last hour of logs) if you send me your usual upload link.

If you can access my MongoDB Cloud Manager stats, the project id is 5012a0ac87d1d86fa8c22e64; otherwise I can give you some screenshots, but there's nothing very interesting, as these charts were all totally normal until the agent stopped collecting data.

Thanks for your help



 Comments   
Comment by Bruce Lucas (Inactive) [ 10/Oct/17 ]

Hi Adrien,

OK, I'll close this ticket as you requested. Sorry we weren't able to diagnose the problem, and please do let us know if it occurs again.

Bruce

Comment by Adrien Jarthon [ 09/Oct/17 ]

Ok, great, at least I'm not the first one to hit a probably similar issue (SERVER-14139), and some measures have already been taken (SERVER-29947) ☺

Unfortunately this feature is for MongoDB Enterprise, so I can't use it ^^ It's sad to think most clusters can die on a (not so uncommon) IO failure, but I understand some efforts must go toward financing the future of MongoDB. Though it's definitely not making me more willing to pay for MongoDB someday if I can't trust it.

I'll start watching SERVER-29980 as this could improve the overall stability of the system and would also have helped with software issues I encountered like SERVER-27700.

Thanks for the gdb command; I would of course have tried this (and strace, etc.) if only I had had access to the machine, yes. You can close this ticket then; I don't expect this issue again anytime soon, as it had been running for a couple of years on this server. But if I ever manage to get more details I'll let you know.

Comment by Bruce Lucas (Inactive) [ 09/Oct/17 ]

Hi Adrien,

Thanks for the additional info. Regarding your question "where's the separation between system and iotime", I believe the graph labeled "i/o" probably refers to the CPU iowait statistic that I mentioned above, and would (if present) mean that the CPUs are not busy because they are waiting for i/o.

Instead we see a high utilization of system (kernel) CPU time. Generally I would not expect this to be i/o as such, although it is possible that it is an issue in mongod that is triggered by an i/o problem. We have seen high kernel utilization by mongod in some cases in the past, such as WT-3345, where WiredTiger was doing excessive spinning while waiting for something. That particular issue is fixed in the version you are running, but it's possible there is an additional undiagnosed issue of this type.
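
If this does recur and the machine is reachable, a short perf sample against mongod would show where that kernel time is going. This is just a generic sketch (it assumes the perf tool from linux-tools is installed and can be run as root), not anything specific to this ticket:

# Sample on-CPU stacks of mongod for 30 seconds
perf record -g -p $(pidof mongod) -- sleep 30
# Summarize; lots of time under futex, scheduler, or vfs_* functions would point at kernel-side spinning or I/O stalls
perf report --stdio | head -n 50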

Regarding robustness against storage issues, we recently introduced a new storage watchdog feature that would allow the node to detect a storage problem and fail over to another node; see SERVER-29947 for details. You might wish to enable this if you suspect this is a storage issue and it could occur again.
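
For example, it is enabled with a single startup parameter giving the check interval in seconds (Enterprise builds only; please double-check the exact parameter name and minimum value against the current documentation):

# Storage Node Watchdog, MongoDB Enterprise only; interval in seconds
mongod --config /etc/mongod.conf --setParameter watchdogPeriodSeconds=60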

We also have an open ticket SERVER-29980 to consider how to provide additional robustness by detecting hangs of any kind, but that is a much larger undertaking, not the least of which is how to define and detect a "hang".

Finally, if this were to occur again it would be very valuable to have stack traces from the mongod process, using the following command:

gdb -ex "thread apply all bt" --batch --pid $(pidof mongod) >stacks.txt

I realize however this could be problematic if there is no way to ssh to the machine during the incident.
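
One workaround, sketched very roughly below, is to leave a small script running in advance so the stacks get captured automatically when the load spikes, without needing to ssh in during the incident (the paths and load threshold here are arbitrary, and if the underlying problem is a storage hang, writing the output may itself block):

#!/bin/bash
# Capture mongod stacks whenever the 1-minute load average exceeds a threshold.
# Assumes gdb and pidof are installed and that this runs as root (e.g. from a systemd unit).
THRESHOLD=50
OUT=/var/tmp/mongod-stacks
mkdir -p "$OUT"
while sleep 30; do
    load=$(cut -d' ' -f1 /proc/loadavg)
    if awk -v l="$load" -v t="$THRESHOLD" 'BEGIN { exit !(l > t) }'; then
        ts=$(date +%Y%m%d-%H%M%S)
        gdb -ex "thread apply all bt" --batch --pid $(pidof mongod) > "$OUT/stacks-$ts.txt" 2>&1
    fi
done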

With the information we have available at this time I don't see any way forward with a diagnosis; if you would like we can leave this ticket open in "waiting for user input" state so that you can provide whatever additional information you can collect if there is a re-occurrence.

Thanks,
Bruce

Comment by Adrien Jarthon [ 09/Oct/17 ]

I checked the other charts I have from nixstats (which managed to monitor the machine during most of that time); here is the CPU time:

It seems to be system time, not iotime. But I'm not sure where the separation between system and iotime is, as io can stem from a lot of different syscalls.

We can also see the disk usage, which dropped to almost 0 (the chart shows a couple of writes/sec if I hover over it), but only mongod uses the disk so that makes sense.

I agree it's weird that I couldn't access ssh or anything else during the incident, and I find it very strange that not a single application was able to log anything whereas nixstats was still reporting data. It looks to me as if there was indeed an IO error on the machine, and maybe sshd blocked on some log write when I tried to connect. nixstats may have kept working because it doesn't write a single line to disk; atop for example (which I trust to be much more robust than nixstats) didn't log any data after the incident, and when I try to navigate to the time of the incident in the history file, it crashes ☹ so the file is most probably corrupted.

I can't check much with my system provider; it's a dedicated machine and it's working fine now, so they won't find anything.
I'll try to see if I can find any way to detect this kind of issue with more accuracy.

In the meantime, there's still something not very pleasant about the way mongod behaved when faced with this issue, as it kept pinging the secondary, which prevented the failover even though it was totally stuck for more than 10 minutes. Though I'm not sure how to improve this; it sounds a bit silly to force an IO operation on ping just to make sure IO is working properly.
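
If I wanted to approximate such a check outside of mongod, I suppose a crude external probe could do it; something like this sketch (paths, interval and timeout made up), which writes and syncs a tiny file on the data volume and complains if that stalls:

#!/bin/bash
# Crude external storage probe (illustration only): write + sync a small file on the data
# volume and complain if it takes too long.
PROBE=/var/lib/mongodb/.storage-probe
TIMEOUT=10
while sleep 30; do
    if ! timeout "$TIMEOUT" dd if=/dev/zero of="$PROBE" bs=4k count=1 oflag=sync status=none; then
        # If the write blocks in uninterruptible sleep even the timeout may hang,
        # which is itself a signal that storage is stuck.
        logger -p user.err "storage probe exceeded ${TIMEOUT}s on $PROBE"
    fi
done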

Comment by Bruce Lucas (Inactive) [ 09/Oct/17 ]

Hi Adrien,

Thanks for uploading the data. Unfortunately we are not finding any information about what went wrong - the data looks normal right up until the point where data collection stops, just as you noted with the Cloud metrics.

It seems unusual that you were completely unable to ssh to the machine, even if it was under heavy load. Note for example that the Ruby process was still able to get a moderate portion of the CPU, so it was not completely starved out by the mongod process. Can you say more about what it was doing during this time? This makes it seem unlikely that sshd was unresponsive just because of heavy CPU load.

Is it possible that the reported mongod CPU utilization was actually iowait? This represents time that a CPU could not be used because threads were waiting for i/o. It is reported as a CPU time, but it is actually a kind of idle time. I believe threads waiting for i/o also count towards the load average. There's a good article on the subject here. Do you have available metrics that break out CPU utilization by user, kernel, iowait, etc.?
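
If the machine is reachable during an incident, something like the following gives that breakdown directly (mpstat comes with the sysstat package; just a quick sketch):

# Per-CPU breakdown of %usr / %sys / %iowait / %idle, one sample per second
mpstat -P ALL 1
# Or, without sysstat, the raw aggregate counters (user nice system idle iowait irq softirq ...):
head -n 1 /proc/stat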

A system-level storage problem could possibly affect mongod in a way consistent with the data, and also prevent sshd from accepting connections. Can you check with your service provider about this?

Thanks,
Bruce

Comment by Bruce Lucas (Inactive) [ 08/Oct/17 ]

Thanks Adrien, we will take a look at the data.

Comment by Adrien Jarthon [ 08/Oct/17 ]

Hi Bruce,

Sure, I uploaded:

  • The mongodb log for the hour of the incident (19h to the last written line at 19h50)
  • The syslog file for the same period + the next line 4.5 hours later
  • The diagnostic.data archive

The charts are in CEST (GMT+2) timezone, so the incident started at 5:50 pm UTC and the hardware reboot occurred at 10:22 pm.
Well, it's the first time I've encountered this specific issue, but I had a similar one a few months back, where mongod would put my machine at 1000 load (yep) but without actually using the CPU (< 1 core used). This was very weird, as if 1000 threads were waiting on something but not consuming resources; I'm not even sure how that's possible. And that time it properly failed over to the secondary; I restarted the process and all was good. In the log file I saw some kind of critical exception (segfault or similar), but only when I stopped the server, not when the problem started. I didn't report the issue back then because I deleted the data dir to re-sync the server and forgot to copy the diagnostic.data dir, so it would have been hard for you to diagnose.

About the upgrades: I upgraded the primary to 3.4.9 tonight after the incident, but otherwise it had been running 3.4.5 since June 20th, and it was previously on 3.4.4 since May 20th.

Let me know if there's anything else I can help you with!

Comment by Bruce Lucas (Inactive) [ 08/Oct/17 ]

Hi Adrien,

Thanks for the report, and sorry that you encountered this problem.

Can you please upload an archive of diagnostic.data and (if possible) the entire mongod log file covering the time of the issue here.

Also, can you please tell us:

  • What is the timezone of the times in the charts that you posted?
  • Is this the first time you've encountered this specific issue?
  • What is the recent mongod upgrade history of this node?
  • Are there any messages in syslog (/var/log/messages or /var/log/syslog) around the time of the incident?

Thanks,
Bruce
