[SERVER-31457] Mongod stops responding, takes a 200 load and doesn't even switch to secondary Created: 08/Oct/17 Updated: 07/Nov/17 Resolved: 10/Oct/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability, WiredTiger |
| Affects Version/s: | 3.4.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Adrien Jarthon | Assignee: | Bruce Lucas (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | SWNA | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | Impossible to know of course |
| Participants: | |
| Description |
|
Hello, I just suffered a quite bad issue with mongod 3.4.5 (WT). The requests were totally normal, nothing out of the ordinary, and suddenly it started taking up to a 200 load average on my server, saturating all 8 CPUs. At this point the server stopped responding to any request, but it seems it kept pinging the secondaries and syncing, as it stayed primary until I manually changed the priority from the secondary (I couldn't even SSH to the primary as it was killing the machine). As I had numerous problems of this kind in the past due to various performance issues in WT, cache eviction, etc., I checked the logs after the reboot and there was not a single log line during those 3 hours, and the lines before the crash look normal to me. I collected the diagnostic directory; I can give it to you (and the last hour of logs) if you send me your usual upload link. If you can access my MongoDB Cloud Manager stats, the project id is 5012a0ac87d1d86fa8c22e64; otherwise I can give you some screenshots, but there's nothing very interesting as these charts were all totally normal until the agent stopped collecting data. Thanks for your help |
| Comments |
| Comment by Bruce Lucas (Inactive) [ 10/Oct/17 ] | |
|
Hi Adrien, OK, I'll close this ticket as you requested. Sorry we weren't able to diagnose the problem, and please do let us know if it occurs again. Bruce | |
| Comment by Adrien Jarthon [ 09/Oct/17 ] | |
|
Ok great, at least I'm not the first one to hit a probably similar issue. Unfortunately that feature is MongoDB Enterprise only, so I can't use it ^^ it's sad to think most clusters can die on a (not so uncommon) IO failure, but I understand some efforts must go toward financing the future of MongoDB. I'll start watching the open ticket you mentioned. Thanks for the gdb command; I would have tried this of course (and strace, etc.) if only I had access to the machine, yes. You can close this ticket then, as I don't expect this issue again anytime soon; the server had been running fine for a couple of years. But if I ever manage to get more details I'll let you know. | |
| Comment by Bruce Lucas (Inactive) [ 09/Oct/17 ] | |
|
Hi Adrien, Thanks for the additional info. Regarding your question "where's the separation between system and iotime", I believe the graph labeled "i/o" probably refers to the CPU iowait statistic that I mentioned above, and would (if present) mean that the CPUs are not busy because they are waiting for i/o. Instead we see high utilization of system (kernel) CPU time. Generally I would not expect this to be i/o as such, although it is possible that it is an issue in mongod that is triggered by an i/o problem. We have seen high kernel utilization by mongod in some cases in the past. Regarding robustness against storage issues, we recently introduced a new storage watchdog feature that would allow the node to detect a storage problem and fail over to another node. We also have an open ticket in this area that you may want to watch. Finally, if this were to occur again it would be very valuable to have stack traces from the mongod process, using the following command:
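A representative form of such a command (a sketch assuming gdb is installed and mongod's PID is discoverable with pidof; the exact flags may differ from the original suggestion):

```sh
# Capture backtraces of every thread in the running mongod process,
# then detach without stopping it. Requires gdb and sufficient privileges.
gdb --batch -p "$(pidof mongod)" \
    -ex "set pagination off" \
    -ex "thread apply all bt" \
    -ex "detach" \
    -ex "quit" > mongod-stacks.txt 2>&1
```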
I realize, however, that this could be problematic if there is no way to ssh to the machine during the incident. With the information we have available at this time I don't see any way forward with a diagnosis; if you would like, we can leave this ticket open in the "waiting for user input" state so that you can provide whatever additional information you can collect if there is a recurrence. Thanks, | |
| Comment by Adrien Jarthon [ 09/Oct/17 ] | |
|
I checked the other charts I have from nixstats (which managed to monitor the machine during most of the incident); here is the CPU time chart. We can also see the disk usage, which dropped to almost 0 (the chart shows a couple of writes/sec if I hover over it), but only mongod uses that disk so it makes sense. I agree it's weird that I couldn't access SSH or anything else during the incident, and I find it very strange that not a single application was able to log anything whereas nixstats was still reporting data. It looks to me as if there was indeed an IO error on the machine, and maybe sshd blocked on some log writing when I tried to connect. nixstats may have kept working because it doesn't write a single line to disk; atop for example (which I trust to be much more robust than nixstats) didn't log any data after the incident, and when I try to navigate to the time of the incident in its history file, it crashes ☹ so the file is most probably corrupted. I can't check much with my system provider; it's a dedicated machine and it's working fine now, so they won't find anything. In the meantime, there's still something not very pleasant about the way mongod behaved when faced with this issue, as it kept pinging the secondary, which prevented the failover even though it was totally stuck for more than 10 minutes. Though I'm not sure how to improve this; it sounds a bit silly to force an IO operation on ping just to make sure IO is working properly. | |
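For illustration only, a minimal sketch of the kind of periodic storage probe being discussed (this is not how mongod or the storage watchdog is implemented; the dbpath /var/lib/mongodb, the 5-second interval, and the 10-second threshold are all assumptions):

```sh
#!/bin/sh
# Crude stuck-storage detector: periodically write and sync a small file
# in the data directory and complain if the write takes too long.
# Note: if the write blocks indefinitely, the probe itself will hang,
# which is also a signal that storage is stuck.
DBPATH=/var/lib/mongodb               # assumed dbpath; adjust for your deployment
THRESHOLD_NS=$((10 * 1000000000))     # 10 seconds, in nanoseconds

while true; do
    start=$(date +%s%N)
    # oflag=dsync forces the write to reach the device before dd returns
    dd if=/dev/zero of="$DBPATH/.storage_probe" bs=4k count=1 oflag=dsync 2>/dev/null
    elapsed=$(( $(date +%s%N) - start ))
    if [ "$elapsed" -gt "$THRESHOLD_NS" ]; then
        echo "$(date): storage probe took ${elapsed} ns - disk may be stuck" >&2
    fi
    sleep 5
done
```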
| Comment by Bruce Lucas (Inactive) [ 09/Oct/17 ] | |
|
Hi Adrien, Thanks for uploading the data. Unfortunately we are not finding any information about what went wrong: the data looks normal right up until the point where data collection stops, just as you noted with the Cloud metrics. It seems unusual that you were completely unable to ssh to the machine, even if it was under heavy load. Note for example that the Ruby process was still able to get a moderate portion of the CPU, so it was not completely starved out by the mongod process. Can you say more about what it was doing during this time? This makes it seem unlikely that sshd was unresponsive just because of heavy CPU load. Is it possible that the reported mongod CPU utilization was actually iowait? This represents time that a CPU could not be used because threads were waiting for i/o. It is reported as a CPU time, but it is actually a kind of idle time. I believe threads waiting for i/o also count towards the load average. There's a good article on the subject here. Do you have available metrics that break out CPU utilization by user, kernel, iowait, etc.? If there were a system-level storage problem, that could possibly affect mongod in a way consistent with the data, and also prevent sshd from accepting connections. Can you check with your service provider about this? Thanks, | |
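As a practical aside, a per-category CPU breakdown of the kind requested above can usually be captured on Linux with standard tools (a sketch assuming the sysstat package is installed):

```sh
# Per-CPU breakdown of user, system, iowait, idle, etc., sampled every second
mpstat -P ALL 1

# One-line-per-second summary, including the 'wa' (iowait) column
vmstat 1

# Per-device I/O latency and utilization, useful alongside the CPU numbers
iostat -x 1
```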
| Comment by Bruce Lucas (Inactive) [ 08/Oct/17 ] | |
|
Thanks Adrien, we will take a look at the data. | |
| Comment by Adrien Jarthon [ 08/Oct/17 ] | |
|
Hi Bruce, Sure, I uploaded the diagnostic.data archive and the last hour of the mongod log covering the incident.
The charts are in the CEST (GMT+2) timezone, so the incident started at 5:50 pm UTC and the hardware reboot occurred at 10:22 pm. About the upgrades: I upgraded the primary to 3.4.9 tonight after the incident, but otherwise it had been running 3.4.5 since June 20th, and was previously on 3.4.4 since May 20th. Let me know if there's anything else I can help you with! | |
| Comment by Bruce Lucas (Inactive) [ 08/Oct/17 ] | |
|
Hi Adrien, Thanks for the report, and sorry that you encountered this problem. Can you please upload an archive of diagnostic.data and (if possible) the entire mongod log file covering the time of the issue here. Also, can you please tell us when the incident started (and the timezone of your charts), and whether this node has recently been upgraded or otherwise changed. Thanks, | |
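For reference, a minimal way to package the requested files for upload (a sketch assuming the default Linux package paths /var/lib/mongodb and /var/log/mongodb/mongod.log; substitute the actual dbpath and log path for this deployment):

```sh
# Bundle the FTDC diagnostic data and the mongod log covering the incident.
# The paths below are assumed defaults, not values confirmed in this ticket.
tar czf mongod-diagnostics.tar.gz \
    /var/lib/mongodb/diagnostic.data \
    /var/log/mongodb/mongod.log
```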