[SERVER-34190] MongoDB process hangs after some random time
Created: 29/Mar/18 | Updated: 24/Apr/18 | Resolved: 10/Apr/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 3.4.13, 3.4.14, 3.6.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Piotr Klimek | Assignee: | Kelsey Schubert |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Steps To Reproduce: | Not found yet. |
| Description |
|
We have a problem with a MongoDB (3.6.3) PRIMARY server. After some time it gets into a state where it is still PRIMARY but is not accepting connections. Because it keeps the PRIMARY role, our app crashes. Restarting MongoDB on the PRIMARY server helps, and everything goes back to normal. We are hosting MongoDB on Amazon on 3 Ubuntu m5.4xlarge instances with 3000 IOPS EBS volumes. During the crash we have ~30% more connections to MongoDB than usual, but they are still far below the limits and far below the fs.file-max setting, which is set to 6430188. No other metric looks suspicious: RAM, CPU, disk, and network usage are at the same level as just before the crash and right after the restart of the PRIMARY. We have already migrated MongoDB from 3.4.14 to 3.6.3, and the problem still occurs every 1-2 days. We have also changed the priority of the PRIMARY server and moved this role to another host, so the problem is not tied to any specific machine. There is nothing interesting in the logs. Here is the output of some commands that we ran while the server was in the unresponsive state:
Any idea what else we should check to debug it? |
| Comments |
| Comment by Kelsey Schubert [ 10/Apr/18 ] |
|
Hi pklimek, Thank you for the stacks. After review, it appears that you are encountering Kind regards, |
| Comment by Piotr Klimek [ 04/Apr/18 ] |
|
Hi, We have already changed:
All these changes have only led to another crash, and the crashes are still happening every 24-48 hours. |
| Comment by Kelsey Schubert [ 04/Apr/18 ] |
|
Hi pklimek, Unfortunately, to determine the root cause of this behavior we'll need the stack traces generated by copying and pasting the following command while the mongod is hung:
I'm not sure exactly what occurred with the previous invocation of gdb. It's possible that it was interrupted. Would you please try to collect gdb stack traces again? Please use the command that I have provided without modification (unless there are multiple running mongods), as the date timestamp lets us know when the command was executed relative to the other diagnostic information provided. The complete log files would also help give us more context, but I'm afraid that without the gdb stack traces, we'll struggle to determine the root cause. Thank you, |
| Comment by Piotr Klimek [ 31/Mar/18 ] |
|
I'm sorry, somehow I missed this. The tar archive is already uploaded. |
| Comment by Kelsey Schubert [ 31/Mar/18 ] |
|
Thanks pklimek, we'll take a look. Could you also provide an archive of the diagnostic.data as I previously requested? Thanks again, |
| Comment by Piotr Klimek [ 31/Mar/18 ] |
|
The server crashed again during the night. I have the logs from GDB (I've uploaded them to the same place as the rest of the logs), but they look weird. It's 53k lines of this:
Of course the numbers change, but there is nothing more than that. |
| Comment by Piotr Klimek [ 30/Mar/18 ] |
|
Hello Kelsey,
As you can see, there is a 30-second gap in the logs, which is quite strange on a heavily loaded production database. After this gap, only new connections are logged until the database is restarted. |
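A gap like the one described above can also be located mechanically rather than by eye. The following is a minimal sketch, assuming each log line starts with an ISO 8601 timestamp such as `2018-03-30T02:14:07.123+0000`; the function name `find_log_gaps`, the 10-second threshold, and the sample lines are hypothetical and are not taken from the uploaded logs.

```python
from datetime import datetime, timedelta

# Assumed layout of the timestamp at the start of each log line,
# e.g. "2018-03-30T02:14:07.123+0000 I NETWORK ...".
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def find_log_gaps(lines, min_gap=timedelta(seconds=10)):
    """Return (previous, current, gap) for consecutive timestamps at least min_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # blank line
        try:
            current = datetime.strptime(parts[0], TS_FORMAT)
        except ValueError:
            continue  # line does not start with a timestamp
        if previous is not None and current - previous >= min_gap:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "2018-03-30T02:14:07.100+0000 I COMMAND [conn1] made-up entry",
        "2018-03-30T02:14:07.900+0000 I COMMAND [conn2] made-up entry",
        "2018-03-30T02:14:38.000+0000 I NETWORK [listener] made-up entry",
    ]
    for start, end, gap in find_log_gaps(sample):
        print("gap of", gap, "between", start.isoformat(), "and", end.isoformat())
```

To scan a real file, pass an open file object (for example `find_log_gaps(open(path))`), since the function only iterates over lines. The threshold is a judgment call: on a busy node even a few seconds of silence may be worth reporting.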
| Comment by Kelsey Schubert [ 29/Mar/18 ] |
|
Hi pklimek, Thank you for reporting this issue. Would you please upload the logs and an archive of the diagnostic.data directory from an affected node? I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time. If this diagnostic information is insufficient to identify the root cause, my next recommendation would be to collect gdb using the following command:
To speed our investigation, would you please collect gdb from the unresponsive node the next time this issue occurs and upload the file? Thank you for your help, |
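The exact command from this comment is not preserved in this export, and nothing below should be read as a substitute for it. Purely as an illustration of the general technique the thread describes (attach gdb to the hung process, dump a backtrace for every thread, and record the capture time in the output file name), here is a hedged sketch. The helper names `gdb_backtrace_argv` and `timestamped_output_name`, the `gdb_` file prefix, and the timestamp layout are all assumptions; only the gdb options `-p`, `-batch`, and `-ex` and the gdb command `thread apply all bt` are standard.

```python
import subprocess
import sys
from datetime import datetime


def gdb_backtrace_argv(pid):
    """Build an argv that attaches gdb to pid, prints every thread's backtrace, and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def timestamped_output_name(now, prefix="gdb"):
    """Put the capture time in the file name so it can be lined up with other diagnostics."""
    return "{}_{}.txt".format(prefix, now.strftime("%Y-%m-%d_%H-%M-%S"))


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: collect_stacks.py <pid of the hung mongod>")
    else:
        out_name = timestamped_output_name(datetime.now())
        with open(out_name, "w") as out:
            # Attaching to another process usually requires elevated privileges.
            subprocess.run(gdb_backtrace_argv(int(sys.argv[1])),
                           stdout=out, stderr=subprocess.STDOUT, check=False)
        print("wrote", out_name)
```

When working with support on an open ticket, run the command they supply unmodified, as requested in the thread; the sketch only shows why a timestamp in the file name is useful.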