[SERVER-34487] mongod server does not accept new connections after random time period Created: 15/Apr/18 Updated: 15/Sep/18 Resolved: 23/Aug/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Sharding, Stability |
| Affects Version/s: | 3.4.14 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Gal Ben David | Assignee: | Kelsey Schubert |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
Our mongo cluster consists of 6 servers, each running 1 config server, 1 mongos, and 1 mongod.
Each server runs the following commands at startup:
The mongos log looks like this:
I tried to connect to the mongod directly and got a connection timeout as well. I'm attaching an strace of the mongod process, taken from inside the Docker container: |
| Comments |
| Comment by Kelsey Schubert [ 22/Aug/18 ] |
|
Hi benshu, Thanks for providing the additional detail. After reviewing the gdb output, this appears to be a system issue beneath mongod. We can see that a write is stuck at the system level:
From an analysis of the gdb output, there are 441 threads piled up behind this operation, waiting to write to the log. Consequently, as soon as a new connection is established, it hangs waiting for the logger. I'd recommend reviewing syslogs or dmesg for clues about the root cause of this behavior. Kind regards, |
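The pile-up described in this comment can be illustrated with a toy model. This is only a minimal sketch, not mongod's logging code: `SharedLogger`, `simulate_pileup`, and the `sink_ready` event are hypothetical names, and the stuck system-level write is simulated with a `threading.Event` that stays unset. It shows why every new connection handler hangs once one log write blocks while holding the logger's mutex.

```python
import threading
from typing import List, Tuple


class SharedLogger:
    """Toy logger (hypothetical): one mutex serializes every write to one sink."""

    def __init__(self, sink_ready: threading.Event) -> None:
        self._lock = threading.Lock()
        self._sink_ready = sink_ready  # unset means the underlying write is stuck
        self.lines: List[str] = []

    def log(self, message: str) -> None:
        with self._lock:
            # Stand-in for a write that blocks at the system level.
            self._sink_ready.wait()
            self.lines.append(message)


def simulate_pileup(num_connections: int) -> Tuple[int, int]:
    """Return (handlers stuck while the sink is blocked, lines written after recovery)."""
    sink_ready = threading.Event()  # left unset: the sink is stuck
    logger = SharedLogger(sink_ready)

    def handle_connection(conn_id: int) -> None:
        # Each handler logs first, so it cannot proceed until the logger is free.
        logger.log(f"connection accepted #{conn_id}")

    threads = [
        threading.Thread(target=handle_connection, args=(i,), daemon=True)
        for i in range(num_connections)
    ]
    for t in threads:
        t.start()

    # One thread holds the mutex inside the blocked write; the rest wait on the mutex.
    stuck = sum(t.is_alive() for t in threads)

    sink_ready.set()  # the underlying write finally completes
    for t in threads:
        t.join()
    return stuck, len(logger.lines)


if __name__ == "__main__":
    stuck, written = simulate_pileup(8)
    print(f"{stuck} handlers stuck; {written} lines written after the sink recovered")
```

In this model no handler makes progress until the sink recovers, which matches the observed symptom: the process is alive, but new connections never get past their first log write.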
| Comment by Hagay Ben Shushan [ 22/Aug/18 ] |
|
Hi Kelsey, I've uploaded the proper shard-1 logs. The issue began at 2018-08-14T06:04:07. It was resolved at ~07:20 the same day, only by restarting the mongod on shard-1, which was the affected node.
Thanks, |
| Comment by Kelsey Schubert [ 21/Aug/18 ] |
|
Hi benshu, To help speed our investigation as we look through the data you've provided, would you please clarify exactly when the issue began, when it was resolved, and which node was affected? Would you please also double-check the logs on shard-1, and reupload the files for this node as the logs we've received for this node are corrupted. Thank you, |
| Comment by Hagay Ben Shushan [ 20/Aug/18 ] |
|
Hi, I'm an associate of Gal Ben David, and have uploaded all the required information via the upload portal. I would appreciate it if you could reopen the issue, as it is still relevant.
Thanks, |
| Comment by Dmitry Agranat [ 21/Jun/18 ] |
|
Hi wavenator, We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide the requested information and we will reopen the ticket. Thanks, |
| Comment by Gal Ben David [ 27/May/18 ] |
|
Hi Dmitry, The logs were full. This is part of the problem: the mongo instance stops logging when the problem occurs. I'll try to provide the data from all the instances, although it could take some time. |
| Comment by Dmitry Agranat [ 27/May/18 ] |
|
Hi wavenator, We still need some additional information to progress this investigation. If this is still an issue for you, can you please provide the following covering the time of the incident:
Thanks, |
| Comment by Dmitry Agranat [ 09/May/18 ] |
|
Hi wavenator, Unfortunately, the provided mongos and mongod logs do not correlate with the diagnostic.data.
Nevertheless, I was able to spot something suspicious in the provided diagnostic.data. In order to progress this investigation, please provide the following covering the time of the incident:
Even though I have a good suspect at the moment, I need to correlate all the requested data. In addition, I'd like to understand why the reported issue is impacting only a specific shard's primary. Thanks, |
| Comment by Kelsey Schubert [ 23/Apr/18 ] |
|
Thanks for uploading the files, wavenator – we're investigating. |
| Comment by Gal Ben David [ 19/Apr/18 ] |
|
Ok, I'll work on getting all this information ASAP. Edit: |
| Comment by Kelsey Schubert [ 16/Apr/18 ] |
|
Hi wavenator, Thank you for reporting this issue. So we can continue to investigate, would you please provide the following information:
Since these files may exceed the maximum upload limit to JIRA, I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time. Thank you for your help, |