[SERVER-53337] Mongos hangs and stop responding Created: 12/Dec/20 Updated: 04/Feb/21 Resolved: 26/Jan/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.5, 4.2.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ezra Levi | Assignee: | Edwin Zhou |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Hello, this issue is happening in several PRODUCTION environments and it's very serious. From time to time, the mongos service hangs: applications are unable to connect to ANY of the mongos servers, and each connection attempt waits and eventually times out.
I connected to the mongos via ssh and tried logging in to mongos, but the issue is the same. From the mongos logs, we can see the following when it started, over and over again:
The issue is resolved completely when I log in to the primary config server and run the rs.stepDown() command. Once the config primary changes, everything gets back to normal and connections are accepted again. These are the logs that appear on the config primary at the same time:
We first hit this issue on version 4.2.5. I thought it was similar to https://jira.mongodb.org/browse/SERVER-47553, so I upgraded to version 4.2.9, but it keeps happening in completely different clusters, which indicates that it is not a specific server or OS issue. I've classified this issue as Blocker - P1 since it is affecting multiple PROD environments. |
| Comments |
| Comment by Edwin Zhou [ 26/Jan/21 ] |
|
Thanks for providing the output of that command. As you noticed, this duplicates The key in question:
MongoDB 4.2.12 was just released with a fix for HMAC keys not renewing as expected. Best, |
| Comment by Ezra Levi [ 26/Jan/21 ] |
|
Hi Edwin, I've attached the output of this command below. It seems that this was indeed the issue, as the date and time match.
|
| Comment by Edwin Zhou [ 25/Jan/21 ] |
|
We still need additional information to diagnose the problem. If this is still an issue for you, could you provide the output of the command below?
Thanks, |
| Comment by Edwin Zhou [ 11/Jan/21 ] |
|
We believe you're hitting We can confirm by comparing the timestamps of the incident against the expiration dates of the keys. Could you provide the output of the command below?
Best, |
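The comparison described in the comment above (incident timestamps against the expiration dates of the signing keys) can be sketched roughly as follows. This is an illustration only, not the command requested in the thread. It assumes the HMAC key documents have already been exported from the config servers and that each expiry has been reduced to plain epoch seconds. The field name `expiresAt_secs`, the helper names, the sample documents, and the incident time are all hypothetical.

```python
from datetime import datetime, timezone


def key_expiry(key_doc):
    """Return the key's expiry as a timezone-aware UTC datetime."""
    # `expiresAt_secs` is a hypothetical, simplified field holding epoch seconds.
    return datetime.fromtimestamp(key_doc["expiresAt_secs"], tz=timezone.utc)


def expired_before(key_docs, incident):
    """Return the latest expiry if every key had expired by `incident`, else None."""
    expiries = [key_expiry(doc) for doc in key_docs]
    if not expiries:
        return None
    latest = max(expiries)
    return latest if latest <= incident else None


def utc_seconds(*args):
    """Build epoch seconds from UTC calendar fields (helper for the sample data)."""
    return int(datetime(*args, tzinfo=timezone.utc).timestamp())


# Hypothetical sample data, not taken from the reporter's cluster.
SAMPLE_KEYS = [
    {"_id": 1, "purpose": "HMAC", "expiresAt_secs": utc_seconds(2020, 6, 1)},
    {"_id": 2, "purpose": "HMAC", "expiresAt_secs": utc_seconds(2020, 8, 30)},
]

# Hypothetical incident start time.
incident = datetime(2020, 8, 30, 0, 10, tzinfo=timezone.utc)

gap_start = expired_before(SAMPLE_KEYS, incident)
if gap_start is None:
    print("At least one key was still valid at the incident time.")
else:
    print(f"All keys had expired by {gap_start.isoformat()}; "
          f"the incident began {incident - gap_start} later.")
```

If the function returns an expiry shortly before the incident, that is consistent with keys not being renewed in time. A `None` result points away from that explanation.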
| Comment by Ezra Levi [ 17/Dec/20 ] |
|
Hi Eric, I've uploaded the diagnostic.data folder for the config servers and for an additional shard to the portal. Regarding our topology, we have a standard sharded cluster, which includes two shards with 3 data nodes each, a config replica set, and three mongos servers. We have an additional, completely separate cluster that experienced exactly the same issue with the same version of MongoDB, but I don't have its logs to add here. This leads us to believe that the problem is not with the OS or the servers themselves, but with MongoDB itself. I'd appreciate it if you could help us investigate and get to the root cause of this issue.
|
| Comment by Eric Sedor [ 16/Dec/20 ] |
|
Thanks for your patience so far. I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Would you please upload tar or zip archives of the $dbpath/diagnostic.data directory (contents are described here) for:
It will help if you upload up-to-date/corresponding logs for each of these nodes, provide additional timestamps for incidents that have occurred, and describe the overall sharded cluster topology. Gratefully, |