[SERVER-53337] Mongos hangs and stops responding Created: 12/Dec/20  Updated: 04/Feb/21  Resolved: 26/Jan/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.5, 4.2.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ezra Levi Assignee: Edwin Zhou
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2020-12-22 at 3.35.01 PM.png     PNG File Screen Shot 2020-12-22 at 3.37.33 PM.png     PNG File Screen Shot 2020-12-22 at 4.21.37 PM.png     Zip Archive mongod_mongos_logs.zip    
Issue Links:
Duplicate
duplicates SERVER-52654 new signing keys not generated by the... Closed
Operating System: ALL
Participants:
Case:

 Description   

Hello,

This issue is happening to us in several PRODUCTION environments and it's very serious. 

From time to time, the mongos service simply hangs: applications are unable to connect to ANY of the mongos servers, and connection attempts wait and eventually time out.

System.TimeoutException: A timeout occured after 30000ms selecting a server using CompositeServerSelector{ Selectors = MongoDB.Driver.MongoClient+AreSessionsSupportedServerSelector, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. Client view of cluster state is { ClusterId : "1", ConnectionMode : "Automatic", Type : "Unknown", State : "Disconnected", Servers : [{ ServerId: "{ ClusterId : 1, EndPoint : "10.120.32.68:27017" }", EndPoint: "10.120.32.68:27017", ReasonChanged: "Heartbeat", State: "Disconnected", ServerVersion: , TopologyVersion: , Type: "Unknown", HeartbeatException: "MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server. ---> MongoDB.Driver.MongoConnectionException: An exception occurred while receiving a message from the server. ---> System.TimeoutException: The operation has timed out.

I connected to the mongos host via SSH and tried logging in to the mongos locally, but the issue is the same.

From the mongos logs, we can see the following repeating over and over from the time the issue started:

2020-12-12T08:06:03.901Z I - [conn1257891] operation was interrupted because a client disconnected 
2020-12-12T08:06:03.901Z I NETWORK [conn1257891] DBException handling request, closing client connection: ClientDisconnect: operation was interrupted 
2020-12-12T08:06:03.905Z I NETWORK [conn1302432] received client metadata from 10.248.127.193:18473 conn1302432: { driver: { name: "mongo-csharp-driver", version: "2.11.3.0" }, os: { type: "Linux", name: "Linux 4.15.0-64-generic #73-Ubuntu SMP T

The issue is resolved completely when I log in to the primary config server and run the rs.stepDown() command. Once the config primary changes, everything returns to normal and connections start coming in again.
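For reference, the workaround described above is a standard mongo shell admin command, run while connected to the primary of the config server replica set (the host name and the 60-second value below are illustrative, not from the ticket):

```javascript
// Run in a mongo shell connected to the config server replica set
// primary, e.g.: mongo --host cfg-primary.example.net --port 27019
// Step down the current primary; it will not seek re-election for the
// given number of seconds, forcing an election of a new primary.
rs.stepDown(60);
```

Per the diagnosis later in this ticket, recovery after the step-down is consistent with the newly elected primary generating fresh HMAC signing keys.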

These are the logs that appear on the config server primary at the same time:

2020-12-12T08:06:53.800Z I SHARDING [PeriodicShardedIndexConsistencyChecker] Checking consistency of sharded collection indexes across the cluster 
2020-12-12T08:06:53.837Z I SHARDING [PeriodicShardedIndexConsistencyChecker] Found 0 collections with inconsistent indexes 
2020-12-12T08:07:15.995Z I NETWORK [listener] connection accepted from 10.124.128.43:43410 #320308 (26 connections now open)

This issue first occurred on version 4.2.5. I thought it was similar to https://jira.mongodb.org/browse/SERVER-47553, so I upgraded to version 4.2.9, but it keeps happening, and in completely different clusters, which indicates that it is not a specific server or OS issue.

I've filed this issue as Blocker - P1 since it is affecting multiple PROD environments.
The logs from the mongos and the config primary server are attached.



 Comments   
Comment by Edwin Zhou [ 26/Jan/21 ]

Hi ezra.l@sbtech.com,

Thanks for providing the output of that command. As you noticed, this duplicates SERVER-52654 given the coinciding HMAC key expiration and incident date.

The key in question:

{
  "_id" : NumberLong("6867461206597173251"),
  "purpose" : "HMAC",
  "expiresAt" : ISODate("2020-12-12T08:05:32Z")
}

MongoDB 4.2.12 was just released with a fix for HMAC keys not renewing as expected.

Best,
Edwin

Comment by Ezra Levi [ 26/Jan/21 ]

Hi Edwin,

I've attached below the output of that command. It seems this was indeed the issue, as the date and time match.
I also checked two more clusters where the issue occurred, and there too I found a correlation between the expiration of the HMAC keys and the time the issue happened.

[
        {
                "_id" : NumberLong("6738289717277622301"),
                "purpose" : "HMAC",
                "expiresAt" : ISODate("2019-12-18T08:05:32Z")
        },
        {
                "_id" : NumberLong("6738289717277622302"),
                "purpose" : "HMAC",
                "expiresAt" : ISODate("2020-03-17T08:05:32Z")
        },
        {
                "_id" : NumberLong("6786638297272680495"),
                "purpose" : "HMAC",
                "expiresAt" : ISODate("2020-06-15T08:05:32Z")
        },
        {
                "_id" : NumberLong("6807747430172327939"),
                "purpose" : "HMAC",
                "expiresAt" : ISODate("2020-09-13T08:05:32Z")
        },
        {
                "_id" : NumberLong("6867461206597173251"),
                "purpose" : "HMAC",
                "expiresAt" : ISODate("2020-12-12T08:05:32Z")
        },
        {
                "_id" : NumberLong("6905290595640541186"),
                "purpose" : "HMAC",
                "expiresAt" : ISODate("2021-03-12T08:54:14Z")
        },
        {
                "_id" : NumberLong("6905290595640541187"),
                "purpose" : "HMAC",
                "expiresAt" : ISODate("2021-06-10T08:54:14Z")
        }
]

Comment by Edwin Zhou [ 25/Jan/21 ]

Hi ezra.l@sbtech.com,

We still need additional information to diagnose the problem. If this is still an issue for you, could you provide the output of the command below?

db.getSiblingDB("admin").system.keys.find().map(k => { return { _id: k._id, purpose: k.purpose, expiresAt: new Date(k.expiresAt.getTime()*1000) }})
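For readers unfamiliar with the key format: in the command above, expiresAt on a system.keys document is a BSON Timestamp whose getTime() (in the legacy mongo shell) returns seconds since the epoch, hence the multiplication by 1000 to build a JavaScript Date. A minimal plain-JavaScript sketch of that conversion, using a stand-in object with the value from this ticket instead of a live collection:

```javascript
// Stand-in for a system.keys document: getTime() on the BSON Timestamp
// returns seconds since the epoch, so building a Date needs * 1000.
const key = {
  _id: "6867461206597173251",
  purpose: "HMAC",
  expiresAt: { getTime: () => 1607760332 } // 2020-12-12T08:05:32Z
};

const expires = new Date(key.expiresAt.getTime() * 1000);
console.log(expires.toISOString()); // "2020-12-12T08:05:32.000Z"
```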

Thanks,
Edwin

Comment by Edwin Zhou [ 11/Jan/21 ]

Hi ezra.l@sbtech.com,

We believe you're hitting SERVER-52654, which is related to the ticket you linked in the description. To summarize that ticket: communication issues between the mongos and the rest of the cluster can occur when the cluster's HMAC signing keys, which expire every 90 days, are not renewed on expiry. It's likely that a new HMAC key was generated after you stepped down the primary.

We can confirm by comparing the timestamps of the incident against the expiration dates of the keys. Could you provide the output of the command below?

db.getSiblingDB("admin").system.keys.find().map(k => { return { _id: k._id, purpose: k.purpose, expiresAt: new Date(k.expiresAt.getTime()*1000) }})
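The correlation check described above can be sketched in plain JavaScript. The key data below is hypothetical (shaped like the system.keys documents, with one value taken from this ticket), and the incident time is the first error in the attached mongos log:

```javascript
// Given key expiry times (seconds since epoch, as system.keys stores
// them) and an incident timestamp, find the key whose expiry most
// closely precedes the incident; a near match points at SERVER-52654.
function expiryJustBefore(keys, incident) {
  return keys
    .map(k => new Date(k.expiresAt * 1000))
    .filter(d => d <= incident)
    .sort((a, b) => b - a)[0]; // latest expiry not after the incident
}

// Hypothetical data modeled on this ticket's values.
const keys = [
  { purpose: "HMAC", expiresAt: 1600000532 }, // an earlier key
  { purpose: "HMAC", expiresAt: 1607760332 }  // 2020-12-12T08:05:32Z
];
const incident = new Date("2020-12-12T08:06:03Z"); // first mongos error

console.log(expiryJustBefore(keys, incident).toISOString());
// "2020-12-12T08:05:32.000Z" -- 31 seconds before the hang began
```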

Best,
Edwin

Comment by Ezra Levi [ 17/Dec/20 ]

Hi Eric,

I've uploaded to the portal the diagnostic.data folder for the config servers and for an additional shard.
In addition, I've also uploaded all of the log files from the config servers, the additional shard, and all mongos servers that were impacted at that time.
The issue started on this sharded cluster on Dec 12, 2020 between 8:00 AM and 9:00 AM server time, and lasted until I stepped down the primary of the config server replica set.

Regarding our topology, we have a standard sharded cluster, which includes two shards with 3 data nodes each, a config replica-set, and three mongos servers.
They are all running MongoDB 4.2.9.

We have an additional, completely separate cluster that experienced exactly the same issue on the same version of MongoDB; however, I don't have its logs to add here. This leads us to believe that this is not a problem with the OS or the servers themselves, but something in the MongoDB application.

I'd appreciate it if you can help us investigate and get to the root cause of this issue.


Comment by Eric Sedor [ 16/Dec/20 ]

Hi ezra.l@sbtech.com,

Thanks for your patience so far. I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Would you please upload tar or zip archives of the $dbpath/diagnostic.data directory (contents are described here) for:

  • an impacted mongos
  • the nodes of a representative shard
  • all nodes of the config server replica set

It will help if you upload up-to-date/corresponding logs for each of these nodes, provide additional timestamps for incidents that have occurred, and describe the overall sharded cluster topology.

Gratefully,
Eric

Generated at Thu Feb 08 05:30:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.