- Type: Bug
- Resolution: Duplicate
- Priority: Major - P3
- None
- Affects Version/s: 4.2.5, 4.2.9
- Component/s: None
- None
- ALL
- (copied to CRM)
Hello,
This issue is happening to us in several PRODUCTION environments and it's very serious.
From time to time, the mongos service hangs: applications are unable to connect to ANY of the mongos servers, and connection attempts wait and eventually time out.
System.TimeoutException: A timeout occured after 30000ms selecting a server using CompositeServerSelector{ Selectors = MongoDB.Driver.MongoClient+AreSessionsSupportedServerSelector, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. Client view of cluster state is { ClusterId : "1", ConnectionMode : "Automatic", Type : "Unknown", State : "Disconnected", Servers : [{ ServerId: "{ ClusterId : 1, EndPoint : "10.120.32.68:27017" }", EndPoint: "10.120.32.68:27017", ReasonChanged: "Heartbeat", State: "Disconnected", ServerVersion: , TopologyVersion: , Type: "Unknown", HeartbeatException: "MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server. ---> MongoDB.Driver.MongoConnectionException: An exception occurred while receiving a message from the server. ---> System.TimeoutException: The operation has timed out.
I connected to the mongos host via SSH and tried to log in to mongos from there, but the result was the same.
From the mongos logs, we can see the following when it started, over and over again:
2020-12-12T08:06:03.901Z I - [conn1257891] operation was interrupted because a client disconnected
2020-12-12T08:06:03.901Z I NETWORK [conn1257891] DBException handling request, closing client connection: ClientDisconnect: operation was interrupted
2020-12-12T08:06:03.905Z I NETWORK [conn1302432] received client metadata from 10.248.127.193:18473 conn1302432: { driver: { name: "mongo-csharp-driver", version: "2.11.3.0" }, os: { type: "Linux", name: "Linux 4.15.0-64-generic #73-Ubuntu SMP T
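To gauge how often this pattern repeats, the mongos log can be tallied per minute. The following is a minimal sketch, assuming the 4.2 plain-text log layout shown above (timestamp, severity, component, bracketed context, message). The function name `count_client_disconnects` and the `sample` list are hypothetical and not part of any MongoDB tool.

```python
import re
from collections import Counter

# Matches the start of a 4.2-style log entry: ISO timestamp, severity,
# component (or "-"), and the bracketed context such as [conn1257891].
# The captured timestamp is truncated to minute resolution.
ENTRY_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d{3}Z\s+"
    r"(?P<sev>[A-Z])\s+(?P<comp>\S+)\s+\[(?P<ctx>[^\]]+)\]\s+(?P<msg>.*)"
)


def count_client_disconnects(lines):
    """Return a Counter mapping 'YYYY-MM-DDTHH:MM' to the number of
    connections closed with a ClientDisconnect error in that minute."""
    per_minute = Counter()
    for line in lines:
        m = ENTRY_RE.match(line)
        if m and "ClientDisconnect" in m.group("msg"):
            per_minute[m.group("ts")] += 1
    return per_minute


# Two entries copied from the log excerpt above; only the second one
# carries the ClientDisconnect error code.
sample = [
    "2020-12-12T08:06:03.901Z I - [conn1257891] operation was interrupted because a client disconnected",
    "2020-12-12T08:06:03.901Z I NETWORK [conn1257891] DBException handling request, closing client connection: ClientDisconnect: operation was interrupted",
]
print(count_client_disconnects(sample))
```

A sudden jump in the per-minute count would mark the point at which clients started giving up on the hung mongos.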
The issue is resolved completely when I log in to the primary config server and run the rs.stepDown() command. Once the config primary changes, everything returns to normal and connections are accepted again.
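The same workaround can be scripted. The following is a minimal sketch, assuming a client object that behaves like PyMongo's `MongoClient` connected directly to the config server primary. The function name `step_down_if_primary` is hypothetical. `replSetStepDown` is the server command behind the shell's `rs.stepDown()`, and 60 seconds is the shell helper's default step-down period. This only automates the workaround described above; it does not address the underlying cause.

```python
def step_down_if_primary(client, step_down_secs=60):
    """Ask the node `client` is connected to to step down as primary.

    `client` is assumed to expose `client.admin.command(...)` the way
    pymongo.MongoClient does. Returns True if a step-down was requested,
    False if the node was not primary.
    """
    # isMaster reports whether this node currently considers itself primary.
    status = client.admin.command("isMaster")
    if not status.get("ismaster"):
        return False
    # Equivalent to rs.stepDown(step_down_secs) in the mongo shell.
    client.admin.command("replSetStepDown", step_down_secs)
    return True
```

With PyMongo this might be called as `step_down_if_primary(MongoClient("mongodb://<config-primary-host>:<port>/?directConnection=true"))`, where the host and port are placeholders for the current config primary.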
These are the logs that appear on the config primary server at the same time:
2020-12-12T08:06:53.800Z I SHARDING [PeriodicShardedIndexConsistencyChecker] Checking consistency of sharded collection indexes across the cluster
2020-12-12T08:06:53.837Z I SHARDING [PeriodicShardedIndexConsistencyChecker] Found 0 collections with inconsistent indexes
2020-12-12T08:07:15.995Z I NETWORK [listener] connection accepted from 10.124.128.43:43410 #320308 (26 connections now open)
We first hit this issue on version 4.2.5. I thought it was similar to https://jira.mongodb.org/browse/SERVER-47553, so I upgraded to version 4.2.9, but it keeps happening in completely different clusters, which indicates that it is not specific to one server or OS.
I've defined this issue as Blocker - P1 since it is affecting multiple PROD environments.
The logs from the mongos and the config primary server are attached.
- duplicates
- SERVER-52654 new signing keys not generated by the monitoring-keys-for-HMAC thread
- Closed