Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 4.2.5, 4.2.9
Component/s: None
Labels: None

Operating System: ALL
Case:
Confidence Status: None
Work Order: 3
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Hello,

This issue is occurring in several of our PRODUCTION environments and it is very serious.

From time to time, the mongos service simply hangs: applications are unable to connect to ANY of the mongos servers, and each connection attempt waits and eventually times out.

System.TimeoutException: A timeout occured after 30000ms selecting a server using CompositeServerSelector{ Selectors = MongoDB.Driver.MongoClient+AreSessionsSupportedServerSelector, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. Client view of cluster state is { ClusterId : "1", ConnectionMode : "Automatic", Type : "Unknown", State : "Disconnected", Servers : [{ ServerId: "{ ClusterId : 1, EndPoint : "10.120.32.68:27017" }", EndPoint: "10.120.32.68:27017", ReasonChanged: "Heartbeat", State: "Disconnected", ServerVersion: , TopologyVersion: , Type: "Unknown", HeartbeatException: "MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server. ---> MongoDB.Driver.MongoConnectionException: An exception occurred while receiving a message from the server. ---> System.TimeoutException: The operation has timed out.

I connected to the mongos host via ssh and tried logging in to mongos from there, but hit the same issue.

The mongos logs show the following, repeated over and over from the moment the issue started:

2020-12-12T08:06:03.901Z I - [conn1257891] operation was interrupted because a client disconnected 
2020-12-12T08:06:03.901Z I NETWORK [conn1257891] DBException handling request, closing client connection: ClientDisconnect: operation was interrupted 
2020-12-12T08:06:03.905Z I NETWORK [conn1302432] received client metadata from 10.248.127.193:18473 conn1302432: { driver: { name: "mongo-csharp-driver", version: "2.11.3.0" }, os: { type: "Linux", name: "Linux 4.15.0-64-generic #73-Ubuntu SMP T
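
To narrow down when the hang begins, the ClientDisconnect entries in the mongos log can be counted per minute. The following is only a minimal Python sketch, assuming the plain-text log layout shown in the excerpt above (timestamp, severity, component, bracketed context, message). The names `LOG_LINE`, `count_disconnects_per_minute`, and `SAMPLE` are illustrative and not part of any MongoDB tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
# <timestamp>Z <severity> <component> [<context>] <message>
LOG_LINE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d{3}Z\s+"
    r"(?P<severity>[A-Z])\s+(?P<component>\S+)\s+"
    r"\[(?P<context>[^\]]+)\]\s+(?P<message>.*)$"
)


def count_disconnects_per_minute(lines):
    """Count log lines whose message mentions ClientDisconnect, keyed by minute."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "ClientDisconnect" in match.group("message"):
            counts[match.group("minute")] += 1
    return counts


# Two lines copied from the excerpt above.
SAMPLE = [
    "2020-12-12T08:06:03.901Z I - [conn1257891] operation was interrupted "
    "because a client disconnected",
    "2020-12-12T08:06:03.901Z I NETWORK [conn1257891] DBException handling "
    "request, closing client connection: ClientDisconnect: operation was interrupted",
]

if __name__ == "__main__":
    for minute, count in sorted(count_disconnects_per_minute(SAMPLE).items()):
        print(minute, count)
```

Only the NETWORK line carries the `ClientDisconnect` code, so each disconnect is counted once even though it produces two log lines. A sudden jump in the per-minute count would mark the approximate start of a hang, which can then be compared with the config primary log for the same window.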

The issue is completely resolved when I log in to the primary config server and run the rs.stepDown() command. Once the config primary changes, everything returns to normal and connections are accepted again.

These are the logs that appear on the config primary server at the same time:

2020-12-12T08:06:53.800Z I SHARDING [PeriodicShardedIndexConsistencyChecker] Checking consistency of sharded collection indexes across the cluster 
2020-12-12T08:06:53.837Z I SHARDING [PeriodicShardedIndexConsistencyChecker] Found 0 collections with inconsistent indexes 
2020-12-12T08:07:15.995Z I NETWORK [listener] connection accepted from 10.124.128.43:43410 #320308 (26 connections now open)

This issue first occurred for us in version 4.2.5. I thought it was similar to https://jira.mongodb.org/browse/SERVER-47553, so I upgraded to version 4.2.9, but it keeps happening, and in completely different clusters, which indicates that it is not specific to one server or OS.

I've defined this issue as Blocker - P1 since it is affecting multiple PROD environments.
The logs from the mongos and the config primary server are attached.