Core Server / SERVER-53337

Mongos hangs and stops responding

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 4.2.5, 4.2.9
    • Component/s: None
    • Operating System: ALL

      Hello,

      This issue is happening to us in several PRODUCTION environments and it's very serious. 

      From time to time, the mongos service just hangs: applications are unable to connect to ANY of the mongos servers, and connection attempts just wait and eventually time out.

      System.TimeoutException: A timeout occured after 30000ms selecting a server using CompositeServerSelector{ Selectors = MongoDB.Driver.MongoClient+AreSessionsSupportedServerSelector, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. Client view of cluster state is { ClusterId : "1", ConnectionMode : "Automatic", Type : "Unknown", State : "Disconnected", Servers : [{ ServerId: "{ ClusterId : 1, EndPoint : "10.120.32.68:27017" }", EndPoint: "10.120.32.68:27017", ReasonChanged: "Heartbeat", State: "Disconnected", ServerVersion: , TopologyVersion: , Type: "Unknown", HeartbeatException: "MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server. ---> MongoDB.Driver.MongoConnectionException: An exception occurred while receiving a message from the server. ---> System.TimeoutException: The operation has timed out.
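
      For context, the 30000ms in the exception matches the driver's default serverSelectionTimeoutMS (30 seconds in the official MongoDB drivers). It can be changed via the connection string, as in this minimal sketch (host names are placeholders; raising the value would only delay the timeout, not fix the hang):

        mongodb://mongos1.example.internal:27017,mongos2.example.internal:27017/?serverSelectionTimeoutMS=30000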
      

      I connected to the mongos host via SSH and tried logging in to mongos locally, but the issue is the same.
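
      As a quick responsiveness check that bypasses the application drivers entirely, something like the following can be run from the mongos host (a sketch; assumes the default port and a local mongo shell):

        # If mongos is hung, even this lightweight ping blocks until the shell gives up.
        mongo --host localhost --port 27017 --eval 'db.adminCommand({ ping: 1 })'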

      In the mongos logs, we can see the following, repeating over and over, from the moment the issue started:

      2020-12-12T08:06:03.901Z I - [conn1257891] operation was interrupted because a client disconnected 
      2020-12-12T08:06:03.901Z I NETWORK [conn1257891] DBException handling request, closing client connection: ClientDisconnect: operation was interrupted 
      2020-12-12T08:06:03.905Z I NETWORK [conn1302432] received client metadata from 10.248.127.193:18473 conn1302432: { driver: { name: "mongo-csharp-driver", version: "2.11.3.0" }, os: { type: "Linux", name: "Linux 4.15.0-64-generic #73-Ubuntu SMP T
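
      A rough way to gauge how often this is happening is to count the repeated line in the mongos log (a sketch; the log path is a placeholder for wherever systemLog.path points):

        grep -c 'operation was interrupted because a client disconnected' /var/log/mongodb/mongos.log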
      

      The issue is resolved completely when I log in to the primary config server and run the rs.stepDown() command. Once the config primary changes, everything goes back to normal and connections start coming in again.
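
      For reference, the workaround amounts to the following, run against the config server replica set (host and port are placeholders; 27019 is just the conventional config server port):

        $ mongo --host cfgsvr1.example.internal --port 27019
        > rs.stepDown()   // the primary steps down for 60 seconds by default, forcing an election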

      These are the logs that appear on the config primary server at the same time:

      2020-12-12T08:06:53.800Z I SHARDING [PeriodicShardedIndexConsistencyChecker] Checking consistency of sharded collection indexes across the cluster 
      2020-12-12T08:06:53.837Z I SHARDING [PeriodicShardedIndexConsistencyChecker] Found 0 collections with inconsistent indexes 
      2020-12-12T08:07:15.995Z I NETWORK [listener] connection accepted from 10.124.128.43:43410 #320308 (26 connections now open)
      

      This issue first occurred to us on version 4.2.5. I thought it was similar to https://jira.mongodb.org/browse/SERVER-47553, so I upgraded to version 4.2.9, but it happens again and again in completely different clusters, which indicates that it is not a specific server or OS issue.

      I've defined this issue as Blocker - P1 since it is affecting multiple PROD environments.
      The logs from the mongos and the config primary server are attached.

        1. mongod_mongos_logs.zip (415 kB)
        2. Screen Shot 2020-12-22 at 3.35.01 PM.png (532 kB)
        3. Screen Shot 2020-12-22 at 3.37.33 PM.png (522 kB)
        4. Screen Shot 2020-12-22 at 4.21.37 PM.png (52 kB)

            Assignee:
            edwin.zhou@mongodb.com Edwin Zhou
            Reporter:
            ezra.l@sbtech.com Ezra Levi
            Votes:
            0
            Watchers:
            11
