[SERVER-26701] MongoS stalls when it cannot access one of the CSRS server Created: 19/Oct/16  Updated: 14/Dec/16  Resolved: 14/Dec/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Darshan Shah Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-26723 Mongos stalls for even simple queries Closed
Operating System: ALL
Participants:

 Description   

Running a sharded cluster using MongoDB 3.2.5 with WiredTiger. Of late, the mongos on some machines just stalls when it cannot access one of the CSRS servers.

The CSRS is a 3-node replica set, and only a couple of mongos processes get stuck in this scenario. For this particular problem, we can ignore the reason one CSRS server was not available.

Here is the mongos log snippet from when this happened. The mongos tried to switch the order of the CSRS servers and then, after a couple of attempts, it just hung. Any queries using that mongos process would not return.

2016-10-19T09:02:30.381-0400 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 58076ee6087c1d793ad3b986
2016-10-19T09:02:32.657-0400 I SHARDING [Balancer] distributed lock with ts: 58076ee6087c1d793ad3b986' unlocked.
2016-10-19T09:03:47.065-0400 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 58076f33087c1d793ad3b98d
2016-10-19T09:03:49.341-0400 I SHARDING [Balancer] distributed lock with ts: 58076f33087c1d793ad3b98d' unlocked.
2016-10-19T09:05:28.947-0400 I NETWORK  [ReplicaSetMonitorWatcher] changing hosts to csReplSet/mongoconfigserver1:29102,mongoconfigserver3:29102,mongoconfigserver2:29102 from csReplSet/mongoconfigserver1:29102,mongoconfigserver2:29102
2016-10-19T09:05:28.947-0400 I SHARDING [ReplicaSetMonitorWatcher] Updating config server connection string to: csReplSet/mongoconfigserver1:29102,mongoconfigserver3:29102,mongoconfigserver2:29102
2016-10-19T09:05:28.947-0400 I SHARDING [ReplicaSetMonitorWatcher] Updating ShardRegistry connection string for shard config from: csReplSet/mongoconfigserver1:29102,mongoconfigserver2:29102 to: csReplSet/mongoconfigserver1:29102,mongoconfigserver3:29102,mongoconfigserver2:29102
2016-10-19T09:06:15.520-0400 W SHARDING [Balancer] ExceededTimeLimit: Couldn't get a connection within the time limit
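
The "changing hosts" entry in the log above can be decoded mechanically to see what actually changed in the config server connection string. The following is a minimal Python sketch, not a MongoDB tool: the helper names `parse_conn_string` and `host_changes` are made up for illustration, and the regular expression assumes the 3.2-style message format shown in this ticket. Applied to the line above, it suggests that mongoconfigserver3:29102 was added back to the monitored set, rather than the hosts only being reordered.

```python
import re

# Matches the ReplicaSetMonitorWatcher "changing hosts" message as it appears
# in the log above (assumed format; other versions may word it differently).
CHANGE_RE = re.compile(r"changing hosts to (?P<new>\S+) from (?P<old>\S+)")


def parse_conn_string(conn):
    """Split 'setName/host1:port,host2:port' into (setName, set of hosts)."""
    set_name, _, hosts = conn.partition("/")
    return set_name, set(hosts.split(","))


def host_changes(line):
    """Return (added, removed) host sets for a 'changing hosts' line, else None."""
    m = CHANGE_RE.search(line)
    if m is None:
        return None
    _, new_hosts = parse_conn_string(m.group("new"))
    _, old_hosts = parse_conn_string(m.group("old"))
    return new_hosts - old_hosts, old_hosts - new_hosts


# The line copied from the mongos log in this ticket.
line = (
    "2016-10-19T09:05:28.947-0400 I NETWORK  [ReplicaSetMonitorWatcher] "
    "changing hosts to csReplSet/mongoconfigserver1:29102,"
    "mongoconfigserver3:29102,mongoconfigserver2:29102 "
    "from csReplSet/mongoconfigserver1:29102,mongoconfigserver2:29102"
)

added, removed = host_changes(line)
print("added:", sorted(added))
print("removed:", sorted(removed))
```

Comparing host sets instead of raw strings ignores ordering, so a pure reordering would show up as no additions and no removals.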



 Comments   
Comment by Kelsey Schubert [ 14/Dec/16 ]

Hi darshan.shah@interactivedata.com,

Since we have identified a glibc bug in SERVER-26723, which likely explains this issue as well, I will be closing this as a duplicate. If you experience this issue after ensuring that you are no longer affected by this bug, please let us know and we will continue to investigate.

Thank you,
Thomas

Comment by Darshan Shah [ 02/Nov/16 ]

This looks similar, as the symptoms are the same: MongoS is stalled.
However, in this case, it is confirmed that the config server was definitely not reachable.
In the case of SERVER-26723, the config server is definitely reachable.

Comment by Ramon Fernandez Marina [ 02/Nov/16 ]

darshan.shah@interactivedata.com, is this the same issue that we're investigating under SERVER-26723?

Comment by Darshan Shah [ 19/Oct/16 ]

Note that at the time this occurred, the CSRS node mongoconfigserver2:29102 had the following in its log, showing that it was indeed not reachable at that time:

2016-10-19T09:01:10.957-0400 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 10.170.7.16:29102, reason: errno:111 Connection refused
2016-10-19T09:01:21.945-0400 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 10.170.7.16:29102, reason: errno:111 Connection refused
2016-10-19T09:01:32.933-0400 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 10.170.7.16:29102, reason: errno:111 Connection refused
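
For reference, errno 111 on Linux is ECONNREFUSED, meaning the target host answered but nothing was accepting connections on that port. The entries above can be summarized with a short script when the log is longer than three lines. This is a rough Python sketch under the assumption that the lines follow the format shown in this ticket; `FAIL_RE`, `parse_ts`, and `summarize_failures` are illustrative names, not part of any MongoDB tooling.

```python
import re
from datetime import datetime

# Pattern for the "Failed to connect" warnings as they appear in this ticket
# (assumed format; adjust for other server versions).
FAIL_RE = re.compile(
    r"^(?P<ts>\S+) W NETWORK\s+\[(?P<thread>[^\]]+)\] "
    r"Failed to connect to (?P<target>[^,\s]+), "
    r"reason: errno:(?P<errno>\d+) (?P<msg>.+)$"
)


def parse_ts(ts):
    """Parse a timestamp such as 2016-10-19T09:01:10.957-0400."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")


def summarize_failures(lines):
    """Group connection failures by target; returns {target: [timestamps]}."""
    failures = {}
    for raw in lines:
        m = FAIL_RE.match(raw.strip())
        if m:
            failures.setdefault(m.group("target"), []).append(parse_ts(m.group("ts")))
    return failures


# The three lines copied from the comment above.
log = """\
2016-10-19T09:01:10.957-0400 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 10.170.7.16:29102, reason: errno:111 Connection refused
2016-10-19T09:01:21.945-0400 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 10.170.7.16:29102, reason: errno:111 Connection refused
2016-10-19T09:01:32.933-0400 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 10.170.7.16:29102, reason: errno:111 Connection refused
"""

failures = summarize_failures(log.splitlines())
stamps = failures["10.170.7.16:29102"]
# Seconds between consecutive failed attempts against the same target.
gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
print(len(stamps), "failures; gaps in seconds:", gaps)
```

On the lines quoted here, the failed attempts are roughly 11 seconds apart and all target the same address.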

Generated at Thu Feb 08 04:12:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.