[SERVER-4997] Mongos not clearing stale connections Created: 17/Feb/12  Updated: 06/Apr/23  Resolved: 19/Feb/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: Christian Tonhäuser
Assignee: Unassigned
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-4706 when a socket between mongos and mong... Closed
Related
related to SERVER-9041 proactively detect broken connections... Closed
Operating System: ALL
Participants:

 Description   

We had the following issue on our production environment today:

Due to a mistake, a mongod process needed to be restarted. This caused the replica set to fail over, and the secondary member became primary.
However, after the freshly restarted mongod came back up, another election was held and it was re-elected primary.

From that point on, it was no longer possible to query a non-sharded DB that resides on the replica set that experienced the restart.
Connecting to mongos and trying to query the database returned the following error in mongo shell:
[code]
mongos> db.collection.find()
error: { "$err" : "socket exception", "code" : 9001 }
[code]

After manually repeating the query in mongo shell between 20 and 40 times, the situation eventually cleared up and queries worked normally again, both from the shell and from our application. Unfortunately, this process had to be repeated for every mongos instance in the cluster, six in total.
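
The manual workaround above amounts to a retry loop. A minimal Python sketch of that loop follows, for illustration only: `SocketException`, `retry_on_socket_error` and `FlakyQuery` are hypothetical names, not part of any MongoDB driver, and the simulated query assumes that each failed attempt uses up one stale connection.

```python
class SocketException(Exception):
    """Hypothetical stand-in for the 'socket exception' error (code 9001)."""


def retry_on_socket_error(run_query, max_attempts=50):
    """Call run_query until it succeeds; return (result, failed_attempts).

    Re-raises the last SocketException once max_attempts failures occur.
    """
    failures = 0
    while True:
        try:
            return run_query(), failures
        except SocketException:
            failures += 1
            if failures >= max_attempts:
                raise


class FlakyQuery:
    """Simulated query that fails once per stale connection (assumption)."""

    def __init__(self, stale_connections):
        self.stale = stale_connections

    def __call__(self):
        if self.stale > 0:
            self.stale -= 1  # the broken connection is discarded on error
            raise SocketException("socket exception")
        return ["doc"]
```

With 30 simulated stale connections, `retry_on_socket_error(FlakyQuery(30))` fails 30 times before returning a result, which is in the range of retries reported above.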

It looks to me as if mongos does not check connections to the cluster's other members before using them.
Is it possible to add that functionality?
It wouldn't need to check before every use of the connection (though that behaviour might be desirable in some cases, in the same way JDBC connection pools can validate connections to SQL databases from Java), but the administrator shouldn't have to clear out the stale connections manually.

Or is it already there and we just haven't seen the switch for it yet?
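
To illustrate the kind of check being requested, here is a small Python sketch of a pool that validates a connection when it is borrowed and replaces it if it is dead. This is not how mongos works internally; `FakeConnection`, `ValidatingPool` and their methods are hypothetical and only model the idea.

```python
class FakeConnection:
    """Hypothetical connection; `alive` turns False when the server restarts."""

    def __init__(self):
        self.alive = True

    def ping(self):
        # Cheap liveness check, comparable to a JDBC pool's validation query.
        return self.alive

    def query(self):
        if not self.alive:
            raise ConnectionError("socket exception")
        return ["doc"]


class ValidatingPool:
    """Pool that checks each connection before handing it out."""

    def __init__(self, size, factory=FakeConnection):
        self._factory = factory
        self._idle = [factory() for _ in range(size)]
        self.replaced = 0  # number of stale connections swapped out

    def borrow(self):
        conn = self._idle.pop() if self._idle else self._factory()
        if not conn.ping():
            # Stale connection: replace it instead of surfacing an error.
            conn = self._factory()
            self.replaced += 1
        return conn

    def give_back(self, conn):
        self._idle.append(conn)
```

The trade-off is one extra round trip per borrow, which is why such checks are often made optional or run only after a connection has been idle for some time.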



 Comments   
Comment by Eliot Horowitz (Inactive) [ 19/Feb/12 ]

An admin shouldn't have to cycle today - but you will get one error per connection.
There is a case to do this aggressively: SERVER-4706
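
The difference between "one error per connection" and a more aggressive cleanup can be sketched with a small Python simulation. It assumes that "aggressively" means dropping every pooled connection to a host on the first error; that reading is an assumption made for illustration, not a description of what SERVER-4706 implements. `count_errors` is a hypothetical helper.

```python
def count_errors(pool_size, flush_on_first_error):
    """Count failed queries against a pool whose connections are all stale."""
    stale = [True] * pool_size  # every pooled connection predates the restart
    errors = 0
    for i in range(pool_size * 2):  # enough queries to touch every slot twice
        slot = i % pool_size
        if stale[slot]:
            errors += 1
            if flush_on_first_error:
                # Aggressive: reconnect every pooled connection at once.
                stale = [False] * pool_size
            else:
                # Lazy: only the connection that failed is reconnected.
                stale[slot] = False
    return errors
```

Under these assumptions, `count_errors(30, False)` yields 30 errors while `count_errors(30, True)` yields 1.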
