[SERVER-4094] better mongos handling of state where connections can be established but mongod unresponsive
Created: 18/Oct/11  Updated: 06/Dec/22  Resolved: 31/May/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Greg Studer
Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done
Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File hang_setup.js    
Issue Links:
Depends
Related
related to SERVER-7862 Connection timeouts in mongos Closed
is related to SERVER-4661 Mongos doesn't detect primary change ... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:

 Description   

It's possible for mongos to get into a hung state in a sharded cluster with a replica set shard. If the shard primary continues to accept connections but does not respond to any other requests, failover occurs and a new primary is elected normally, but sharded queries via mongos block and never return.
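
For illustration only, here is a minimal Python sketch of the underlying condition: a peer that completes the TCP handshake but never sends a reply. It is not mongos code. The names `probe` and `make_silent_listener` are hypothetical, and the "unresponsive primary" is simulated by a listening socket that never calls accept(), so the kernel still accepts the connection. Without the timeout the recv() would block indefinitely, which is roughly what the hung query looks like from the client side.

```python
import socket


def probe(host, port, timeout_s):
    """Connect, send one byte, and wait up to timeout_s for any reply.

    Returns "connect_failed", "no_reply", "closed", or "replied".
    (Hypothetical helper, for illustration only.)
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        return "connect_failed"
    with sock:
        try:
            sock.sendall(b"x")
            data = sock.recv(1)  # would block forever if no timeout were set
        except socket.timeout:
            return "no_reply"
        except OSError:
            return "closed"
        return "replied" if data else "closed"


def make_silent_listener():
    """Listening socket that never calls accept().

    The kernel still completes the TCP handshake for queued connections,
    so clients can connect but never receive any data.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    return srv


if __name__ == "__main__":
    srv = make_silent_listener()
    port = srv.getsockname()[1]
    print(probe("127.0.0.1", port, 0.5))
    srv.close()
```

The point of the sketch is that a successful connect says nothing about whether the peer will ever answer, so a liveness check based only on connecting cannot detect this state.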

This failure mode has been observed on EC2.

To reproduce locally:
1) Run the attached script (sets up a sharded cluster with a sharded collection)
2) Run mongo localhost:31000 to connect to the mongos
3) > use foo
4) > db.bar.find().itcount()
5) Stop all data transfer on established connections to the primary via iptables:
sudo /sbin/iptables -A INPUT -i lo -p tcp -m tcp --dport <mongod primary port> -m conntrack --ctstate ESTABLISHED -j DROP
6) Wait for failover
7) > db.bar.find().itcount()
(this second query blocks and does not return)



 Comments   
Comment by Ratika Gandhi [ 31/May/19 ]

TCP heartbeats should solve the problem of network blackholing.
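
If "TCP heartbeats" here means TCP keepalive probes (an assumption, the comment does not say), a minimal Python sketch of enabling them on a socket looks like the following. The helper name `enable_keepalive` and the default values are hypothetical, and the three TCP_KEEP* options are platform-specific, so they are applied only where the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive for sock (hypothetical helper, illustrative defaults).

    idle_s:     seconds of silence before the first probe is sent
    interval_s: seconds between unanswered probes
    probes:     unanswered probes before the kernel reports the connection dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tuning options are not available on every platform, so look
    # them up by name and skip any that are missing.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

Keepalive probes are answered by the peer's kernel, so this covers the dropped-packet case from the repro steps (probes go unanswered and a blocked recv() eventually fails), but not a host whose network stack is healthy while the mongod process itself is stuck.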

Comment by Eric Milkie [ 17/Jan/12 ]

Upon further discussion, it sounds like it would be better to deliver a SIGHUP signal to the thread blocked in the recv(), and set up a signal handler just for this area of code.

Comment by Eric Milkie [ 17/Jan/12 ]

It looks like it would be okay to close the socket as a way of freeing up the thread blocked on a recv() of the dead socket.
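
A rough Python sketch of the idea, not the server code: one thread blocks in recv() on a connection that never receives data, and a second thread releases it. The function names are hypothetical. The sketch calls shutdown() before close(), on the assumption that (at least on Linux) a bare close() from another thread is not guaranteed to wake a recv() that is already blocked, whereas shutdown() is.

```python
import socket
import threading


def recv_until_released(sock, result):
    """Block in recv() and record what it returned (or raised) on wake-up."""
    try:
        result.append(sock.recv(1))
    except OSError as exc:
        result.append(exc)


def demo(wait_s=0.2):
    """Return (still_blocked, released, result) for a recv() on a dead peer."""
    # Listener that never accepts: connect succeeds, no data ever arrives.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    client = socket.create_connection(srv.getsockname())  # blocking socket
    result = []
    worker = threading.Thread(
        target=recv_until_released, args=(client, result), daemon=True
    )
    worker.start()

    # Give the worker time to enter recv(); it should still be stuck there.
    worker.join(wait_s)
    still_blocked = worker.is_alive()

    # Release the blocked thread from the outside, then clean up.
    client.shutdown(socket.SHUT_RDWR)
    worker.join(5)
    released = not worker.is_alive()

    client.close()
    srv.close()
    return still_blocked, released, result


if __name__ == "__main__":
    print(demo())
```

The blocked thread sees an ordinary end-of-stream or socket error rather than an asynchronous interruption, so it can run its normal error handling path.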

Comment by Greg Studer [ 18/Oct/11 ]

Reproduced in master

Comment by Greg Studer [ 18/Oct/11 ]

To clarify, I only tried this with 2.0.0; not sure whether other versions are affected.

Generated at Thu Feb 08 03:04:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.