[SERVER-38220] When a mongod is dropped, mongos keeps retrying to connect to mongod even if the client has killed the connection. Created: 21/Nov/18  Updated: 27/Oct/23  Resolved: 26/Nov/18

Status: Closed
Project: Core Server
Component/s: Networking, Sharding
Affects Version/s: 3.4.18
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Feng Liyuan [X] Assignee: Danny Hatcher (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

1. my environment is 

10.200.20.21 client
10.200.20.59:2710 mongos
10.200.20.64:2701 mongod

my client is a go program that do a single query

package main
 
import (
	"log"
	"time"
 
	"github.com/globalsign/mgo"
	"gopkg.in/mgo.v2/bson"
)
 
func main() {
	s, err := mgo.Dial("10.200.20.59:2710")
	if err != nil {
		panic(err)
	}
	s.SetSocketTimeout(1.5e9)
	coll := s.DB("beta_rs").C("rs")
	var ret interface{}
	log.Println("start")
	err = coll.Find(bson.M{"_id": "8h1hkp:BatchUploadFiles_98Dk5t7Mkb"}).One(&ret)
	log.Println(ret, err)
	time.Sleep(100e9)
}

2. drop all connection from mongos to mongod
run this script on mongod's server:

iptables  -I  qiniu-shield-chain  4  -s 10.200.20.59 -p tcp  --dport 2701  -j DROP

3. run the client, the query keeps about 1.5 second and meets SocketTimeout, then the client killed the connection.

qboxserver@cs1:~$ ./testrs
2018/11/20 16:44:06 start
2018/11/20 16:44:07 <nil> read tcp 10.200.20.21:20006->10.200.20.59:2710: i/o timeout
^C

4. But mongos keeps the connection 13 seconds, trys to connect to mongod 4 times.

 2018-11-20T16:44:06.894+0800 I NETWORK  [thread2] connection accepted from 10.200.20.21:20006 #18188 (14 connections now open)
2018-11-20T16:44:06.895+0800 I NETWORK  [conn18188] received client metadata from 10.200.20.21:20006 conn18188: { driver: { name: "mgo", version: "globalsign" }, os: { architecture: "amd64", type: "linux" } }
2018-11-20T16:44:06.896+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701
2018-11-20T16:44:08.397+0800 I NETWORK  [thread2] connection accepted from 10.200.20.21:25687 #18189 (15 connections now open)
2018-11-20T16:44:08.397+0800 I NETWORK  [conn18189] received client metadata from 10.200.20.21:25687 conn18189: { driver: { version: "globalsign", name: "mgo" }, os: { type: "linux", architecture: "amd64" } }
2018-11-20T16:44:09.897+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out
2018-11-20T16:44:09.897+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection
2018-11-20T16:44:09.897+0800 I NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out
2018-11-20T16:44:09.898+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701
2018-11-20T16:44:12.901+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out
2018-11-20T16:44:12.901+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection
2018-11-20T16:44:12.901+0800 I NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out
2018-11-20T16:44:13.403+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701
2018-11-20T16:44:16.405+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out
2018-11-20T16:44:16.405+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection
2018-11-20T16:44:16.405+0800 I NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out
2018-11-20T16:44:16.907+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701
2018-11-20T16:44:19.909+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out
2018-11-20T16:44:19.909+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection
2018-11-20T16:44:19.909+0800 I NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out
2018-11-20T16:44:19.910+0800 I -        [conn18188] end connection 10.200.20.21:20006 (15 connections now open)
 
# entered ctrl+C on client here
 
2018-11-20T16:44:26.151+0800 I -        [conn18189] end connection 10.200.20.21:25687 (14 connections now open)

5. When retrying, mongos keeps a CLOSE_WAIT socket with the client.

mongos  32131 qboxserver   34u     IPv4         3719101316      0t0        TCP 10.200.20.59:2710->10.200.20.21:25057 (CLOSE_WAIT)
mongos  32131 qboxserver  137u     IPv4         3719114023      0t0        TCP 10.200.20.59:2710->10.200.20.21:39569 (ESTABLISHED)

 

Participants:

 Description   

When a mongod is dropped, mongos keeps retrying to connect to mongod even if the client has killed the connection.

The mongos rapidly inrcrease its open files if mongos do not stop retrying to connnect to mongod if it has got a FIN from the client.



 Comments   
Comment by Feng Liyuan [X] [ 27/Nov/18 ]

Hi Daneil,

In our production environment, the client keeps to query mongos in 2000 concurrency. If one mongd occurs something wrong, all mongos would keep thousands of CLOSE_WAIT connections to reach to fd limit.

Is it more graceful if the second connection should stop retrying if the first connection has gone.

 

Besides, is there an option to control the connection timeout from mongos to mongod? The mongos' log said a creation of one connection took 98 seconds.

2018-11-19T20:24:11.424+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-5-0] Successfully connected to 10.20.74.29:2848, took 98977ms (69 connections now open to 10.20.74.29:2848)

And is there an option to control the kMaxNumFailedHostRetryAttempts?

const int kMaxNumFailedHostRetryAttempts = 3;

 

Comment by Danny Hatcher (Inactive) [ 26/Nov/18 ]

Hello Feng,

There are two distinct connections in use during this scenario. The first connection is between the driver and the mongos, the second connection is between the mongos and the mongod. Apart from a few options such as maxTimeMS, the driver has little control over the connection between the mongos and the mongod. If the mongos can no longer contact the mongod, it will retry connections regardless of the driver's status. If you are looking at configuring the connection options between the mongos and the mongod, I recommend you read the documentation for the various ShardingTaskExecutorPool settings.

Given the above, I will close this ticket as working as designed.

Thank you,

Danny

Generated at Thu Feb 08 04:48:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.