Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-38220

When a mongod is dropped, mongos keeps retrying to connect to mongod even if the client has killed the connection.

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Works as Designed
    • Affects Version/s: 3.4.18
    • Fix Version/s: None
    • Component/s: Networking, Sharding
    • Labels:
      None
    • Operating System:
      ALL
    • Steps To Reproduce:
      Hide

      1. my environment is 

      10.200.20.21 client
      10.200.20.59:2710 mongos
      10.200.20.64:2701 mongod
      

      my client is a go program that do a single query

      package main
       
      import (
      	"log"
      	"time"
       
      	"github.com/globalsign/mgo"
      	"gopkg.in/mgo.v2/bson"
      )
       
      func main() {
      	s, err := mgo.Dial("10.200.20.59:2710")
      	if err != nil {
      		panic(err)
      	}
      	s.SetSocketTimeout(1.5e9)
      	coll := s.DB("beta_rs").C("rs")
      	var ret interface{}
      	log.Println("start")
      	err = coll.Find(bson.M{"_id": "8h1hkp:BatchUploadFiles_98Dk5t7Mkb"}).One(&ret)
      	log.Println(ret, err)
      	time.Sleep(100e9)
      }
      

      2. drop all connection from mongos to mongod
      run this script on mongod's server:

      iptables  -I  qiniu-shield-chain  4  -s 10.200.20.59 -p tcp  --dport 2701  -j DROP
      

      3. run the client, the query keeps about 1.5 second and meets SocketTimeout, then the client killed the connection.

      qboxserver@cs1:~$ ./testrs
      2018/11/20 16:44:06 start
      2018/11/20 16:44:07 <nil> read tcp 10.200.20.21:20006->10.200.20.59:2710: i/o timeout
      ^C
      

      4. But mongos keeps the connection 13 seconds, trys to connect to mongod 4 times.

       2018-11-20T16:44:06.894+0800 I NETWORK  [thread2] connection accepted from 10.200.20.21:20006 #18188 (14 connections now open)
      2018-11-20T16:44:06.895+0800 I NETWORK  [conn18188] received client metadata from 10.200.20.21:20006 conn18188: { driver: { name: "mgo", version: "globalsign" }, os: { architecture: "amd64", type: "linux" } }
      2018-11-20T16:44:06.896+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701
      2018-11-20T16:44:08.397+0800 I NETWORK  [thread2] connection accepted from 10.200.20.21:25687 #18189 (15 connections now open)
      2018-11-20T16:44:08.397+0800 I NETWORK  [conn18189] received client metadata from 10.200.20.21:25687 conn18189: { driver: { version: "globalsign", name: "mgo" }, os: { type: "linux", architecture: "amd64" } }
      2018-11-20T16:44:09.897+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out
      2018-11-20T16:44:09.897+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection
      2018-11-20T16:44:09.897+0800 I NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out
      2018-11-20T16:44:09.898+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701
      2018-11-20T16:44:12.901+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out
      2018-11-20T16:44:12.901+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection
      2018-11-20T16:44:12.901+0800 I NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out
      2018-11-20T16:44:13.403+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701
      2018-11-20T16:44:16.405+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out
      2018-11-20T16:44:16.405+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection
      2018-11-20T16:44:16.405+0800 I NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out
      2018-11-20T16:44:16.907+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701
      2018-11-20T16:44:19.909+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out
      2018-11-20T16:44:19.909+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection
      2018-11-20T16:44:19.909+0800 I NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out
      2018-11-20T16:44:19.910+0800 I -        [conn18188] end connection 10.200.20.21:20006 (15 connections now open)
       
      # entered ctrl+C on client here
       
      2018-11-20T16:44:26.151+0800 I -        [conn18189] end connection 10.200.20.21:25687 (14 connections now open)
      

      5. When retrying, mongos keeps a CLOSE_WAIT socket with the client.

      mongos  32131 qboxserver   34u     IPv4         3719101316      0t0        TCP 10.200.20.59:2710->10.200.20.21:25057 (CLOSE_WAIT)
      mongos  32131 qboxserver  137u     IPv4         3719114023      0t0        TCP 10.200.20.59:2710->10.200.20.21:39569 (ESTABLISHED)
      

       

      Show
      1. my environment is  10.200.20.21 client 10.200.20.59:2710 mongos 10.200.20.64:2701 mongod my client is a go program that do a single query package main   import ( "log" "time"   "github.com/globalsign/mgo" "gopkg.in/mgo.v2/bson" )   func main() { s, err := mgo.Dial("10.200.20.59:2710") if err != nil { panic(err) } s.SetSocketTimeout(1.5e9) coll := s.DB("beta_rs").C("rs") var ret interface{} log.Println("start") err = coll.Find(bson.M{"_id": "8h1hkp:BatchUploadFiles_98Dk5t7Mkb"}).One(&ret) log.Println(ret, err) time.Sleep(100e9) } 2. drop all connection from mongos to mongod run this script on mongod's server: iptables -I qiniu-shield-chain 4 -s 10.200.20.59 -p tcp --dport 2701 -j DROP 3. run the client, the query keeps about 1.5 second and meets SocketTimeout, then the client killed the connection. qboxserver@cs1:~$ ./testrs 2018/11/20 16:44:06 start 2018/11/20 16:44:07 <nil> read tcp 10.200.20.21:20006->10.200.20.59:2710: i/o timeout ^C 4. But mongos keeps the connection 13 seconds, trys to connect to mongod 4 times. 2018-11-20T16:44:06.894+0800 I NETWORK [thread2] connection accepted from 10.200.20.21:20006 #18188 (14 connections now open) 2018-11-20T16:44:06.895+0800 I NETWORK [conn18188] received client metadata from 10.200.20.21:20006 conn18188: { driver: { name: "mgo", version: "globalsign" }, os: { architecture: "amd64", type: "linux" } } 2018-11-20T16:44:06.896+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701 2018-11-20T16:44:08.397+0800 I NETWORK [thread2] connection accepted from 10.200.20.21:25687 #18189 (15 connections now open) 2018-11-20T16:44:08.397+0800 I NETWORK [conn18189] received client metadata from 10.200.20.21:25687 conn18189: { driver: { version: "globalsign", name: "mgo" }, os: { type: "linux", architecture: "amd64" } } 2018-11-20T16:44:09.897+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out 2018-11-20T16:44:09.897+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection 2018-11-20T16:44:09.897+0800 I NETWORK [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out 2018-11-20T16:44:09.898+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701 2018-11-20T16:44:12.901+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out 2018-11-20T16:44:12.901+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection 2018-11-20T16:44:12.901+0800 I NETWORK [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out 2018-11-20T16:44:13.403+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701 2018-11-20T16:44:16.405+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out 2018-11-20T16:44:16.405+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection 2018-11-20T16:44:16.405+0800 I NETWORK [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out 2018-11-20T16:44:16.907+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Connecting to 10.200.20.64:2701 2018-11-20T16:44:19.909+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Failed to connect to 10.200.20.64:2701 - HostUnreachable: Connection timed out 2018-11-20T16:44:19.909+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-35-0] Dropping all pooled connections to 10.200.20.64:2701 due to failed operation on a connection 2018-11-20T16:44:19.909+0800 I NETWORK [NetworkInterfaceASIO-TaskExecutorPool-35-0] Marking host 10.200.20.64:2701 as failed :: caused by :: HostUnreachable: Connection timed out 2018-11-20T16:44:19.910+0800 I - [conn18188] end connection 10.200.20.21:20006 (15 connections now open)   # entered ctrl+C on client here   2018-11-20T16:44:26.151+0800 I - [conn18189] end connection 10.200.20.21:25687 (14 connections now open) 5. When retrying, mongos keeps a CLOSE_WAIT socket with the client. mongos 32131 qboxserver 34u IPv4 3719101316 0t0 TCP 10.200.20.59:2710->10.200.20.21:25057 (CLOSE_WAIT) mongos 32131 qboxserver 137u IPv4 3719114023 0t0 TCP 10.200.20.59:2710->10.200.20.21:39569 (ESTABLISHED)  

      Description

      When a mongod is dropped, mongos keeps retrying to connect to mongod even if the client has killed the connection.

      The mongos rapidly inrcrease its open files if mongos do not stop retrying to connnect to mongod if it has got a FIN from the client.

        Attachments

          Activity

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: