[SERVER-11322] Shard router does not connect to the config servers Created: 23/Oct/13  Updated: 10/Dec/14  Resolved: 05/Dec/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dharshan Rangegowda Assignee: David Storch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Config servers - 2.4.1, Shard routers are 2.4.6


Issue Links:
Duplicate
is duplicated by SERVER-11321 db.locks.find() on the config server ... Closed
Operating System: ALL
Participants:

 Description   

We had temporary network issues lasting a few hours. Afterwards, the shard routers failed to connect to the config servers. We had to provision new shard routers, which appear to work fine. I see a lot of these errors:
"9001 socket exception [RECV_ERROR] server [162.243.34.127:27019]
Wed Oct 23 04:34:55.101 [conn4] call failed to: SG-edSpringProdSSDShard-1354.servers.mongodirector.com:27019 no data"

Wed Oct 23 03:51:07.624 [mongosMain] connection accepted from 24.16.96.170:61958 #2 (2 connections now open)
Wed Oct 23 03:51:07.800 [conn2] authenticate db: admin

{ authenticate: 1.0, user: "admin", nonce: "96b567075c7a1cdd", key: "fc212ee3f4b25c17a0c2d454e97ace8c" }

Wed Oct 23 03:51:17.377 [conn2] creating WriteBackListener for: SG-edSpringProdSSDShard-1359.servers.mongodirector.com:27017 serverID: 5267474390aa292bd1d1aedc
Wed Oct 23 03:51:17.377 [conn2] creating WriteBackListener for: SG-edSpringProdSSDShard-1360.servers.mongodirector.com:27017 serverID: 5267474390aa292bd1d1aedc
Wed Oct 23 03:51:17.384 [conn2] creating WriteBackListener for: SG-edSpringProdSSDShard-1362.servers.mongodirector.com:27017 serverID: 5267474390aa292bd1d1aedc
Wed Oct 23 03:51:17.384 [conn2] creating WriteBackListener for: SG-edSpringProdSSDShard-1363.servers.mongodirector.com:27017 serverID: 5267474390aa292bd1d1aedc
Wed Oct 23 03:51:17.385 [conn2] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019]
Wed Oct 23 03:51:17.394 [conn2] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1354.servers.mongodirector.com:27019]
Wed Oct 23 03:51:17.404 [conn2] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1355.servers.mongodirector.com:27019]
Wed Oct 23 03:53:37.396 [mongosMain] connection accepted from 24.16.96.170:62192 #3 (3 connections now open)
Wed Oct 23 03:53:37.585 [conn3] end connection 24.16.96.170:62192 (2 connections now open)
Wed Oct 23 03:54:51.748 [mongosMain] connection accepted from 24.16.96.170:62305 #4 (3 connections now open)
Wed Oct 23 03:54:51.956 [conn4] authenticate db: admin

{ authenticate: 1.0, user: "admin", nonce: "b2e8a323d2aad612", key: "3c257da656cbe8f088a0cb0f70cd8e8d" }

Wed Oct 23 03:54:54.607 [conn4] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019]
Wed Oct 23 03:54:54.633 [conn4] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1354.servers.mongodirector.com:27019]
Wed Oct 23 03:54:54.642 [conn4] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1355.servers.mongodirector.com:27019]
Wed Oct 23 04:09:24.390 [Balancer] Socket recv() errno:104 Connection reset by peer 162.243.34.126:27019
Wed Oct 23 04:09:24.390 [Balancer] SocketException: remote: 162.243.34.126:27019 error: 9001 socket exception [RECV_ERROR] server [162.243.34.126:27019]
Wed Oct 23 04:09:24.391 [Balancer] DBClientCursor::init call() failed
Wed Oct 23 04:09:24.391 [Balancer] Detected bad connection created at 1382500164387060 microSec, clearing pool for SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019
Wed Oct 23 04:09:24.393 [Balancer] scoped connection to SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019,SG-edSpringProdSSDShard-1354.servers.mongodirector.com:27019,SG-edSpringProdSSDShard-1355.servers.mongodirector.com:27019 not being returned to the pool
Wed Oct 23 04:09:24.393 [Balancer] caught exception while doing balance: DBClientBase::findN: transport error: SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019 ns: admin.$cmd query:

{ serverStatus: 1 }

Wed Oct 23 04:09:30.393 [Balancer] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019]
Wed Oct 23 04:09:30.402 [Balancer] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1354.servers.mongodirector.com:27019]
Wed Oct 23 04:09:30.428 [Balancer] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1355.servers.mongodirector.com:27019]
Wed Oct 23 04:11:17.889 [conn2] Socket recv() errno:104 Connection reset by peer 162.243.34.126:27019
Wed Oct 23 04:11:17.890 [conn2] SocketException: remote: 162.243.34.126:27019 error: 9001 socket exception [RECV_ERROR] server [162.243.34.126:27019]
Wed Oct 23 04:11:17.890 [conn2] call failed to: SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019 no data
Wed Oct 23 04:12:40.329 [mongosMain] connection accepted from 127.0.0.1:53957 #5 (4 connections now open)
Wed Oct 23 04:12:40.341 [conn5] authenticate db: admin

{ authenticate: 1, user: "admin", nonce: "24cac367070c4f50", key: "7b42d5354cf8a4afc84509e906621ddc" }

Wed Oct 23 04:12:40.357 [conn5] end connection 127.0.0.1:53957 (3 connections now open)
Wed Oct 23 04:12:45.242 [mongosMain] connection accepted from 127.0.0.1:53958 #6 (4 connections now open)
Wed Oct 23 04:12:45.247 [conn6] authenticate db: admin

{ authenticate: 1, user: "admin", nonce: "6910ab29638f827e", key: "7416a27c7de0f612da402ea9c26e17f8" }

Wed Oct 23 04:12:45.259 [conn6] end connection 127.0.0.1:53958 (3 connections now open)
Wed Oct 23 04:12:50.985 [mongosMain] connection accepted from 127.0.0.1:53960 #7 (4 connections now open)
Wed Oct 23 04:12:50.987 [conn7] authenticate db: admin

{ authenticate: 1, user: "admin", nonce: "c165c13b6085096c", key: "ff65f80289b49660c8cc8514b9ba8585" }

Wed Oct 23 04:12:50.997 [conn7] end connection 127.0.0.1:53960 (3 connections now open)
Wed Oct 23 04:12:55.804 [mongosMain] connection accepted from 127.0.0.1:53961 #8 (4 connections now open)
Wed Oct 23 04:12:55.808 [conn8] authenticate db: admin

{ authenticate: 1, user: "admin", nonce: "38491cc1734b1ab6", key: "133d746a012fb4083dd6f60d9e091635" }

Wed Oct 23 04:12:55.822 [conn8] end connection 127.0.0.1:53961 (3 connections now open)
Wed Oct 23 04:14:55.097 [conn4] Socket recv() errno:104 Connection reset by peer 162.243.34.126:27019
Wed Oct 23 04:14:55.097 [conn4] SocketException: remote: 162.243.34.126:27019 error: 9001 socket exception [RECV_ERROR] server [162.243.34.126:27019]
Wed Oct 23 04:14:55.097 [conn4] call failed to: SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019 no data
Wed Oct 23 04:29:30.463 [Balancer] Socket recv() errno:104 Connection reset by peer 162.243.34.126:27019
Wed Oct 23 04:29:30.463 [Balancer] SocketException: remote: 162.243.34.126:27019 error: 9001 socket exception [RECV_ERROR] server [162.243.34.126:27019]
Wed Oct 23 04:29:30.463 [Balancer] DBClientCursor::init call() failed
Wed Oct 23 04:29:30.463 [Balancer] Detected bad connection created at 1382501370458915 microSec, clearing pool for SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019
Wed Oct 23 04:29:30.464 [Balancer] scoped connection to SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019,SG-edSpringProdSSDShard-1354.servers.mongodirector.com:27019,SG-edSpringProdSSDShard-1355.servers.mongodirector.com:27019 not being returned to the pool
Wed Oct 23 04:29:30.464 [Balancer] caught exception while doing balance: DBClientBase::findN: transport error: SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019 ns: admin.$cmd query:

{ serverStatus: 1 }

Wed Oct 23 04:29:36.465 [Balancer] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1353.servers.mongodirector.com:27019]
Wed Oct 23 04:29:36.475 [Balancer] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1354.servers.mongodirector.com:27019]
Wed Oct 23 04:29:36.478 [Balancer] SyncClusterConnection connecting to [SG-edSpringProdSSDShard-1355.servers.mongodirector.com:27019]
Wed Oct 23 04:31:17.893 [conn2] Socket recv() errno:104 Connection reset by peer 162.243.34.127:27019
Wed Oct 23 04:31:17.893 [conn2] SocketException: remote: 162.243.34.127:27019 error: 9001 socket exception [RECV_ERROR] server [162.243.34.127:27019]
Wed Oct 23 04:31:17.893 [conn2] call failed to: SG-edSpringProdSSDShard-1354.servers.mongodirector.com:27019 no data
Wed Oct 23 04:34:55.100 [conn4] Socket recv() errno:104 Connection reset by peer 162.243.34.127:27019
Wed Oct 23 04:34:55.101 [conn4] SocketException: remote: 162.243.34.127:27019 error: 9001 socket exception [RECV_ERROR] server [162.243.34.127:27019]
Wed Oct 23 04:34:55.101 [conn4] call failed to: SG-edSpringProdSSDShard-1354.servers.mongodirector.com:27019 no data
Wed Oct 23 04:49:13.309 [mongosMain] dbexit: received signal 15 rc:0 received signal 15
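
As a rough aid for reading a log like the one above, the following Python sketch tallies "SocketException" lines per remote address, which makes it easier to see which config servers the router was failing against. The regular expression assumes the log layout shown in this report; the function name count_socket_exceptions and the sample_lines list are hypothetical and used only for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the lines in this report:
#   <weekday> <month> <day> <HH:MM:SS.mmm> [<thread>] SocketException: remote: <host:port> error: <code> ...
SOCKET_EXCEPTION = re.compile(
    r"^\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d{3} "
    r"\[(?P<thread>[^\]]+)\] SocketException: remote: (?P<remote>\S+) "
    r"error: (?P<code>\d+) "
)


def count_socket_exceptions(lines):
    """Return a Counter mapping each remote address to its SocketException count."""
    counts = Counter()
    for line in lines:
        match = SOCKET_EXCEPTION.match(line)
        if match:
            counts[match.group("remote")] += 1
    return counts


# A few lines copied from the log above; the second one is not a
# SocketException line and is ignored by the pattern.
sample_lines = [
    "Wed Oct 23 04:09:24.390 [Balancer] SocketException: remote: 162.243.34.126:27019 error: 9001 socket exception [RECV_ERROR] server [162.243.34.126:27019]",
    "Wed Oct 23 04:09:24.391 [Balancer] DBClientCursor::init call() failed",
    "Wed Oct 23 04:11:17.890 [conn2] SocketException: remote: 162.243.34.126:27019 error: 9001 socket exception [RECV_ERROR] server [162.243.34.126:27019]",
    "Wed Oct 23 04:31:17.893 [conn2] SocketException: remote: 162.243.34.127:27019 error: 9001 socket exception [RECV_ERROR] server [162.243.34.127:27019]",
]

counts = count_socket_exceptions(sample_lines)
for remote, n in counts.most_common():
    print(remote, n)
```

To run it against a complete log file, pass the open file object instead of sample_lines, since the function only iterates over lines.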



 Comments   
Comment by David Storch [ 04/Nov/13 ]

Hi Dharshan,

In order to help us move forward, could you please provide some clarification about the problem you are trying to solve:

  • Are you concerned that there might be some fallout or lingering problems caused by the temporary network issues you experienced?
  • Is the problem that the sharding routers did not recover automatically following the network issues, i.e. that recovery required you to restart the MongoS?
  • Are you trying to diagnose the root cause of the network problems?

Finally, could you please provide the complete logs from the MongoS instance during the period in which you experienced connection problems between MongoS and the config server?
