How to reproduce SERVER-11332: when 1st config server is down, some operations wait for network timeout

******************************************************************************

Setup:

$ cat keyfile
qwertyuiop

Start in different terminals:

$ mongod --vvvv --configsvr --dbpath config1 --port 27001 --auth --keyFile keyfile
$ mongod --vvvv --configsvr --dbpath config2 --port 27002 --auth --keyFile keyfile
$ mongod --vvvv --configsvr --dbpath config3 --port 27003 --auth --keyFile keyfile
$ mongos --configdb localhost:27001,localhost:27002,localhost:27003 --vvvv --keyFile keyfile
$ mongod --vvvv --port 27000 --dbpath data1/ --auth --keyFile keyfile

Create a user or two and insert test data:

$ mongo
> use admin
> db.addUser( { user: "hingo", pwd: "password", roles: [ "readWriteAnyDatabase", "userAdminAnyDatabase", "dbAdminAnyDatabase", "clusterAdmin" ] } )
> use test
> db.addUser( { user: "hingo", pwd: "password", roles: [ "readWriteAnyDatabase", "userAdminAnyDatabase", "dbAdminAnyDatabase", "clusterAdmin" ] } )
> db.foo.insert( { v : "bar" } )

****************************************************************************

Create a test script that connects to mongos and reads the test db every second:

$ cat readloop.sh
#!/bin/bash
while sleep 1
do
    date
    # hingo is defined in admin db, so must connect there
    echo "db.getSiblingDB( 'test').foo.findOne()" | mongo --username hingo --password password admin
done

$ bash readloop.sh
Thu Dec 26 00:56:44 EET 2013
MongoDB shell version: 2.4.8
connecting to: admin
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 00:56:45 EET 2013
MongoDB shell version: 2.4.8
connecting to: admin
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 00:56:46 EET 2013
MongoDB shell version: 2.4.8
connecting to: admin
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 00:56:47 EET 2013
MongoDB shell version: 2.4.8
connecting to: admin
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 00:56:48 EET 2013
MongoDB shell version: 2.4.8
connecting to: admin
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
--> Ctrl-Z on configsvr 1 here
Thu Dec 26 00:56:49 EET 2013
MongoDB shell version: 2.4.8
connecting to: admin
--> 2 minutes until timeout. This seems to be completely dependent on OS configuration. I have Ubuntu 12.04.
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 00:58:50 EET 2013
MongoDB shell version: 2.4.8
connecting to: admin
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
--> fg on configsvr 1 here
Thu Dec 26 01:00:52 EET 2013
MongoDB shell version: 2.4.8
connecting to: admin
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:02:53 EET 2013
MongoDB shell version: 2.4.8
connecting to: admin
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
...

**************************************************************************************

$ cat readloop.sh
#!/bin/bash
while sleep 1
do
    date
    # henrik is defined in test db, so must connect there
    echo "db.foo.findOne()" | mongo --username henrik --password password test
done

Unfortunately, it also happens when the user is defined in the sharded db. Is there any good reason for that?

$ bash readloop.sh
Thu Dec 26 01:09:50 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:09:51 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:09:52 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:10:54 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:11:55 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:12:06 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:12:07 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
...

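The stalls in the output above only show up as gaps between timestamps (01:09:52 to 01:10:54 to 01:11:55). A minimal sketch of how the reproduction could be made less manual follows. The helper names timed, classify and pause_for are hypothetical and not part of the original test, and using SIGSTOP/SIGCONT in place of the interactive Ctrl-Z and fg is an assumption that the config server reacts the same way to both.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original reproduction.

# timed CMD [ARGS...]: run a command, discard its output, and print the
# elapsed wall-clock time in whole seconds.
timed() {
    local start end
    start=$(date +%s)
    "$@" > /dev/null 2>&1 || true
    end=$(date +%s)
    echo $(( end - start ))
}

# classify LIMIT SECONDS: label one measurement as a stall or as normal.
classify() {
    local limit=$1 secs=$2
    if [ "$secs" -gt "$limit" ]; then
        echo "STALL ${secs}s"
    else
        echo "ok ${secs}s"
    fi
}

# pause_for PID SECONDS: suspend a process, wait, then resume it.
# Assumption: SIGSTOP/SIGCONT stand in for the manual Ctrl-Z and fg.
pause_for() {
    local pid=$1 secs=$2
    kill -STOP "$pid"
    sleep "$secs"
    kill -CONT "$pid"
}
```

With these defined, one iteration of the read loop could be written as

    classify 5 "$(timed mongo --username henrik --password password test --eval 'db.foo.findOne()')"

and the first config server could be suspended from another terminal with pause_for followed by its PID and a duration in seconds. The 5 second limit is arbitrary.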
****************************************************************************

Restart cluster with auth disabled:

$ mongod --vvvv --configsvr --dbpath config1 --port 27001
$ mongod --vvvv --configsvr --dbpath config2 --port 27002
$ mongod --vvvv --configsvr --dbpath config3 --port 27003
$ mongos --configdb localhost:27001,localhost:27002,localhost:27003 --vvvv
$ mongod --vvvv --port 27000 --dbpath data1/

******************************************************************************

This is just to verify that, as we have speculated, the issue goes away when authentication is not used:

$ cat readloop.sh
#!/bin/bash
while sleep 1
do
    date
    # auth is disabled, so no credentials are needed
    echo "db.foo.findOne()" | mongo
done

$ bash readloop.sh
Thu Dec 26 01:27:43 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:44 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:45 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:46 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:47 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:48 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:49 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:51 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:52 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:53 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:54 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:55 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:56 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:57 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:58 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:27:59 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:28:00 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
Thu Dec 26 01:28:01 EET 2013
MongoDB shell version: 2.4.8
connecting to: test
{ "_id" : ObjectId("52bb5fb8d741e37cda2f195b"), "v" : "bar" }
bye
^C

**********************************************************************

This is the mongos -vvvv log from one of the failed connections:

Thu Dec 26 01:10:54.088 [mongosMain] connection accepted from 127.0.0.1:52259 #67 (1 connection now open)
Thu Dec 26 01:10:54.088 [conn67] trying reconnect to localhost:27001
Thu Dec 26 01:10:54.088 BackgroundJob starting: ConnectBG
Thu Dec 26 01:10:54.088 [conn67] reconnect localhost:27001 ok
Thu Dec 26 01:10:54.302 [Balancer] Socket recv() timeout 127.0.0.1:27001
Thu Dec 26 01:10:54.302 [Balancer] SocketException: remote: 127.0.0.1:27001 error: 9001 socket exception [RECV_TIMEOUT] server [127.0.0.1:27001]
Thu Dec 26 01:10:54.302 [Balancer] DBClientCursor::init call() failed
Thu Dec 26 01:10:54.302 [Balancer] User Assertion: 10276:DBClientBase::findN: transport error: localhost:27001 ns: local.$cmd query: { getnonce: 1 }
Thu Dec 26 01:10:54.302 [Balancer] query failed to: localhost:27001 exception: DBClientBase::findN: transport error: localhost:27001 ns: local.$cmd query: { getnonce: 1 }
Thu Dec 26 01:10:54.303 [Balancer] Refreshing MaxChunkSize: 64
Thu Dec 26 01:10:54.303 [Balancer] trying reconnect to localhost:27001
Thu Dec 26 01:10:54.303 BackgroundJob starting: ConnectBG
Thu Dec 26 01:10:54.303 [Balancer] reconnect localhost:27001 ok
Thu Dec 26 01:11:14.422 Socket recv() timeout 127.0.0.1:27001
Thu Dec 26 01:11:14.422 SocketException: remote: 127.0.0.1:27001 error: 9001 socket exception [RECV_TIMEOUT] server [127.0.0.1:27001]
Thu Dec 26 01:11:14.422 DBClientCursor::init call() failed
Thu Dec 26 01:11:14.422 User Assertion: 10276:DBClientBase::findN: transport error: localhost:27001 ns: config.$cmd query: { dbhash: 1, collections: [ "chunks", "databases" ] }
Thu Dec 26 01:11:14.422 warning: couldn't check dbhash on config server localhost:27001 :: caused by :: 10276 DBClientBase::findN: transport error: localhost:27001 ns: config.$cmd query: { dbhash: 1, collections: [ "chunks", "databases" ] }
Thu Dec 26 01:11:15.390 [LockPinger] Socket recv() timeout 127.0.0.1:27001
Thu Dec 26 01:11:15.390 [LockPinger] SocketException: remote: 127.0.0.1:27001 error: 9001 socket exception [RECV_TIMEOUT] server [127.0.0.1:27001]
Thu Dec 26 01:11:15.390 [LockPinger] DBClientCursor::init call() failed
Thu Dec 26 01:11:15.390 [LockPinger] User Assertion: 10276:DBClientBase::findN: transport error: localhost:27001 ns: local.$cmd query: { getnonce: 1 }
Thu Dec 26 01:11:15.442 [LockPinger] scoped connection to localhost:27001,localhost:27002,localhost:27003 not being returned to the pool
Thu Dec 26 01:11:15.442 [LockPinger] warning: distributed lock pinger 'localhost:27001,localhost:27002,localhost:27003/hingo-sputnik:27017:1388010584:1804289383' detected an exception while pinging. :: caused by :: SyncClusterConnection::udpate prepare failed: localhost:27001:10276 DBClientBase::findN: transport error: localhost:27001 ns: local.$cmd query: { getnonce: 1 }
Thu Dec 26 01:11:24.086 [conn67] Socket recv() timeout 127.0.0.1:27001
Thu Dec 26 01:11:24.086 [conn67] SocketException: remote: 127.0.0.1:27001 error: 9001 socket exception [RECV_TIMEOUT] server [127.0.0.1:27001]
Thu Dec 26 01:11:24.086 [conn67] DBClientCursor::init call() failed
Thu Dec 26 01:11:24.086 [conn67] User Assertion: 10276:DBClientBase::findN: transport error: localhost:27001 ns: local.$cmd query: { getnonce: 1 }
Thu Dec 26 01:11:24.086 [conn67] query failed to: localhost:27001 exception: DBClientBase::findN: transport error: localhost:27001 ns: local.$cmd query: { getnonce: 1 }
Thu Dec 26 01:11:24.087 [conn67] Request::process begin ns: admin.$cmd msg id: 0 op: 2004 attempt: 0
Thu Dec 26 01:11:24.087 [conn67] single query: admin.$cmd { whatsmyuri: 1 } ntoreturn: 1 options : 0
Thu Dec 26 01:11:24.087 [conn67] Request::process end ns: admin.$cmd msg id: 0 op: 2004 attempt: 0 0ms
Thu Dec 26 01:11:24.089 [conn67] Request::process begin ns: test.$cmd msg id: 1 op: 2004 attempt: 0
Thu Dec 26 01:11:24.089 [conn67] single query: test.$cmd { getnonce: 1 } ntoreturn: 1 options : 0
Thu Dec 26 01:11:24.089 [conn67] Request::process end ns: test.$cmd msg id: 1 op: 2004 attempt: 0 0ms
Thu Dec 26 01:11:24.089 [conn67] Request::process begin ns: test.$cmd msg id: 2 op: 2004 attempt: 0
Thu Dec 26 01:11:24.089 [conn67] single query: test.$cmd { authenticate: 1, nonce: "28cd283a70c93ec3", user: "henrik", key: "ecec87f680f9fcfee7a8a80e77b715be" } ntoreturn: 1 options : 0
Thu Dec 26 01:11:24.089 [conn67] authenticate db: test { authenticate: 1, nonce: "28cd283a70c93ec3", user: "henrik", key: "ecec87f680f9fcfee7a8a80e77b715be" }
Thu Dec 26 01:11:24.090 [conn67] trying reconnect to localhost:27001
Thu Dec 26 01:11:24.091 BackgroundJob starting: ConnectBG
Thu Dec 26 01:11:24.091 [conn67] reconnect localhost:27001 ok
Thu Dec 26 01:11:24.302 [Balancer] Socket recv() timeout 127.0.0.1:27001
Thu Dec 26 01:11:24.302 [Balancer] SocketException: remote: 127.0.0.1:27001 error: 9001 socket exception [RECV_TIMEOUT] server [127.0.0.1:27001]
Thu Dec 26 01:11:24.302 [Balancer] DBClientCursor::init call() failed
Thu Dec 26 01:11:24.302 [Balancer] User Assertion: 10276:DBClientBase::findN: transport error: localhost:27001 ns: local.$cmd query: { getnonce: 1 }
Thu Dec 26 01:11:24.302 [Balancer] query failed to: localhost:27001 exception: DBClientBase::findN: transport error: localhost:27001 ns: local.$cmd query: { getnonce: 1 }
Thu Dec 26 01:11:24.304 [Balancer] trying to acquire new distributed lock for balancer on localhost:27001,localhost:27002,localhost:27003 ( lock timeout : 900000, ping interval : 30000, process : hingo-sputnik:27017:1388010584:1804289383 )
Thu Dec 26 01:11:44.802 [PeriodicTask::Runner] task: DBConnectionPool-cleaner took: 0ms
Thu Dec 26 01:11:44.802 [PeriodicTask::Runner] task: DBConnectionPool-cleaner took: 0ms
Thu Dec 26 01:11:45.442 [LockPinger] distributed lock pinger 'localhost:27001,localhost:27002,localhost:27003/hingo-sputnik:27017:1388010584:1804289383' about to ping.
Thu Dec 26 01:11:45.442 [LockPinger] trying reconnect to localhost:27001
Thu Dec 26 01:11:45.442 BackgroundJob starting: ConnectBG
Thu Dec 26 01:11:45.443 [LockPinger] reconnect localhost:27001 ok
Thu Dec 26 01:11:54.090 [conn67] Socket recv() timeout 127.0.0.1:27001
Thu Dec 26 01:11:54.090 [conn67] SocketException: remote: 127.0.0.1:27001 error: 9001 socket exception [RECV_TIMEOUT] server [127.0.0.1:27001]
Thu Dec 26 01:11:54.090 [conn67] DBClientCursor::init call() failed
Thu Dec 26 01:11:54.090 [conn67] User Assertion: 10276:DBClientBase::findN: transport error: localhost:27001 ns: local.$cmd query: { getnonce: 1 }
Thu Dec 26 01:11:54.090 [conn67] query failed to: localhost:27001 exception: DBClientBase::findN: transport error: localhost:27001 ns: local.$cmd query: { getnonce: 1 }
Thu Dec 26 01:11:54.092 [conn67] Request::process end ns: test.$cmd msg id: 2 op: 2004 attempt: 0 30002ms
Thu Dec 26 01:11:54.096 [conn67] Request::process begin ns: admin.$cmd msg id: 3 op: 2004 attempt: 0
Thu Dec 26 01:11:54.096 [conn67] single query: admin.$cmd { replSetGetStatus: 1.0, forShell: 1.0 } ntoreturn: -1 options : 0
Thu Dec 26 01:11:54.096 [conn67] Request::process end ns: admin.$cmd msg id: 3 op: 2004 attempt: 0 0ms
Thu Dec 26 01:11:54.098 [conn67] Request::process begin ns: test.foo msg id: 4 op: 2004 attempt: 0
Thu Dec 26 01:11:54.098 [conn67] shard query: test.foo {}
Thu Dec 26 01:11:54.098 [conn67] [pcursor] creating pcursor over QSpec { ns: "test.foo", n2skip: 0, n2return: -1, options: 0, query: {}, fields: {} } and CInfo { v_ns: "", filter: {} }
Thu Dec 26 01:11:54.098 [conn67] [pcursor] initializing over 1 shards required by [unsharded @ shard0000:localhost:27000]
Thu Dec 26 01:11:54.098 [conn67] [pcursor] initializing on shard shard0000:localhost:27000, current connection state is { state: {}, retryNext: false, init: false, finish: false, errored: false }
Thu Dec 26 01:11:54.099 [conn67] [pcursor] initialized query (lazily) on shard shard0000:localhost:27000, current connection state is { state: { conn: "localhost:27000", vinfo: "shard0000:localhost:27000", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: false, errored: false }
Thu Dec 26 01:11:54.099 [conn67] [pcursor] finishing over 1 shards
Thu Dec 26 01:11:54.099 [conn67] [pcursor] finishing on shard shard0000:localhost:27000, current connection state is { state: { conn: "localhost:27000", vinfo: "shard0000:localhost:27000", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: false, errored: false }
Thu Dec 26 01:11:54.099 [conn67] [pcursor] finished on shard shard0000:localhost:27000, current connection state is { state: { conn: "(done)", vinfo: "shard0000:localhost:27000", cursor: { _id: ObjectId('52bb5fb8d741e37cda2f195b'), v: "bar" }, count: 0, done: false }, retryNext: false, init: true, finish: true, errored: false }
Thu Dec 26 01:11:54.099 [conn67] Request::process end ns: test.foo msg id: 4 op: 2004 attempt: 0 0ms
Thu Dec 26 01:11:54.108 [conn67] Request::process begin ns: admin.$cmd msg id: 5 op: 2004 attempt: 0
Thu Dec 26 01:11:54.108 [conn67] single query: admin.$cmd { replSetGetStatus: 1.0, forShell: 1.0 } ntoreturn: -1 options : 0
Thu Dec 26 01:11:54.108 [conn67] Request::process end ns: admin.$cmd msg id: 5 op: 2004 attempt: 0 0ms
Thu Dec 26 01:11:54.110 [conn67] Socket recv() conn closed? 127.0.0.1:52259
Thu Dec 26 01:11:54.110 [conn67] SocketException: remote: 127.0.0.1:52259 error: 9001 socket exception [CLOSED] server [127.0.0.1:52259]
Thu Dec 26 01:11:54.110 [conn67] end connection 127.0.0.1:52259 (0 connections now open)
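
In the log above, every "Request::process end" line for conn67 reports 0ms except the authenticate command (msg id: 2), which reports 30002ms. A small filter along these lines could pull such requests out of a longer log. The function name slow_requests and the file name mongos.log are hypothetical, and the sketch assumes the duration is always the last field on the line, as it is in this excerpt.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original reproduction.
# slow_requests THRESHOLD_MS [FILE...]: print the "Request::process end"
# lines whose trailing duration is at least THRESHOLD_MS milliseconds.
# Reads standard input when no file is given.
slow_requests() {
    local threshold=$1
    shift
    awk -v t="$threshold" '
        /Request::process end/ {
            ms = $NF              # last field, e.g. "30002ms"
            sub(/ms$/, "", ms)    # strip the unit
            if (ms + 0 >= t) print
        }' "$@"
}
```

For example, slow_requests 1000 mongos.log would list only requests that took a second or longer. Run against the excerpt above, that is the single authenticate request.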