[SERVER-7297] Queries against sharded collection return "all servers down" Created: 08/Oct/12  Updated: 15/Feb/13  Resolved: 02/Jan/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Stan McQueen Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu Linux 12.04. All servers are at the 2.2.0 version.


Operating System: ALL
Participants:

 Description   

The cluster has two replica sets of three servers each, three mongod config servers, and two mongos application servers. I created a sharded configuration with the two replica sets and sharded some collections. After bouncing all the servers, queries in the shell against non-sharded collections succeed, but queries against sharded collections return:

MongoDB shell version: 2.2.0
connecting to: mongos:27017/test
mongos> use policydb;
switched to db policydb
mongos> db.policies.findOne();
Mon Oct 8 21:44:44 uncaught exception: error {
"$err" : "setShardVersion failed host: cw-mongodb2-test:27017 { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), errmsg: \"exception: all servers down!\", code: 8002, ok: 0.0 }",
"code" : 10429
}



 Comments   
Comment by xiaxiang lin [ 24/Dec/12 ]

I fixed it by following your hints. Thanks a lot!

Comment by Stan McQueen [ 23/Dec/12 ]

My issue turned out to be a missing host file entry. Each server was accessible from every other server, but not under the host names that the replication set was created with. Check your mongo logs carefully to see if there is a "failure to connect" message in one or more of them. That is what eventually led to my finding the resolution of the problem.
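The log check described above can be sketched as a small script. This is only an illustration, not a MongoDB tool: the function name `find_connect_failures` is hypothetical, and the two search phrases are the ones quoted in this ticket ("failure to connect" and "all servers down"), so other server versions may word these messages differently.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Phrases quoted in this ticket. Other versions may log different wording,
# so treat this list as an assumption and adjust it as needed.
PATTERNS = [r"failure to connect", r"all servers down"]


def find_connect_failures(
    lines: Iterable[str], patterns: List[str] = PATTERNS
) -> List[Tuple[int, str]]:
    """Return (line number, text) for every line matching one of the patterns."""
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]


if __name__ == "__main__":
    # Each command-line argument is a path to a mongod or mongos log file.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, text in find_connect_failures(handle):
                print(f"{path}:{number}: {text}")
```

Running it against the logs of every mongod and mongos in the cluster, not just the mongos that returned the error, matches the advice above, since the telling message may be in only one of them.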

Comment by xiaxiang lin [ 23/Dec/12 ]

I am encountering the same problem. My cluster's settings are almost the same as Stan's. I have checked the connections between the mongos, the config servers, and the replica sets, and there seems to be no problem.

Comment by Stan McQueen [ 01/Nov/12 ]

While I was preparing the logs to include, I did notice a "failure to connect" message. Upon looking into it, I found that I had omitted a host entry for the mongod servers. Interestingly, addShard and enableSharding worked and did not cause a problem with queries; but when shardCollection was executed, the command succeeded and sh.status seemed to indicate that all was well even though queries would fail. Now that I have added the missing host entries, all is working as expected. Thanks for your help and I apologize for entering a bug that was actually all my own fault.
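Because the root cause here was a host name that did not resolve, one way to catch it early is to take the "host" field of each shard from sh.status() and try to resolve every member name on every machine in the cluster. The following is a minimal sketch under that assumption. The helper names `parse_shard_host` and `unresolved_hosts` are hypothetical, and falling back to port 27017 when no port is given is an assumption based on the mongod default.

```python
import socket
from typing import Callable, List, Optional, Tuple

Member = Tuple[str, int]


def parse_shard_host(host_field: str) -> Tuple[Optional[str], List[Member]]:
    """Split 'setName/host1:port1,host2:port2' into (setName, [(host, port), ...])."""
    set_name, separator, members = host_field.partition("/")
    if not separator:
        # No replica set prefix: the whole field is the member list.
        set_name, members = None, host_field
    parsed = []
    for member in members.split(","):
        member = member.strip()
        host, _, port = member.rpartition(":")
        if not host:
            # No port in the entry; assume the mongod default.
            host, port = member, "27017"
        parsed.append((host, int(port)))
    return set_name, parsed


def unresolved_hosts(
    members: List[Member], resolver: Callable = socket.getaddrinfo
) -> List[str]:
    """Return the host names that the local resolver cannot look up."""
    missing = []
    for host, port in members:
        try:
            resolver(host, port)
        except socket.gaierror:
            missing.append(host)
    return missing


if __name__ == "__main__":
    # Shard host string copied from the sh.status() output in this ticket.
    field = "cwset/cw-mongodb1-test:27017,cw-mongodb2-test:27017,cw-mongodb3-test:27017"
    name, members = parse_shard_host(field)
    print(name, "unresolved:", unresolved_hosts(members))
```

The check has to be run on each server separately, because a hosts-file entry can be present on one machine and missing on another, which is the situation described above.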

Comment by Spencer Brody (Inactive) [ 19/Oct/12 ]

Can you include a larger section of the log, with at least 10 minutes to either side of the error?

This looks a lot like a problem with the connection between the shard mongods and the config servers. Can you double-check that it is possible to connect to the config servers via the mongo shell from the shard mongod servers?
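A rough way to script that connectivity check is to attempt a plain TCP connection to each config server from each shard machine. The sketch below is not a substitute for connecting with the mongo shell: it only shows that the name resolves and the port accepts connections. The config server names in it are placeholders, since the ticket does not list them, and `can_connect` and `unreachable` are hypothetical helper names.

```python
import socket
from typing import List, Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name-resolution failures, refused connections, and timeouts.
        return False


def unreachable(servers: List[Tuple[str, int]], timeout: float = 3.0) -> List[str]:
    """Return 'host:port' for every server that could not be reached."""
    return [
        f"{host}:{port}"
        for host, port in servers
        if not can_connect(host, port, timeout)
    ]


if __name__ == "__main__":
    # Placeholder names: replace with the hosts and ports passed to mongos
    # in its config server list.
    config_servers = [
        ("config1.example.net", 27019),
        ("config2.example.net", 27019),
        ("config3.example.net", 27019),
    ]
    print("unreachable:", unreachable(config_servers))
```

Using the exact host names from the mongos and replica set configuration matters here, because a server can be reachable by IP address or by another alias while the configured name fails.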

Comment by Stan McQueen [ 19/Oct/12 ]

The problem still exists if I enable sharding on a collection. Once I do that, no further queries on that collection succeed. The only remedy is to delete the config server databases and the local databases on the replica set members, restart everything, re-establish the replica set, and create the shard set again. As long as I don't actually enable sharding for a collection, the mongoses, mongods, and mongodbs respond normally. All servers are reachable from all other servers. Currently I am running with a single shard of one replica set consisting of three servers, three config dbs, and two mongoses. Since restarting everything, I have created the shard, adding the replica set to it, but have not enabled sharding for any of the databases or collections. In that mode, everything works fine. However, this is a load-testing cluster, and I would like to be able to test the effects of sharding on the ability to handle load (before our production servers actually get to the point of having high loads).

Our production servers are running MongoDB version 2.0.2 with a single shard of a replica set of three servers. Sharding is enabled on several databases/collections and these servers do not exhibit these symptoms.

My first assumption was that I had somehow screwed up the configuration, since I couldn't believe a major sharding bug could have gotten out into the wild, but my load-test config is virtually identical to the production config.

Comment by Spencer Brody (Inactive) [ 19/Oct/12 ]

Hi Stan,
Sorry for the delayed response. Are you still experiencing this problem? If so, have you tried bouncing your mongoses? Can you confirm that you can create a connection to each of your config servers from the machine that is running your mongos as well as from each of the machines in your replica set?

Comment by Stan McQueen [ 09/Oct/12 ]

By the way, the same symptoms occur with a sharded environment consisting of a single replication set.

Comment by Stan McQueen [ 09/Oct/12 ]

Here is the query and response:
mongos> use licensedb;
switched to db licensedb
mongos> db.licenses.findOne();
Tue Oct 9 21:46:00 uncaught exception: error {
"$err" : "setShardVersion failed host: cw-mongodb1-test:27017 { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), errmsg: \"exception: all servers down!\", code: 8002, ok: 0.0 }",
"code" : 10429
}
==============================================
Here is the output of sh.status():
mongos> sh.status();
--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
    { "_id" : "cwset", "host" : "cwset/cw-mongodb1-test:27017,cw-mongodb2-test:27017,cw-mongodb3-test:27017" }
    { "_id" : "cwset-2", "host" : "cwset-2/cw-mongodb1-2-test:27017,cw-mongodb2-2-test:27017,cw-mongodb3-2-test:27017" }
  databases:
    { "_id" : "admin", "partitioned" : false, "primary" : "config" }
    { "_id" : "licensedb", "partitioned" : true, "primary" : "cwset" }
      licensedb.licenses chunks:
        cwset 1
        { "u" : { $minKey : 1 } } -->> { "u" : { $maxKey : 1 } } on : cwset Timestamp(1000, 0)
      licensedb.seats chunks:
        cwset 1
        { "_machineHash" : { $minKey : 1 } } -->> { "_machineHash" : { $maxKey : 1 } } on : cwset Timestamp(1000, 0)
    { "_id" : "oauthdb", "partitioned" : false, "primary" : "cwset" }
    { "_id" : "imagedb", "partitioned" : false, "primary" : "cwset" }
    { "_id" : "certdb", "partitioned" : false, "primary" : "cwset" }
    { "_id" : "policydb", "partitioned" : true, "primary" : "cwset" }
      policydb.policies chunks:
        cwset 1
        { "u" : { $minKey : 1 } } -->> { "u" : { $maxKey : 1 } } on : cwset Timestamp(1000, 0)
    { "_id" : "settingsdb", "partitioned" : false, "primary" : "cwset" }
    { "_id" : "statistics", "partitioned" : false, "primary" : "cwset" }
    { "_id" : "licenseb", "partitioned" : false, "primary" : "cwset-2" }

=======================================================================
Here is the log from mongos starting before the query and ending after the query:
/usr/local/sbin/mongos(_ZN5mongo27ParallelSortClusteredCursor9startInitEv+0xe11) [0x763581]
/usr/local/sbin/mongos(_ZN5mongo27ParallelSortClusteredCursor8fullInitEv+0x9) [0x769979]
/usr/local/sbin/mongos(_ZN5mongo13ShardStrategy7queryOpERNS_7RequestE+0x472) [0x62a3f2]
/usr/local/sbin/mongos(_ZN5mongo7Request7processEi+0x1fb) [0x5c346b]
/usr/local/sbin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71) [0x5003f1]
/usr/local/sbin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x411) [0x6b3731]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f8936d04e9a]
Tue Oct 9 21:46:00 [conn1879] warning: db exception when initializing on cwset:cwset/cw-mongodb1-test:27017,cw-mongodb2-test:27017,cw-mongodb3-test:27017, current connection state is { state: { conn: "cwset/cw-mongodb1-test:27017,cw-mongodb2-test:27017,cw-mongodb3-test:27017", vinfo: "licensedb.licenses @ 1|0||507499c3e2c7beb86f78c174", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 10429 setShardVersion failed host: cw-mongodb1-test:27017 { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), errmsg: "exception: all servers down!", code: 8002, ok: 0.0 }
Tue Oct 9 21:46:00 [conn1879] AssertionException while processing op type : 2004 to : licensedb.licenses :: caused by :: 10429 setShardVersion failed host: cw-mongodb1-test:27017 { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), errmsg: "exception: all servers down!", code: 8002, ok: 0.0 }
Tue Oct 9 21:46:00 [conn1879] Request::process begin ns: admin.$cmd msg id: 59 op: 2004 attempt: 0
Tue Oct 9 21:46:00 [conn1879] single query: admin.$cmd { replSetGetStatus: 1.0, forShell: 1.0 } ntoreturn: -1 options : 0
Tue Oct 9 21:46:00 [conn1879] Request::process end ns: admin.$cmd msg id: 59 op: 2004

Comment by Spencer Brody (Inactive) [ 09/Oct/12 ]

Can you attach the output of running sh.status() in the shell?

Can you also attach logs from the mongos from when you try to run a query on a sharded collection and see this error message instead?
