-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
Affects Version/s: 3.2.3
-
Component/s: Sharding
-
ALL
-
Sharding 12 (04/01/16), Sharding 13 (04/22/16)
We have a 3-member config replica set (rs0) plus two data replica sets (rs1 and rs2).
For two of the existing sharded collections, mongos fails MANY operations with "None of the hosts for replica set rs0 could be contacted."
For any new sharded collection, operations ALWAYS fail with "None of the hosts for replica set rs0 could be contacted.", although the commands to create an index and to shard the collection succeed.
I don't know how long this has been broken: we had an insert-only workload, which seems to work on the existing collections, whereas 'find' operations don't always work.
The servers were running 3.2.1 when I first encountered the problem. I upgraded the config nodes and mongos to 3.2.3 and the problem persists.
db.items_views.find()
Errors with "None of the hosts for replica set rs0 could be contacted."
db.items_views.find({"_id.u":"HRSHHXNVLX"})
Works, got results
db.items_views.find({"_id.u":"PLXYOAOBYD"})
Errors with "None of the hosts for replica set rs0 could be contacted."
db.items_views.aggregate([{$match:{"_id.u":"PLXYOAOBYD"}}])
Works, got results (same criteria as the find above that failed)
To summarize: 'aggregate' works with any match criteria and any operations and yields accurate results (as best I can tell), while 'find' works only with some criteria.
I can also verify that insert operations work on this collection (this is a production server), because I can see new documents being inserted into the collection via
db.items_views.aggregate([{$sort:{t:-1}}])
('t' being the field that records when the doc was inserted). Docs matching the criteria that fail on 'find' are also being inserted.
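To make the find/aggregate discrepancy easier to reproduce across many keys, the comparison above could be scripted. This is only a rough sketch: `compareFindAndAggregate` is a hypothetical helper (not part of the shell), and it assumes the shell throws when mongos returns an error document, as it does in the sessions shown in this report.

```javascript
// Runs the same equality match through find() and aggregate() for each value
// and records which of the two paths fails. `coll` is a shell collection such
// as db.items_views; `field` and `values` are supplied by the caller.
function compareFindAndAggregate(coll, field, values) {
  // Executes one attempt and converts a thrown shell error into a result row.
  function attempt(run) {
    try {
      return { ok: true, count: run() };
    } catch (e) {
      return { ok: false, error: String(e.message || e) };
    }
  }

  return values.map(function (value) {
    var query = {};
    query[field] = value;
    return {
      value: value,
      // itcount() iterates the cursor, so the query is really executed.
      find: attempt(function () {
        return coll.find(query).itcount();
      }),
      aggregate: attempt(function () {
        return coll.aggregate([{ $match: query }]).itcount();
      })
    };
  });
}

// Example, in the mongo shell connected to mongos:
// printjson(compareFindAndAggregate(db.items_views, "_id.u",
//                                   ["HRSHHXNVLX", "PLXYOAOBYD"]));
```

Each returned row shows, for one key, whether 'find' and 'aggregate' succeeded and how many documents each returned, which makes it easy to list exactly the keys where only 'find' fails.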
mongos> db.zzz.createIndex({b:1})
{
  "raw" : {
    "rs1/mongodb-shard-1a:27018,mongodb-shard-1b:27018" : {
      "createdCollectionAutomatically" : true,
      "numIndexesBefore" : 1,
      "numIndexesAfter" : 2,
      "ok" : 1,
      "$gleStats" : {
        "lastOpTime" : Timestamp(1457118410, 2),
        "electionId" : ObjectId("569eddaf0000000000000002")
      }
    }
  },
  "ok" : 1
}
mongos> db.zzz.find()
mongos> db.zzz.insert({b:1})
WriteResult({ "nInserted" : 1 })
mongos> db.zzz.find()
{ "_id" : ObjectId("56d9dd0aba83a6081b06f52c"), "b" : 1 }
mongos> use admin
switched to db admin
mongos> db.runCommand({shardCollection:"squid.zzz", key:{b:1}})
{ "collectionsharded" : "squid.zzz", "ok" : 1 }
mongos> use squid
switched to db squid
mongos> db.zzz.find()
Error: error: {
  "ok" : 0,
  "errmsg" : "None of the hosts for replica set rs0 could be contacted.",
  "code" : 71
}
mongos> db.zzz.insert({b:2})
WriteResult({
  "nInserted" : 0,
  "writeError" : {
    "code" : 82,
    "errmsg" : "no progress was made executing batch write op in squid.zzz after 5 rounds (0 ops completed in 6 rounds total)"
  }
})
2016-03-04T19:08:38.316+0000 I COMMAND [conn1] CMD: shardcollection: { shardCollection: "squid.zzz", key: { b: 1.0 } }
2016-03-04T19:08:38.324+0000 I SHARDING [conn1] distributed lock 'squid.zzz' acquired for 'shardCollection', ts : 56d9dd36ec32ad7738d81095
2016-03-04T19:08:38.327+0000 I SHARDING [conn1] about to log metadata event into changelog: { _id: "ip-172-31-21-1-2016-03-04T19:08:38.327+0000-56d9dd36ec32ad7738d81096", server: "ip-172-31-21-1", clientAddr: "127.0.0.1:59936", time: new Date(1457118518327), what: "shardCollection.start", ns: "squid.zzz", details: { shardKey: { b: 1.0 }, collection: "squid.zzz", primary: "rs1:rs1/mongodb-shard-1a:27018,mongodb-shard-1b:27018", initShards: [], numChunks: 1 } }
2016-03-04T19:08:38.331+0000 I SHARDING [conn1] going to create 1 chunk(s) for: squid.zzz using new epoch 56d9dd36ec32ad7738d81097
2016-03-04T19:08:38.338+0000 I SHARDING [conn1] ChunkManager: time to load chunks for squid.zzz: 2ms sequenceNumber: 16 version: 1|0||56d9dd36ec32ad7738d81097 based on: (empty)
2016-03-04T19:08:38.343+0000 I SHARDING [conn1] about to log metadata event into changelog: { _id: "ip-172-31-21-1-2016-03-04T19:08:38.343+0000-56d9dd36ec32ad7738d81098", server: "ip-172-31-21-1", clientAddr: "127.0.0.1:59936", time: new Date(1457118518343), what: "shardCollection.end", ns: "squid.zzz", details: { version: "1|0||56d9dd36ec32ad7738d81097" } }
2016-03-04T19:08:38.352+0000 I SHARDING [conn1] distributed lock with ts: 56d9dd36ec32ad7738d81095' unlocked.
The only problem I can see in the mongos log is this warning at startup:
2016-03-04T18:52:06.727+0000 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: LockStateChangeFailed: findAndModify query predicate didn't match any lock document
but shortly afterwards it also says:
2016-03-04T18:52:06.755+0000 I SHARDING [Balancer] config servers and shards contacted successfully
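For reference, pulling out only the warnings and errors from a mongos log like the one above can be sketched as follows. The helper names (`parseLogLine`, `warningsAndErrors`) are hypothetical, and the pattern assumes the plain-text line layout seen in these excerpts: timestamp, one-letter severity, component, bracketed context, message.

```javascript
// Matches lines of the form seen in this report:
//   <timestamp> <severity> <component> [<context>] <message>
var LOG_LINE = /^(\S+)\s+([FEWID])\s+(\S+)\s+\[([^\]]+)\]\s+(.*)$/;

// Splits one log line into its fields; returns null if it does not match.
function parseLogLine(line) {
  var m = LOG_LINE.exec(line);
  if (!m) {
    return null;
  }
  return {
    timestamp: m[1],
    severity: m[2],   // F = fatal, E = error, W = warning, I = info, D = debug
    component: m[3],
    context: m[4],
    message: m[5]
  };
}

// Keeps only fatal, error, and warning entries from an array of log lines.
function warningsAndErrors(lines) {
  return lines.map(parseLogLine).filter(function (entry) {
    return entry !== null && "FEW".indexOf(entry.severity) !== -1;
  });
}
```

Applied to the two startup lines quoted here, this would keep the `replSetDistLockPinger` warning and drop the informational `Balancer` line.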
The "None of the hosts for replica set rs0 could be contacted." message is clearly misleading: this does not look like a connectivity issue, but rather something else. Please advise on how to debug this issue.
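One thing that might help narrow this down is to look at how mongos itself currently sees the rs0 members. The sketch below is not a confirmed diagnostic for this bug; `unreachableHosts` is a hypothetical helper, and it assumes the `connPoolStats` output contains a `replicaSets.<name>.hosts` array whose entries carry `addr` and `ok` fields, which may vary between server versions.

```javascript
// `stats` is expected to be the document returned on a mongos by
//   db.adminCommand({ connPoolStats: 1 })
// ASSUMPTION: stats.replicaSets[setName].hosts is an array of
// { addr: "...", ok: true|false, ... } entries.
function unreachableHosts(stats, setName) {
  var sets = stats.replicaSets || {};
  var set = sets[setName];
  if (!set || !set.hosts) {
    // mongos reports nothing for that replica set name.
    return null;
  }
  return set.hosts
    .filter(function (host) { return !host.ok; })
    .map(function (host) { return host.addr; });
}

// Example, in the mongo shell connected to mongos:
// printjson(unreachableHosts(db.adminCommand({ connPoolStats: 1 }), "rs0"));
```

An empty array would suggest mongos considers every rs0 member reachable, which would support the view that the error text does not reflect an actual network problem.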
- duplicates
-
SERVER-23192 mongos and shards will become unusable if contact is lost with all CSRS config server nodes for more than 30 consecutive failed attempts to contact
- Closed
- is related to
-
SERVER-23844 Distributed lock manager should not schedule unlock if the lock acquisition did not result in a write
- Closed
-
SERVER-9552 when replica set member has full disk, step down to (sec|rec)?
- Backlog