  Core Server / SERVER-22971

Operations on some sharded collections fail with bogus error

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.2.3
    • Component/s: Sharding
    • Labels: None
    • Operating System: ALL
    • Sprint: Sharding 12 (04/01/16), Sharding 13 (04/22/16)

      We have a 3-member config server replica set rs0 plus data replica sets rs1 and rs2.
      For two of the existing (sharded) collections, mongos errors on MANY operations with "None of the hosts for replica set rs0 could be contacted."
      For any new sharded collection it ALWAYS errors with "None of the hosts for replica set rs0 could be contacted.", although the commands to create an index and to shard the collection succeed.
      I don't know how long it has been broken, because we had an insert-only workload which seems to work on the existing collections; however, 'find' operations don't always work.
      The servers were running 3.2.1 when I first encountered the problem; I upgraded the config nodes and mongos to 3.2.3 and the problem persists.

      Existing collection
      db.items_views.find()
      Errors with "None of the hosts for replica set rs0 could be contacted."
      
      db.items_views.find({"_id.u":"HRSHHXNVLX"})
      Works, got results
      
      db.items_views.find({"_id.u":"PLXYOAOBYD"})
      Errors with "None of the hosts for replica set rs0 could be contacted."
      
      db.items_views.aggregate([{$match:{"_id.u":"PLXYOAOBYD"}}])
      Works, got results (same criteria as find above that failed)
      

      To summarize: 'aggregate' works with any match criteria and operators and yields accurate results (as best I can tell), while 'find' works only with some criteria.
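
      Since 'find' only fails for some criteria, it looks more like a routing problem than real connectivity loss. A possible check (standard explain() through mongos; I have not run it, and it may fail with the same error) would be to see which shard each predicate gets routed to:

      // suggested check, not run yet: through mongos the explain output
      // lists the targeted shards under queryPlanner.winningPlan.shards
      db.items_views.find({"_id.u":"HRSHHXNVLX"}).explain()
      db.items_views.find({"_id.u":"PLXYOAOBYD"}).explain()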
      I can also verify that insert ops work on this collection (this is a production server), as I can see new documents being inserted via

      db.items_views.aggregate([{$sort:{t:-1}}])

      ('t' being the field recording when the doc was inserted). Documents matching the criteria that fail on 'find' are also still being inserted.
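
      A possible way to double-check that, assuming the aggregation path keeps working, is to count those documents through aggregate (3.2 has no $count stage, so $group/$sum):

      // count, via the working aggregation path, the documents matching
      // the predicate that fails through find()
      db.items_views.aggregate([
          { $match: { "_id.u": "PLXYOAOBYD" } },
          { $group: { _id: null, n: { $sum: 1 } } }
      ])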

      For a new collection 'zzz':
      mongos> db.zzz.createIndex({b:1})
      {
              "raw" : {
                      "rs1/mongodb-shard-1a:27018,mongodb-shard-1b:27018" : {
                              "createdCollectionAutomatically" : true,
                              "numIndexesBefore" : 1,
                              "numIndexesAfter" : 2,
                              "ok" : 1,
                              "$gleStats" : {
                                      "lastOpTime" : Timestamp(1457118410, 2),
                                      "electionId" : ObjectId("569eddaf0000000000000002")
                              }
                      }
              },
              "ok" : 1
      }
      mongos> db.zzz.find()
      mongos> db.zzz.insert({b:1})
      WriteResult({ "nInserted" : 1 })
      mongos> db.zzz.find()
      { "_id" : ObjectId("56d9dd0aba83a6081b06f52c"), "b" : 1 }
      mongos> use admin
      switched to db admin
      mongos> db.runCommand({shardCollection:"squid.zzz", key:{b:1}})
      { "collectionsharded" : "squid.zzz", "ok" : 1 }
      mongos> use squid
      switched to db squid
      mongos> db.zzz.find()
      Error: error: {
              "ok" : 0,
              "errmsg" : "None of the hosts for replica set rs0 could be contacted.",
              "code" : 71
      }
      mongos> db.zzz.insert({b:2})
      WriteResult({
              "nInserted" : 0,
              "writeError" : {
                      "code" : 82,
                      "errmsg" : "no progress was made executing batch write op in squid.zzz after 5 rounds (0 ops completed in 6 rounds total)"
              }
      })
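
      For what it's worth, the routing metadata for the new collection could also be inspected directly through mongos (config.chunks / config.shards are the standard metadata collections; output not included here) to see whether anything about squid.zzz points at the config replica set rs0 rather than rs1/rs2:

      // suggested metadata check, not run here: which shard owns the single
      // chunk of squid.zzz, and what connection strings the shards register
      db.getSiblingDB("config").chunks.find({ ns: "squid.zzz" }).pretty()
      db.getSiblingDB("config").shards.find().pretty()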
      
      mongos log during sharding of the new collection:
      2016-03-04T19:08:38.316+0000 I COMMAND  [conn1] CMD: shardcollection: { shardCollection: "squid.zzz", key: { b: 1.0 } }
      2016-03-04T19:08:38.324+0000 I SHARDING [conn1] distributed lock 'squid.zzz' acquired for 'shardCollection', ts : 56d9dd36ec32ad7738d81095
      2016-03-04T19:08:38.327+0000 I SHARDING [conn1] about to log metadata event into changelog: { _id: "ip-172-31-21-1-2016-03-04T19:08:38.327+0000-56d9dd36ec32ad7738d81096", server: "ip-172-31-21-1", clientAddr: "127.0.0.1:59936", time: new Date(1457118518327), what: "shardCollection.start", ns: "squid.zzz", details: { shardKey: { b: 1.0 }, collection: "squid.zzz", primary: "rs1:rs1/mongodb-shard-1a:27018,mongodb-shard-1b:27018", initShards: [], numChunks: 1 } }
      2016-03-04T19:08:38.331+0000 I SHARDING [conn1] going to create 1 chunk(s) for: squid.zzz using new epoch 56d9dd36ec32ad7738d81097
      2016-03-04T19:08:38.338+0000 I SHARDING [conn1] ChunkManager: time to load chunks for squid.zzz: 2ms sequenceNumber: 16 version: 1|0||56d9dd36ec32ad7738d81097 based on: (empty)
      2016-03-04T19:08:38.343+0000 I SHARDING [conn1] about to log metadata event into changelog: { _id: "ip-172-31-21-1-2016-03-04T19:08:38.343+0000-56d9dd36ec32ad7738d81098", server: "ip-172-31-21-1", clientAddr: "127.0.0.1:59936", time: new Date(1457118518343), what: "shardCollection.end", ns: "squid.zzz", details: { version: "1|0||56d9dd36ec32ad7738d81097" } }
      2016-03-04T19:08:38.352+0000 I SHARDING [conn1] distributed lock with ts: 56d9dd36ec32ad7738d81095' unlocked.
      

      The only error I can see in the log is at startup:

      2016-03-04T18:52:06.727+0000 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: LockStateChangeFailed: findAndModify query predicate didn't match any lock document

      but afterwards it also says:

      2016-03-04T18:52:06.755+0000 I SHARDING [Balancer] config servers and shards contacted successfully

      Obviously the "None of the hosts for replica set rs0 could be contacted." error message is nonsense and this is not a connectivity issue, but rather something else. Please advise on how to debug this issue.
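
      In case it helps, two standard commands that could be run against the mongos to inspect or reset its view of the cluster (I have not run these yet; flushRouterConfig only makes the router reload its cached sharding metadata):

      // not run yet: force mongos to reload its cached routing metadata
      db.adminCommand({ flushRouterConfig: 1 })
      // not run yet: the "replicaSets" section of connPoolStats shows which
      // hosts this mongos currently associates with rs0/rs1/rs2
      db.adminCommand({ connPoolStats: 1 })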

            Assignee:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Reporter:
            cikovd Robert Bandl
            Votes:
            0
            Watchers:
            9
