Core Server / SERVER-22971

Operations on some sharded collections fail with bogus error

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.2.3
    • Component/s: Sharding
    • Operating System: ALL
    • Sprint: Sharding 12 (04/01/16), Sharding 13 (04/22/16)

      We have a 3-member config server replica set rs0 plus data replica sets rs1 and rs2.
      For two of the existing (sharded) collections, mongos fails MANY operations with "None of the hosts for replica set rs0 could be contacted."
      For any new sharded collection it ALWAYS fails with "None of the hosts for replica set rs0 could be contacted.", although the commands to create an index and to shard the collection succeed.
      I don't know how long it has been broken, because our workload is insert-only and inserts seem to work on the existing collections; however, 'find' operations don't always work.
      The servers were running 3.2.1 when I first encountered the problem; I upgraded the config nodes and mongos to 3.2.3 and the problem persists.
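      For reference, the topology can be confirmed from this mongos with the standard shell helpers (just a sketch of the commands, output omitted):

      mongos> sh.status()
      mongos> use admin
      mongos> db.runCommand({ listShards: 1 })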

      Existing collection
      db.items_views.find()
      Errors with "None of the hosts for replica set rs0 could be contacted."
      
      db.items_views.find({"_id.u":"HRSHHXNVLX"})
      Works, got results
      
      db.items_views.find({"_id.u":"PLXYOAOBYD"})
      Errors with "None of the hosts for replica set rs0 could be contacted."
      
      db.items_views.aggregate([{$match:{"_id.u":"PLXYOAOBYD"}}])
      Works, got results (same criteria as find above that failed)
      

      To summarize: 'aggregate' works with any match criteria and operators and yields accurate results (as best I can tell), while 'find' works only with some criteria.
      I can also verify that insert operations work on this collection (this is a production server), since I can see new documents being inserted via

      db.items_views.aggregate([{$sort:{t:-1}}])

      ('t' being the field that records when the document was inserted). Also, documents matching the criteria that fail on 'find' are still being inserted.
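      One way to see which chunk (and therefore which shard) a failing value should route to is to read the routing table directly from the config database through mongos. A sketch, assuming the namespace is squid.items_views (same database as the zzz example below):

      mongos> use config
      mongos> // chunk ranges and owning shards for the collection
      mongos> db.chunks.find({ ns: "squid.items_views" }, { min: 1, max: 1, shard: 1 }).sort({ min: 1 })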

      For a new collection 'zzz':
      mongos> db.zzz.createIndex({b:1})
      {
              "raw" : {
                      "rs1/mongodb-shard-1a:27018,mongodb-shard-1b:27018" : {
                              "createdCollectionAutomatically" : true,
                              "numIndexesBefore" : 1,
                              "numIndexesAfter" : 2,
                              "ok" : 1,
                              "$gleStats" : {
                                      "lastOpTime" : Timestamp(1457118410, 2),
                                      "electionId" : ObjectId("569eddaf0000000000000002")
                              }
                      }
              },
              "ok" : 1
      }
      mongos> db.zzz.find()
      mongos> db.zzz.insert({b:1})
      WriteResult({ "nInserted" : 1 })
      mongos> db.zzz.find()
      { "_id" : ObjectId("56d9dd0aba83a6081b06f52c"), "b" : 1 }
      mongos> use admin
      switched to db admin
      mongos> db.runCommand({shardCollection:"squid.zzz", key:{b:1}})
      { "collectionsharded" : "squid.zzz", "ok" : 1 }
      mongos> use squid
      switched to db squid
      mongos> db.zzz.find()
      Error: error: {
              "ok" : 0,
              "errmsg" : "None of the hosts for replica set rs0 could be contacted.",
              "code" : 71
      }
      mongos> db.zzz.insert({b:2})
      WriteResult({
              "nInserted" : 0,
              "writeError" : {
                      "code" : 82,
                      "errmsg" : "no progress was made executing batch write op in squid.zzz after 5 rounds (0 ops completed in 6 rounds total)"
              }
      })
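      For what it's worth, the routing metadata mongos has cached for the new collection can be inspected (and reset) with these standard admin commands; this is just a sketch, not output from my cluster:

      mongos> use admin
      mongos> db.runCommand({ getShardVersion: "squid.zzz" })   // version mongos has cached for the namespace
      mongos> db.runCommand({ flushRouterConfig: 1 })           // drop the cached routing table; it reloads on the next operation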
      
      mongos log while sharding the new collection:
      2016-03-04T19:08:38.316+0000 I COMMAND  [conn1] CMD: shardcollection: { shardCollection: "squid.zzz", key: { b: 1.0 } }
      2016-03-04T19:08:38.324+0000 I SHARDING [conn1] distributed lock 'squid.zzz' acquired for 'shardCollection', ts : 56d9dd36ec32ad7738d81095
      2016-03-04T19:08:38.327+0000 I SHARDING [conn1] about to log metadata event into changelog: { _id: "ip-172-31-21-1-2016-03-04T19:08:38.327+0000-56d9dd36ec32ad7738d81096", server: "ip-172-31-21-1", clientAddr: "127.0.0.1:59936", time: new Date(1457118518327), what: "shardCollection.start", ns: "squid.zzz", details: { shardKey: { b: 1.0 }, collection: "squid.zzz", primary: "rs1:rs1/mongodb-shard-1a:27018,mongodb-shard-1b:27018", initShards: [], numChunks: 1 } }
      2016-03-04T19:08:38.331+0000 I SHARDING [conn1] going to create 1 chunk(s) for: squid.zzz using new epoch 56d9dd36ec32ad7738d81097
      2016-03-04T19:08:38.338+0000 I SHARDING [conn1] ChunkManager: time to load chunks for squid.zzz: 2ms sequenceNumber: 16 version: 1|0||56d9dd36ec32ad7738d81097 based on: (empty)
      2016-03-04T19:08:38.343+0000 I SHARDING [conn1] about to log metadata event into changelog: { _id: "ip-172-31-21-1-2016-03-04T19:08:38.343+0000-56d9dd36ec32ad7738d81098", server: "ip-172-31-21-1", clientAddr: "127.0.0.1:59936", time: new Date(1457118518343), what: "shardCollection.end", ns: "squid.zzz", details: { version: "1|0||56d9dd36ec32ad7738d81097" } }
      2016-03-04T19:08:38.352+0000 I SHARDING [conn1] distributed lock with ts: 56d9dd36ec32ad7738d81095' unlocked.
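      For completeness, a sketch of how the metadata written by this step can be checked in the config database (standard config collections):

      mongos> use config
      mongos> db.collections.find({ _id: "squid.zzz" })
      mongos> db.chunks.find({ ns: "squid.zzz" })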
      

      The only error I can see in the log is at startup:

      2016-03-04T18:52:06.727+0000 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: LockStateChangeFailed: findAndModify query predicate didn't match any lock document

      but afterwards it also says:

      2016-03-04T18:52:06.755+0000 I SHARDING [Balancer] config servers and shards contacted successfully

      Obviously the "None of the hosts for replica set rs0 could be contacted." error message is nonsense here; this is not a connectivity issue but something else. Please advise on how to debug this issue.
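      In case it helps, I can raise the sharding log verbosity on the mongos to capture more detail around the failing operations (a sketch using the standard setLogLevel helper, equivalent to the logComponentVerbosity parameter):

      mongos> db.setLogLevel(2, "sharding")
      mongos> // or via setParameter:
      mongos> db.adminCommand({ setParameter: 1, logComponentVerbosity: { sharding: { verbosity: 2 } } })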

            Assignee: Kaloian Manassiev (kaloian.manassiev@mongodb.com)
            Reporter: Robert Bandl (cikovd)
            Votes: 0
            Watchers: 9

              Created:
              Updated:
              Resolved: