[SERVER-22971] Operations on some sharded collections fail with bogus error Created: 04/Mar/16  Updated: 21/Apr/16  Resolved: 21/Apr/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Robert Bandl Assignee: Kaloian Manassiev
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-23192 mongos and shards will become unusabl... Closed
Related
is related to SERVER-23844 Distributed lock manager should not s... Closed
is related to SERVER-9552 when replica set member has full disk... Backlog
Operating System: ALL
Sprint: Sharding 12 (04/01/16), Sharding 13 (04/22/16)
Participants:

 Description   

We have a 3-member config server replica set rs0, plus data replica sets rs1 and rs2.
For two of our existing (sharded) collections, mongos returns the error "None of the hosts for replica set rs0 could be contacted." for MANY operations.
For any new sharded collection it ALWAYS returns "None of the hosts for replica set rs0 could be contacted.", although the commands to create an index and shard the collection succeed.
I don't know how long it has been broken, because we have an insert-only workload which seems to work on the existing collections; however, 'find' operations don't always work.
The servers were running 3.2.1 when I first encountered the problem; I upgraded the config nodes and mongos to 3.2.3 and the problem persists.

Existing collection

db.items_views.find()
Errors with "None of the hosts for replica set rs0 could be contacted."
 
db.items_views.find({"_id.u":"HRSHHXNVLX"})
Works, got results
 
db.items_views.find({"_id.u":"PLXYOAOBYD"})
Errors with "None of the hosts for replica set rs0 could be contacted."
 
db.items_views.aggregate([{$match:{"_id.u":"PLXYOAOBYD"}}])
Works, got results (same criteria as find above that failed)

To summarize, 'aggregate' works with any match criteria and operations and yields accurate results (as best I can tell), while 'find' works only with some criteria.
I can also verify that insert operations work on this collection (this is a production server), as I can see new documents being inserted via

db.items_views.aggregate([{$sort:{t:-1}}])

('t' being the time the document was inserted). Documents matching criteria that fail on 'find' are also being inserted.

For a new collection 'zzz':

mongos> db.zzz.createIndex({b:1})
{
        "raw" : {
                "rs1/mongodb-shard-1a:27018,mongodb-shard-1b:27018" : {
                        "createdCollectionAutomatically" : true,
                        "numIndexesBefore" : 1,
                        "numIndexesAfter" : 2,
                        "ok" : 1,
                        "$gleStats" : {
                                "lastOpTime" : Timestamp(1457118410, 2),
                                "electionId" : ObjectId("569eddaf0000000000000002")
                        }
                }
        },
        "ok" : 1
}
mongos> db.zzz.find()
mongos> db.zzz.insert({b:1})
WriteResult({ "nInserted" : 1 })
mongos> db.zzz.find()
{ "_id" : ObjectId("56d9dd0aba83a6081b06f52c"), "b" : 1 }
mongos> use admin
switched to db admin
mongos> db.runCommand({shardCollection:"squid.zzz", key:{b:1}})
{ "collectionsharded" : "squid.zzz", "ok" : 1 }
mongos> use squid
switched to db squid
mongos> db.zzz.find()
Error: error: {
        "ok" : 0,
        "errmsg" : "None of the hosts for replica set rs0 could be contacted.",
        "code" : 71
}
mongos> db.zzz.insert({b:2})
WriteResult({
        "nInserted" : 0,
        "writeError" : {
                "code" : 82,
                "errmsg" : "no progress was made executing batch write op in squid.zzz after 5 rounds (0 ops completed in 6 rounds total)"
        }
})

mongos log while sharding the new collection:

2016-03-04T19:08:38.316+0000 I COMMAND  [conn1] CMD: shardcollection: { shardCollection: "squid.zzz", key: { b: 1.0 } }
2016-03-04T19:08:38.324+0000 I SHARDING [conn1] distributed lock 'squid.zzz' acquired for 'shardCollection', ts : 56d9dd36ec32ad7738d81095
2016-03-04T19:08:38.327+0000 I SHARDING [conn1] about to log metadata event into changelog: { _id: "ip-172-31-21-1-2016-03-04T19:08:38.327+0000-56d9dd36ec32ad7738d81096", server: "ip-172-31-21-1", clientAddr: "127.0.0.1:59936", time: new Date(1457118518327), what: "shardCollection.start", ns: "squid.zzz", details: { shardKey: { b: 1.0 }, collection: "squid.zzz", primary: "rs1:rs1/mongodb-shard-1a:27018,mongodb-shard-1b:27018", initShards: [], numChunks: 1 } }
2016-03-04T19:08:38.331+0000 I SHARDING [conn1] going to create 1 chunk(s) for: squid.zzz using new epoch 56d9dd36ec32ad7738d81097
2016-03-04T19:08:38.338+0000 I SHARDING [conn1] ChunkManager: time to load chunks for squid.zzz: 2ms sequenceNumber: 16 version: 1|0||56d9dd36ec32ad7738d81097 based on: (empty)
2016-03-04T19:08:38.343+0000 I SHARDING [conn1] about to log metadata event into changelog: { _id: "ip-172-31-21-1-2016-03-04T19:08:38.343+0000-56d9dd36ec32ad7738d81098", server: "ip-172-31-21-1", clientAddr: "127.0.0.1:59936", time: new Date(1457118518343), what: "shardCollection.end", ns: "squid.zzz", details: { version: "1|0||56d9dd36ec32ad7738d81097" } }
2016-03-04T19:08:38.352+0000 I SHARDING [conn1] distributed lock with ts: 56d9dd36ec32ad7738d81095' unlocked.

The only error I can see in the log is at startup:

2016-03-04T18:52:06.727+0000 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: LockStateChangeFailed: findAndModify query predicate didn't match any lock document

but it also says afterwards:

2016-03-04T18:52:06.755+0000 I SHARDING [Balancer] config servers and shards contacted successfully

Obviously "None of the hosts for replica set rs0 could be contacted." error message is nonsense and this is not a connectivity issue, but rather something else. Please advise on how to debug this issue.



 Comments   
Comment by Kaloian Manassiev [ 21/Apr/16 ]

Hi cikovd,

Thank you again for reporting this issue and for your patience while we investigate.

Based on our investigation, there are two things we can improve here in order to prevent this problem from happening in the future, and I have filed tickets for both. The first (and larger) problem is described in SERVER-23192. Basically, if the replica set disappears for more than 30 seconds, mongos and other mongod nodes will deem it permanently failed and from then on will keep reporting "None of the hosts for replica set rs0 could be contacted". We are currently working on improving this behaviour.
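
In the meantime, a rough way to see what a router currently thinks of each replica set is the connPoolStats command, whose output includes the replica set monitors' view of the hosts (a diagnostic sketch only; it does not change the behaviour described above):

// run against the affected mongos; the "replicaSets" section reflects the
// monitor's current view of each set's hosts
db.adminCommand({ connPoolStats: 1 }).replicaSets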

The second problem is described in SERVER-23844; fixing it would have prevented your disks from filling up with that error message.

I am going to resolve this ticket and link it to these two tickets, which you can monitor for progress.

Again, thank you for reporting this problem.

Best regards,
-Kal.

Comment by Robert Bandl [ 10/Mar/16 ]

Well, the resulting output is all

2016-02-14T16:58:32.183+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76197c2c7299929f207fa :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.

at 30-second intervals for a week, with no other messages. It seems there was a connectivity issue between the shard and the config servers, unrelated to mongod, which we didn't notice because the mongos instances had no problem connecting to the shards and config servers from their own hosts. It's all good now, I guess; you can close this ticket, but please make sure logged errors indicate which host is having a problem and why (instead of every node saying 'None of the hosts for replica set rs0 could be contacted.' when the actual error is that the log disk on one of the shards is full).

Comment by Kaloian Manassiev [ 10/Mar/16 ]

Thanks, Robert. The presence of this many locks needing to be released indicates that some task is continuously trying to acquire a lock and failing for some reason, so the list of locks to be freed keeps growing.

Would it be possible for you to grep the mongos log for any lines which contain one of these lock IDs and attach the output? For example:

grep '56b76197c2c7299929f207fa' /var/log/mongodb/mongos.log > 56b76197c2c7299929f207fa.log
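
Separately, it may help to look at what the config servers currently record in the distributed lock collection. A minimal sketch (run via the same mongos, or directly against the config server replica set); whether the lock IDs above show up there depends on whether those acquisitions ever succeeded:

// state 0 = unlocked; anything else is currently held or contended
db.getSiblingDB("config").locks.find({ state: { $ne: 0 } }).pretty()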

I feel we can do a better job of not trying to unlock locks which we know we definitely did not acquire in the first place. This would significantly reduce the size of the list that we need to print.

Thank you very much in advance for your help and patience while we get to the bottom of this.

Best regards,
-Kal.

Comment by Robert Bandl [ 10/Mar/16 ]

The message you referred to as being written to the log every 30 seconds is the 'pinging failed' message, as seen below. It is the 'Failed to unlock lock' message that you need to throttle.

2016-02-14T17:02:02.274+0000 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:02:32.288+0000 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:03:02.302+0000 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.

Comment by Robert Bandl [ 10/Mar/16 ]

Hi, as you can see in the tiny snippet below, the 'Failed to unlock lock' message was definitely not being written at most once every 30 seconds:

2016-02-14T17:03:02.307+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76192c2c7299929f207f2 :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:03:02.307+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76193c2c7299929f207f3 :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:03:02.307+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76193c2c7299929f207f4 :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:03:02.307+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76193c2c7299929f207f5 :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:03:02.307+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76195c2c7299929f207f6 :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:03:02.308+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76196c2c7299929f207f7 :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:03:02.308+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76196c2c7299929f207f8 :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:03:02.308+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76196c2c7299929f207f9 :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:03:02.308+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76197c2c7299929f207fa :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.
2016-02-14T17:03:02.308+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56b76197c2c7299929f207fb :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.

A few per-millisecond warning counts:

grep '2016-02-14T17:03:02.306' /var/log/mongodb/mongod.log | grep 'Failed to unlock lock' | wc -l
123

grep '2016-02-14T17:03:02.307' /var/log/mongodb/mongod.log | grep 'Failed to unlock lock' | wc -l
103

grep '2016-02-14T17:03:02.308' /var/log/mongodb/mongod.log | grep 'Failed to unlock lock' | wc -l
136

grep '2016-02-14T17:03:02.309' /var/log/mongodb/mongod.log | grep 'Failed to unlock lock' | wc -l
133

grep 'Failed to unlock lock' /var/log/mongodb/mongod.log | wc -l
28469058

Comment by Kaloian Manassiev [ 10/Mar/16 ]

Hi cikovd,

The "failed to unlock" warning message should be printed at most once every 30 seconds for a very small number of locks and at this frequency it is unlikely to cause the disk drive to get full. We could certainly make it print less like you suggested, but is it possible that your disk got full for an unrelated reason?

Would it be possible to attach a snippet of the log which includes timestamps and which shows this message getting printed more frequently?

Best regards,
-Kal.

Comment by Andy Schwerin [ 08/Mar/16 ]

3) Ideally, mongod should automatically step down when it can't write to disk because the disk is full

This particular suggestion is a duplicate of SERVER-9552. Feel free to vote, watch and comment there.

I believe kaloian.manassiev is going to respond to your other suggestions under separate cover. Thanks for the report and suggestions.

Comment by Robert Bandl [ 04/Mar/16 ]

Found the problem.

The issue seems to be resolved after telling the primary on rs1 to step down; that node's primary disk drive is full (not the data drive; data is on another drive). mongod had flooded the log, which is on the primary drive, with a gazillion messages like

SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 56c083d5c2c7299929f20b0a :: caused by :: ReplicaSetNotFound None of the hosts for replica set rs0 could be contacted.

It has been writing thousands of messages per second to the log with incrementing ts values; I guess at some point it had a problem connecting, and now the disk has filled up.
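
To be explicit about the recovery step: it was simply making the affected rs1 primary step down so a member with free disk could take over. Roughly (run on that primary; a sketch of the step described above, not a general fix):

// forces the current primary to step down (here for 60 seconds) and triggers an election
rs.stepDown(60)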

I was sure the problem was between mongos and the config servers because of the misleading error message given by mongos.
You should:
1) Have it write such an error message at most once per second instead of filling up the disk on the shard
2) Have mongos indicate that it is relaying an error message from one of the shards rather than reporting its own problem
3) Ideally, mongod should automatically step down when it can't write to disk because the disk is full

and I should improve our platform monitoring.

Thanks for your efforts in resolving this problem.

Comment by Robert Bandl [ 04/Mar/16 ]

Found something very interesting; this is partial sh.status() output on mongos.

sh.status()

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "minCompatibleVersion" : 5,
        "currentVersion" : 6,
        "clusterId" : ObjectId("55f99bf47fd42ee804dcf9c1")
}
  shards:
        {  "_id" : "rs1",  "host" : "rs1/mongodb-shard-1a:27018,mongodb-shard-1b:27018",  "tags" : [ "ITEMS-PL", "ITEMS-IT" ] }
        {  "_id" : "rs2",  "host" : "rs2/mongodb-shard-2a:27018,mongodb-shard-2b:27018",  "tags" : [ "ITEMS-EN", "ITEMS-FR" ] }
  active mongoses:
        "3.2.1" : 2
        "3.2.3" : 1
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                No recent migrations
  databases:
        {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
        {  "_id" : "squid",  "partitioned" : true,  "primary" : "rs1" }
		....... other collections from our application collections, test collections below .....
				squid.items_views
                        shard key: { "_id.u" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                rs1     13
                                rs2     13
                        { "_id.u" : { "$minKey" : 1 } } -->> { "_id.u" : "FRBPWXZVBW" } on : rs2 Timestamp(2, 0)
                        { "_id.u" : "FRBPWXZVBW" } -->> { "_id.u" : "PLAOSHTKWS" } on : rs2 Timestamp(3, 0)
                        { "_id.u" : "PLAOSHTKWS" } -->> { "_id.u" : "PLBKCACKSR" } on : rs2 Timestamp(4, 0)
                        { "_id.u" : "PLBKCACKSR" } -->> { "_id.u" : "PLEABCOYXC" } on : rs2 Timestamp(5, 0)
                        { "_id.u" : "PLEABCOYXC" } -->> { "_id.u" : "PLGELRSJTSKG" } on : rs2 Timestamp(6, 0)
                        { "_id.u" : "PLGELRSJTSKG" } -->> { "_id.u" : "PLJGMETDNN" } on : rs2 Timestamp(7, 0)
                        { "_id.u" : "PLJGMETDNN" } -->> { "_id.u" : "PLJLLLZTYR" } on : rs2 Timestamp(8, 0)
                        { "_id.u" : "PLJLLLZTYR" } -->> { "_id.u" : "PLLTIYWHKG" } on : rs2 Timestamp(9, 0)
                        { "_id.u" : "PLLTIYWHKG" } -->> { "_id.u" : "PLMGXOCFDK" } on : rs2 Timestamp(10, 0)
                        { "_id.u" : "PLMGXOCFDK" } -->> { "_id.u" : "PLNGFRECZI" } on : rs2 Timestamp(11, 0)
                        { "_id.u" : "PLNGFRECZI" } -->> { "_id.u" : "PLPAZOJYSI" } on : rs2 Timestamp(12, 0)
                        { "_id.u" : "PLPAZOJYSI" } -->> { "_id.u" : "PLPCVBELYT" } on : rs2 Timestamp(13, 0)
                        { "_id.u" : "PLPCVBELYT" } -->> { "_id.u" : "PLPRDLBYBX" } on : rs2 Timestamp(14, 0)
                        { "_id.u" : "PLPRDLBYBX" } -->> { "_id.u" : "PLPSMXRHMH" } on : rs1 Timestamp(14, 1)
                        { "_id.u" : "PLPSMXRHMH" } -->> { "_id.u" : "PLTDPTRKTG" } on : rs1 Timestamp(1, 15)
                        { "_id.u" : "PLTDPTRKTG" } -->> { "_id.u" : "PLTIYARYOI" } on : rs1 Timestamp(1, 16)
                        { "_id.u" : "PLTIYARYOI" } -->> { "_id.u" : "PLWIJZNKZX" } on : rs1 Timestamp(1, 17)
                        { "_id.u" : "PLWIJZNKZX" } -->> { "_id.u" : "PLWPUIRWWH" } on : rs1 Timestamp(1, 18)
                        { "_id.u" : "PLWPUIRWWH" } -->> { "_id.u" : "PLXAHLSOXY" } on : rs1 Timestamp(1, 19)
                        { "_id.u" : "PLXAHLSOXY" } -->> { "_id.u" : "PLYWGRURCM" } on : rs1 Timestamp(1, 20)
                        { "_id.u" : "PLYWGRURCM" } -->> { "_id.u" : "SECNZFDUYM" } on : rs1 Timestamp(1, 21)
                        { "_id.u" : "SECNZFDUYM" } -->> { "_id.u" : "SEFNLWWKME" } on : rs1 Timestamp(1, 22)
                        { "_id.u" : "SEFNLWWKME" } -->> { "_id.u" : "SEGJLMKTJC" } on : rs1 Timestamp(1, 23)
                        { "_id.u" : "SEGJLMKTJC" } -->> { "_id.u" : "SEIMWYDAZX" } on : rs1 Timestamp(1, 24)
                        { "_id.u" : "SEIMWYDAZX" } -->> { "_id.u" : "SGDCBDFWIH" } on : rs1 Timestamp(1, 25)
                        { "_id.u" : "SGDCBDFWIH" } -->> { "_id.u" : { "$maxKey" : 1 } } on : rs1 Timestamp(1, 26)
                squid.zzz
                        shard key: { "b" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                rs1     1
                        { "b" : { "$minKey" : 1 } } -->> { "b" : { "$maxKey" : 1 } } on : rs1 Timestamp(1, 0)

Every find with criteria on '_id.u' I tried that resolves to a chunk on rs2 works.
Every find that resolves to rs1, or that has to query both shards, errors with "None of the hosts for replica set rs0 could be contacted."

So:

These work

db.items_views.find({"_id.u":"A"})
db.items_views.find({"_id.u":"FRBPWXZVBW"})
db.items_views.find({"_id.u":"PLPRDLBYBW"})

These throw error

db.items_views.find({"_id.u":"PLPSMXRHMH"})
db.items_views.find({"_id.u":"SECNZFDUYM"})
db.items_views.find({"_id.u":"Z"})

Comment by Robert Bandl [ 04/Mar/16 ]

That's correct. rs0 is the config servers' replica set; rs1 and rs2 are the shard replica sets.

All of the mongos instances are having the same problem. Here's the log from a freshly restarted mongos; the only thing I ran after the restart was db.zzz.find() on the 'squid' db (the test collection I created earlier, as described above), and it errored with "None of the hosts for replica set rs0 could be contacted."

2016-03-04T20:46:37.876+0000 I CONTROL  [main] ***** SERVER RESTARTED *****
2016-03-04T20:46:37.880+0000 I CONTROL  [main] ** WARNING: You are running this process as the root user, which is not recommended.
2016-03-04T20:46:37.880+0000 I CONTROL  [main]
2016-03-04T20:46:37.881+0000 I SHARDING [mongosMain] MongoS version 3.2.3 starting: pid=32130 port=27017 64-bit host=ip-172-31-21-1 (--help for usage)
2016-03-04T20:46:37.881+0000 I CONTROL  [mongosMain] db version v3.2.3
2016-03-04T20:46:37.881+0000 I CONTROL  [mongosMain] git version: b326ba837cf6f49d65c2f85e1b70f6f31ece7937
2016-03-04T20:46:37.881+0000 I CONTROL  [mongosMain] OpenSSL version: OpenSSL 1.0.2d 9 Jul 2015
2016-03-04T20:46:37.881+0000 I CONTROL  [mongosMain] allocator: tcmalloc
2016-03-04T20:46:37.881+0000 I CONTROL  [mongosMain] modules: none
2016-03-04T20:46:37.881+0000 I CONTROL  [mongosMain] build environment:
2016-03-04T20:46:37.881+0000 I CONTROL  [mongosMain]     distmod: ubuntu1404
2016-03-04T20:46:37.881+0000 I CONTROL  [mongosMain]     distarch: x86_64
2016-03-04T20:46:37.881+0000 I CONTROL  [mongosMain]     target_arch: x86_64
2016-03-04T20:46:37.881+0000 I CONTROL  [mongosMain] options: { config: "/etc/mongos.conf", net: { port: 27017 }, processManagement: { fork: true }, sharding: { configDB: "rs0/mongodb-config-a:27019,mongodb-config-b:27019,mongodb-config-c:27019" }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongos.log" } }
2016-03-04T20:46:37.881+0000 I SHARDING [mongosMain] Updating config server connection string to: rs0/mongodb-config-a:27019,mongodb-config-b:27019,mongodb-config-c:27019
2016-03-04T20:46:37.881+0000 I NETWORK  [mongosMain] Starting new replica set monitor for rs0/mongodb-config-a:27019,mongodb-config-b:27019,mongodb-config-c:27019
2016-03-04T20:46:37.881+0000 I NETWORK  [ReplicaSetMonitorWatcher] starting
2016-03-04T20:46:37.882+0000 I SHARDING [thread1] creating distributed lock ping thread for process ip-172-31-21-1:27017:1457124397:2018572209 (sleeping for 30000ms)
2016-03-04T20:46:37.891+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to mongodb-config-a:27019
2016-03-04T20:46:37.892+0000 I NETWORK  [mongosMain] Starting new replica set monitor for rs1/mongodb-shard-1a:27018,mongodb-shard-1b:27018
2016-03-04T20:46:37.892+0000 I NETWORK  [mongosMain] Starting new replica set monitor for rs2/mongodb-shard-2a:27018,mongodb-shard-2b:27018
2016-03-04T20:46:37.894+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to mongodb-config-c:27019
2016-03-04T20:46:37.895+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to mongodb-config-c:27019
2016-03-04T20:46:37.897+0000 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: LockStateChangeFailed: findAndModify query predicate didn't match any lock document
2016-03-04T20:46:37.906+0000 I SHARDING [Balancer] about to contact config servers and shards
2016-03-04T20:46:37.906+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-03-04T20:46:37.914+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to mongodb-shard-1a:27018
2016-03-04T20:46:37.921+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to mongodb-shard-2a:27018
2016-03-04T20:46:37.922+0000 I SHARDING [Balancer] config servers and shards contacted successfully
2016-03-04T20:46:37.922+0000 I SHARDING [Balancer] balancer id: ip-172-31-21-1:27017 started
2016-03-04T20:46:37.929+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to mongodb-config-b:27019
2016-03-04T20:46:37.930+0000 I NETWORK  [mongosMain] waiting for connections on port 27017
2016-03-04T20:46:37.936+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 56d9f42db6e3e1ee1d9ec3e8
...... bunch of "ChunkManager: time to load chunks" for our application collections ......
2016-03-04T20:46:37.961+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to mongodb-config-b:27019
...... bunch of "ChunkManager: time to load chunks" for our application collections ......
2016-03-04T20:46:37.986+0000 I SHARDING [Balancer] ChunkManager: time to load chunks for squid.zyx: 0ms sequenceNumber: 13 version: 1|0||56d9d0a44b9199b1d1f556da based on: (empty)
2016-03-04T20:46:37.987+0000 I SHARDING [Balancer] ChunkManager: time to load chunks for squid.zyxx: 0ms sequenceNumber: 14 version: 1|0||56d9d1224b9199b1d1f556f4 based on: (empty)
2016-03-04T20:46:37.989+0000 I SHARDING [Balancer] ChunkManager: time to load chunks for squid.zyxxx: 1ms sequenceNumber: 15 version: 1|0||56d9d15b4b9199b1d1f55701 based on: (empty)
2016-03-04T20:46:37.989+0000 I SHARDING [Balancer] ChunkManager: time to load chunks for squid.zzz: 0ms sequenceNumber: 16 version: 1|0||56d9dd36ec32ad7738d81097 based on: (empty)
2016-03-04T20:46:38.021+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to mongodb-config-a:27019
2016-03-04T20:46:38.054+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-21-1-2016-03-04T20:46:38.054+0000-56d9f42eb6e3e1ee1d9ec3e9", server: "ip-172-31-21-1", clientAddr: "", time: new Date(1457124398054), what: "balancer.round", ns: "", details: { executionTimeMillis: 130, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-03-04T20:46:38.063+0000 I SHARDING [Balancer] distributed lock with ts: 56d9f42db6e3e1ee1d9ec3e8' unlocked.
2016-03-04T20:46:43.035+0000 I NETWORK  [mongosMain] connection accepted from 127.0.0.1:60086 #1 (1 connection now open)
2016-03-04T20:46:45.142+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Successfully connected to mongodb-shard-1a:27018
2016-03-04T20:46:46.436+0000 I NETWORK  [conn1] end connection 127.0.0.1:60086 (0 connections now open)
2016-03-04T20:46:48.081+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 56d9f438b6e3e1ee1d9ec3ea
2016-03-04T20:46:48.158+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-21-1-2016-03-04T20:46:48.158+0000-56d9f438b6e3e1ee1d9ec3eb", server: "ip-172-31-21-1", clientAddr: "", time: new Date(1457124408158), what: "balancer.round", ns: "", details: { executionTimeMillis: 90, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-03-04T20:46:48.171+0000 I SHARDING [Balancer] distributed lock with ts: 56d9f438b6e3e1ee1d9ec3ea' unlocked.

Comment by Kaloian Manassiev [ 04/Mar/16 ]

Apologies for misreading your original description. Just to make sure I understand correctly - rs0 is your config server's replica set, not one of the shards, correct?

Would it be possible to attach the logs from the mongos instance which is having these problems? I want to search for any messages from the component which monitors the replica sets.

Thank you for your patience while we investigate this problem.

Best regards,
-Kal.

Comment by Robert Bandl [ 04/Mar/16 ]

As I said, I upgraded mongos and all config nodes from 3.2.1 to 3.2.3 after first encountering the problem to see if that would resolve the issue; that process restarted all of those instances. I have also now tried the flushRouterConfig command, with the same results for the above examples.

Comment by Kaloian Manassiev [ 04/Mar/16 ]

Hi Robert,

The error which you are seeing indicates that the replica set named 'rs0' (which stores the key PLXYOAOBYD) was unavailable for some time, to the point that the mongos instance gave up on monitoring it. Does this sound in line with what might have happened in the past?

That being said, my hypothesis is that the aggregation pipeline is more aggressive in contacting potentially failed hosts than the find command (different code path) and that's why it succeeds.

While we investigate further, would it be possible for you to run the 'flushRouterConfig' command and see if your find operations work? This is how to run the command:

use admin;
db.runCommand({flushRouterConfig: 1});

If this doesn't work, would it be possible for you to restart the mongos instance and see if that gets the find command to work?

Best regards,
-Kal.
