[SERVER-27793] 'exception while parsing isMaster reply: BSONElement: bad type 101' causes empty shard connection string to be written to config.shards Created: 23/Jan/17  Updated: 27/Oct/23  Resolved: 10/Mar/17

Status: Closed
Project: Core Server
Component/s: MMAPv1
Affects Version/s: 2.6.12, 3.0.14
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jing Wu Assignee: Kaloian Manassiev
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongos.log     Text File mongos_a32.log    
Issue Links:
Related
is related to SERVER-27949 Failure to parse isMaster response in... Closed
Operating System: ALL
Participants:

 Description   

I have a 22-shard cluster that has been running for 2 years.

Without any apparent reason, one of the shards was missing its host:port info, which caused application read failures.

Can you see if it's a known issue?

Mongo is running on V2.6.12.

mongos> db.shards.find();

{ "_id" : "TT-RS01", "host" : "TT-RS01/tdb-a01.white.xxx.com:4308,tdb-m01.white.xxx.com:4308" }
{ "_id" : "TT-RS02", "host" : "TT-RS02/tdb-a02.white.xxx.com:4308,tdb-m02.white.xxx.com:4308" }
{ "_id" : "TT-RS03", "host" : "TT-RS03/tdb-a03.white.xxx.com:4308,tdb-m03.white.xxx.com:4308" }
{ "_id" : "TT-RS05", "host" : "TT-RS05/tdb-a05.white.xxx.com:4308,tdb-m05.white.xxx.com:4308" }
{ "_id" : "TT-RS06", "host" : "TT-RS06/tdb-a06.white.xxx.com:4308,tdb-m06.white.xxx.com:4308" }
{ "_id" : "TT-RS04", "host" : "TT-RS04/tdb-a04.white.xxx.com:4308,tdb-m04.white.xxx.com:4308" }
{ "_id" : "TT-RS07", "host" : "TT-RS07/" }
{ "_id" : "TT-RS08", "host" : "TT-RS08/tdb-a08.white.xxx.com:4308,tdb-m08.white.xxx.com:4308" }
{ "_id" : "TT-RS09", "host" : "TT-RS09/tdb-a09.white.xxx.com:4308,tdb-m09.white.xxx.com:4308" }
{ "_id" : "TT-RS10", "host" : "TT-RS10/tdb-a10.white.xxx.com:4308,tdb-m10.white.xxx.com:4308" }
{ "_id" : "TT-RS11", "host" : "TT-RS11/tdb-a11.white.xxx.com:4308,tdb-m11.white.xxx.com:4308" }
{ "_id" : "TT-RS12", "host" : "TT-RS12/tdb-a12.white.xxx.com:4308,tdb-m12.white.xxx.com:4308" }
{ "_id" : "TT-RS13", "host" : "TT-RS13/tdb-a13.white.xxx.com:4308,tdb-m13.white.xxx.com:4308" }
{ "_id" : "TT-RS14", "host" : "TT-RS14/tdb-a14.white.xxx.com:4308,tdb-m14.white.xxx.com:4308" }
{ "_id" : "TT-RS15", "host" : "TT-RS15/tdb-a15.white.xxx.com:4308,tdb-m15.white.xxx.com:4308" }
{ "_id" : "TT-RS16", "host" : "TT-RS16/tdb-a16.white.xxx.com:4308,tdb-m16.white.xxx.com:4308" }
{ "_id" : "TT-RS17", "host" : "TT-RS17/tdb-a17.white.xxx.com:4308,tdb-m17.white.xxx.com:4308" }
{ "_id" : "TT-RS18", "host" : "TT-RS18/tdb-a18.white.xxx.com:4308,tdb-m18.white.xxx.com:4308" }
{ "_id" : "TT-RS19", "host" : "TT-RS19/tdb-a19.white.xxx.com:4308,tdb-m19.white.xxx.com:4308" }
{ "_id" : "TT-RS20", "host" : "TT-RS20/tdb-a20.white.xxx.com:4308,tdb-m20.white.xxx.com:4308" }

Type "it" for more
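For anyone hitting the same symptom, the broken entry can be found mechanically rather than by eyeballing the list: parse each config.shards host value ("SETNAME/host1:port,host2:port") and flag documents whose member list is empty. A minimal illustrative sketch in plain JavaScript (it mirrors the format shown above, not MongoDB's internal ConnectionString code; function names are hypothetical):

```javascript
// Parse a config.shards "host" value of the form "SETNAME/host1:port,host2:port"
// and return the set name plus the list of host:port members.
function parseShardHost(hostField) {
  const slash = hostField.indexOf("/");
  const members = (slash === -1 ? hostField : hostField.slice(slash + 1))
    .split(",")
    .filter((h) => h.length > 0); // "TT-RS07/" yields an empty member list
  return { setName: slash === -1 ? null : hostField.slice(0, slash), members };
}

// Flag shard documents with no members, like { _id: "TT-RS07", host: "TT-RS07/" }.
function findEmptyShards(shardDocs) {
  return shardDocs.filter((doc) => parseShardHost(doc.host).members.length === 0);
}

const shards = [
  { _id: "TT-RS06", host: "TT-RS06/tdb-a06.white.xxx.com:4308,tdb-m06.white.xxx.com:4308" },
  { _id: "TT-RS07", host: "TT-RS07/" },
];
console.log(findEmptyShards(shards).map((d) => d._id)); // → [ 'TT-RS07' ]
```

The same check can be run against the output of db.shards.find() in the shell, since it is the same document shape.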

I manually updated the shard07 entry:

db.shards.update(
    { "_id" : "TT-RS07" },
    { "_id" : "TT-RS07", "host" : "TT-RS07/tdb-a07.white.xxx.com:4308,tdb-m07.white.xxx.com:4308" }
);
mongos> db.shards.find();

{ "_id" : "TT-RS01", "host" : "TT-RS01/tdb-a01.white.xxx.com:4308,tdb-m01.white.xxx.com:4308" }
{ "_id" : "TT-RS02", "host" : "TT-RS02/tdb-a02.white.xxx.com:4308,tdb-m02.white.xxx.com:4308" }
{ "_id" : "TT-RS03", "host" : "TT-RS03/tdb-a03.white.xxx.com:4308,tdb-m03.white.xxx.com:4308" }
{ "_id" : "TT-RS05", "host" : "TT-RS05/tdb-a05.white.xxx.com:4308,tdb-m05.white.xxx.com:4308" }
{ "_id" : "TT-RS06", "host" : "TT-RS06/tdb-a06.white.xxx.com:4308,tdb-m06.white.xxx.com:4308" }
{ "_id" : "TT-RS04", "host" : "TT-RS04/tdb-a04.white.xxx.com:4308,tdb-m04.white.xxx.com:4308" }
{ "_id" : "TT-RS07", "host" : "TT-RS07/tdb-a07.white.xxx.com:4308,tdb-m07.white.xxx.com:4308" }
{ "_id" : "TT-RS08", "host" : "TT-RS08/tdb-a08.white.xxx.com:4308,tdb-m08.white.xxx.com:4308" }
{ "_id" : "TT-RS09", "host" : "TT-RS09/tdb-a09.white.xxx.com:4308,tdb-m09.white.xxx.com:4308" }
{ "_id" : "TT-RS10", "host" : "TT-RS10/tdb-a10.white.xxx.com:4308,tdb-m10.white.xxx.com:4308" }
{ "_id" : "TT-RS11", "host" : "TT-RS11/tdb-a11.white.xxx.com:4308,tdb-m11.white.xxx.com:4308" }
{ "_id" : "TT-RS12", "host" : "TT-RS12/tdb-a12.white.xxx.com:4308,tdb-m12.white.xxx.com:4308" }
{ "_id" : "TT-RS13", "host" : "TT-RS13/tdb-a13.white.xxx.com:4308,tdb-m13.white.xxx.com:4308" }
{ "_id" : "TT-RS14", "host" : "TT-RS14/tdb-a14.white.xxx.com:4308,tdb-m14.white.xxx.com:4308" }
{ "_id" : "TT-RS15", "host" : "TT-RS15/tdb-a15.white.xxx.com:4308,tdb-m15.white.xxx.com:4308" }
{ "_id" : "TT-RS16", "host" : "TT-RS16/tdb-a16.white.xxx.com:4308,tdb-m16.white.xxx.com:4308" }
{ "_id" : "TT-RS17", "host" : "TT-RS17/tdb-a17.white.xxx.com:4308,tdb-m17.white.xxx.com:4308" }
{ "_id" : "TT-RS18", "host" : "TT-RS18/tdb-a18.white.xxx.com:4308,tdb-m18.white.xxx.com:4308" }
{ "_id" : "TT-RS19", "host" : "TT-RS19/tdb-a19.white.xxx.com:4308,tdb-m19.white.xxx.com:4308" }
{ "_id" : "TT-RS20", "host" : "TT-RS20/tdb-a20.white.xxx.com:4308,tdb-m20.white.xxx.com:4308" }

Type "it" for more



 Comments   
Comment by Kelsey Schubert [ 10/Mar/17 ]

Hi jing66wu@yahoo.com,

As per Kaloian's last comment, I'm resolving this ticket since the issue appears to be caused by a hardware defect. Please let us know if you see the error again after removing a32 from your cluster.

Kind regards,
Thomas

Comment by Kaloian Manassiev [ 09/Feb/17 ]

The error messages will look the same, but I am confident that if a32 is out of the picture you will not experience it anymore.

Comment by Jing Wu [ 09/Feb/17 ]

Kal,

We are setting up a monitoring job to watch mongos.log and capture this issue. Since I upgraded the server to V3.2.12, I am not sure whether the error message will change.

Can you let me know what the mongos error messages will look like after the config.shards collection is updated with an empty host:port?

Thanks,

Jing

Comment by Kaloian Manassiev [ 08/Feb/17 ]

Thanks - this might indicate some hardware problem with a32.

Please see if the problem continues happening with this mongos host taken offline. In the meantime I have filed SERVER-27949 to ensure that one faulty mongos cannot write bad data impacting other cluster members.

-Kal.

Comment by Jing Wu [ 08/Feb/17 ]

Kal,

All 3 occurrences happened on the mongos on a32. The 2nd and 3rd were not the isMaster() issue, but they did complain about bad data; see the attached mongos_a32 log.

Thanks,

Jing

Comment by Kaloian Manassiev [ 08/Feb/17 ]

Yes, you can directly upgrade to 3.2 and continue to use the legacy 3-config-server setup with MMAPv1; there is no requirement to change the storage engine to WT. Switching the config servers to WT is only required when changing the config servers to use replica sets (which we highly recommend you do, but is not required).

... since the a32 host is the problem host, the SA already took it out of production. Will let you know if it happens again.

That was another suspicion I had - that due to bad hardware, memory contents somehow get corrupted. Is it always the same mongos which is experiencing the "exception while parsing isMaster reply" error?

Best regards,
-Kal.

Comment by Jing Wu [ 07/Feb/17 ]

Kal,

Regarding the upgrade to V3.2: can the config DB still use the MMAPv1 engine while the mongods use the WT engine?
The documentation is not very clear about it.

I plan to upgrade the Mongo cluster to V3.2.12 with the MMAPv1 engine instead of the WT engine. Since the a32 host is the problem host, the SA already took it out of production. Will let you know if it happens again.
Thanks,

Jing

Comment by Kaloian Manassiev [ 07/Feb/17 ]

There is no significant difference between sharding in 2.6 and 3.0. Version 3.2 is where we did more significant refactoring and cleanup of the sharding code. I suspect that this upgrade by itself should take care of the issue you are experiencing, but even if it doesn't, like I mentioned it contains additional checks which are extremely hard to deduce from looking at logs.

If you would like us to continue investigating this, you will have to upgrade to at least 3.2.12.

-Kal.

Comment by Jing Wu [ 07/Feb/17 ]

Kal,

The issue on 1/20 was on V2.6.12, but the 2nd and 3rd times it happened, the mongo clusters and mongos were all on V3.0.14.
Please look at the logs again.

Thanks,

Jing

Comment by Kaloian Manassiev [ 07/Feb/17 ]

Hi jing66wu@yahoo.com,

Thanks for providing the latest logs. The information they contain is more useful and from there I can see that there was a problem parsing the isMaster response from the shard primary and that's why it was interpreted as if there are no hosts:

2017-01-20T17:46:55.492-0500 [ReplicaSetMonitorWatcher] exception while parsing isMaster reply: BSONElement: bad type 101 { setName: "TT-RS13", setVersion: 5, ismaster: false, secondary: true, hosts: [ "tdb-m13.white.xxx.com:4308", "tdb-a13.white.xxx.com:4308" ], arbiters: [ "itsdb-l03.white.xxx.com:27036", "itsdb-m03.white.xxx.com:27036" ], primary: "tdb-a13.white.xxx.com:4308", me: "tdb-m13.white.xxx.com:4308", maxBsonObjectSize: 16777216, maxMessageSizeBytes: 48000000, maxWriteBatchSize: 1000, localTime: new Date(1484952415476), maxWireVersion: 2, minWireVersion: 0, ok: 1.0 }

This indicates message content corruption for some reason, but I am unable to pinpoint the cause from just the logs, nor am I aware of cases where in-flight messages would get corrupted.

Like I mentioned before, there have been significant changes in the replica set monitoring starting in version 3.2, and there are also tighter checks, which I am hopeful will pinpoint the place where data gets corrupted earlier so we have something better to work with. I urge you to upgrade to at least that version and see whether the problem still reproduces.
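Note that the logged isMaster document itself looks structurally sound (setName, hosts, and primary are all present), which is consistent with the corruption happening at the BSON/wire level rather than in the field values. The kind of field-level sanity check that would catch an empty or malformed host list before it is acted on can be sketched as follows (illustrative JavaScript, not MongoDB's internal ReplicaSetMonitor code; names are hypothetical):

```javascript
// Return a list of problems with an isMaster reply; an empty list means the
// reply's host information is safe to use. Illustrative sketch only.
function validateIsMasterReply(reply) {
  const errors = [];
  if (reply.ok !== 1) errors.push("ok != 1");
  if (typeof reply.setName !== "string" || reply.setName.length === 0)
    errors.push("missing setName");
  if (!Array.isArray(reply.hosts) || reply.hosts.length === 0)
    errors.push("empty hosts array");
  else
    for (const h of reply.hosts)
      if (!/^[^:]+:\d+$/.test(h)) errors.push(`malformed host "${h}"`);
  return errors;
}

// The reply logged in this ticket, reduced to the checked fields, passes:
const reply = {
  setName: "TT-RS13",
  ismaster: false,
  secondary: true,
  hosts: ["tdb-m13.white.xxx.com:4308", "tdb-a13.white.xxx.com:4308"],
  primary: "tdb-a13.white.xxx.com:4308",
  ok: 1.0,
};
console.log(validateIsMasterReply(reply)); // → []
```

A monitor that refuses to write a new connection string when such a check fails would have prevented the empty "TT-RS07/" entry from reaching config.shards.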

The version of mongos you are running is still 2.6.12, which is no longer supported:

2017-01-20T17:07:53.401-0500 [mongosMain] MongoS version 2.6.12 starting: pid=10381 port=4356 64-bit host=its-a32.white.xxx.com (--help for usage)

Best regards,
-Kal.

Comment by Jing Wu [ 07/Feb/17 ]

Kal,

Can you look at the uploaded logs and let me know what you find?

Thanks,

Jing

Comment by Jing Wu [ 07/Feb/17 ]

Kal,

All those updates happened only on the a32 host; the rest of the mongos hosts have no messages matching "changing hosts to" at all.

I just looked at the mongos log. If it detected an issue where a certain shard was unavailable and updated the metadata, why didn't it update the value back?

Please see attached mongos log.

Thanks,

Jing

Comment by Jing Wu [ 07/Feb/17 ]

Kal,

I asked the SA to check the mongos logs before. I just searched myself today and did find the metadata update messages in one of the mongos logs:

2017-01-20T20:31:41.098-0500 [ReplicaSetMonitorWatcher] changing hosts to TT-RS07/ from TT-RS07/tdb-a07.white.xxx.com:4308,tdb-m07.white.xxx.com:4308
2017-01-25T20:08:22.981-0500 [Balancer] MaxChunkSize changing from 64MB to 1852990835MB
2017-02-02T23:41:26.746-0500 I NETWORK [ReplicaSetMonitorWatcher] changing hosts to TT-RS13/ from TT-RS13/tdb-a13.white.xxx.com:4308,tdb-m13.white.xxx.com:4308
2017-02-05T04:44:40.508-0500 I NETWORK [ReplicaSetMonitorWatcher] changing hosts to TT-RS04/ from TT-RS04/tdb-a04.white.xxx.com:4308,tdb-m04.white.xxx.com:4308

Thanks,

Jing

Comment by Jing Wu [ 06/Feb/17 ]

Kal,

The logging for error messages changed in V3.0.14. What messages should I search for metadata updates? Will the metadata update messages show up in one of the mongos logs or in all of them?

I am in the process of upgrading to V3.2; I have to go to V3.0.14 first, then convert to the WT engine, then to V3.2. It takes time for a production system.

Thanks,

Jing

Comment by Kaloian Manassiev [ 06/Feb/17 ]

Hi jing66wu@yahoo.com,

I am unable to pinpoint a specific piece of code or see a bug which would cause a replica set to return an empty list of nodes. Also I do not see any shard metadata updates happening in the log, which you attached.

In MongoDB version 3.2 we significantly improved the replica set monitoring code, fixed race conditions and added better diagnostics. Therefore, I strongly recommend that you upgrade to the latest version of MongoDB 3.2.

Would you please upgrade to MongoDB 3.2.12 and confirm if this issue is resolved?

Best regards,
-Kal.

Comment by Jing Wu [ 05/Feb/17 ]

2017-02-05T06:24:19.814-0500 I NETWORK  [Balancer] scoped connection to tconfigdb-m02.db.xxx.com:4307,tconfigdb-d01.db.xxx.com:4307,tconfigdb-m01.db.xxx.com:4307 not being returned to the pool
2017-02-05T06:24:19.815-0500 I SHARDING [Balancer] caught exception while doing balance: Empty host component parsing HostAndPort from ""
2017-02-05T06:24:24.169-0500 W NETWORK  [conn965] db exception when initializing on TT-RS22:TT-RS22/tdb-a42.white.xxx.com:4308,tdb-m22.white.xxx.com:4308, current connection state is { state: { conn: "TT-RS22/tdb-a42.white.xxx.com:4308,tdb-m22.white.xxx.com:4308", vinfo: "historyDb.History @ 3290|1||55c0dbaf5b591f0b0cfcee5f", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 9 Empty host component parsing HostAndPort from ""
2017-02-05T06:24:24.174-0500 I -        [conn965] Assertion failure _name.size() src/mongo/s/shard.h 88
2017-02-05T06:24:24.183-0500 I CONTROL  [conn965]
 0xa980a2 0xa3a8a9 0xa1e22a 0x96a831 0x9e0ed1 0x8ff15d 0x9005cd 0x908eb3 0x99de61 0x9200b7 0x9b09cc 0x99d40d 0x5e475a 0xa46f49 0x3326607aa1 0x33262e8aad
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"6980A2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"63A8A9","s":"_ZN5mongo10logContextEPKc"},{"b":"400000","o":"61E22A","s":"_ZN5mongo12verifyFailedEPKcS1_j"},{"b":"400000","o":"56A831","s":"_ZNK5mongo21DBClientShardResolver15chooseWriteHostERKSsPNS_16ConnectionStringE"},{"b":"400000","o":"5E0ED1","s":"_ZN5mongo14BatchWriteExec12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseEPNS_19BatchWriteExecStatsE"},{"b":"400000","o":"4FF15D","s":"_ZN5mongo13ClusterWriter10shardWriteERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE"},{"b":"400000","o":"5005CD","s":"_ZN5mongo13ClusterWriter5writeERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE"},{"b":"400000","o":"508EB3","s":"_ZN5mongo15ClusterWriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb"},{"b":"400000","o":"59DE61","s":"_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb"},{"b":"400000","o":"5200B7","s":"_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi"},{"b":"400000","o":"5B09CC","s":"_ZN5mongo8Strategy15clientCommandOpERNS_7RequestE"},{"b":"400000","o":"59D40D","s":"_ZN5mongo7Request7processEi"},{"b":"400000","o":"1E475A","s":"_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE"},{"b":"400000","o":"646F49","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"3326600000","o":"7AA1"},{"b":"3326200000","o":"E8AAD","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.0.14", "gitVersion" : "08352afcca24bfc145240a0fac9d28b978ab77f3", "uname" : { "sysname" : "Linux", "release" : "2.6.32-642.13.1.el6.x86_64", "version" : "#1 SMP Wed Jan 11 20:56:24 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "E2E4D3C2B1703CEF484CCFE185BDA8AB474CCF52" }, { "b" : "7FFEC626E000", "elfType" : 3, "buildId" : 
"B7D50C2161B6674BD26DB5BA69B12A619488BBC5" }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "B3BD4C596D72FCBE4607C86FEEC14F47B46D0DCC" }, { "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "7CE5058BE3F2AE6EBC0DF61E790C3FCED1A34C7B" }, { "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "096F60DC43484ECA364206125CED7D563FF9D027" }, { "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "86FE5BC1F46B8F8AA9A7A479FF991900DB93F720" }, { "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "045D39F19533291EBD72D0EE0247F9D49BE2521E" }, { "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "B60EF3FCE5D4D2D8BAD2585D5CAAA1167B35DBFD" }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "9A6E4BDFA184364D81F7DFD789474C3FB8F98A00" }, { "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "F14F3F5B55FC325FDFA7FB6F22F73CA1C6977379" }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "DB04BDF55FE66BADD1081D833C17B4881E0EEDB9" }, { "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "9B852585C66329AA02EFB28497E652A40F538E78" }, { "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "EF3AACAFD6BF71BB861F194C1559153FB0B020E2" }, { "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "152E2C18A7A2145021A8A879A01A82EE134E3946" }, { "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "DDE6774979156442185836150FC0785170F8001F" }, { "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "A23DAFBCE170763BF1E836A8B26113F9CD20E0DA" }, { "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "DDF6449707FD4C49DDE32A293EEE9AC218BFC460" }, { "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : 
"B4576BE308DDCF7BC31F7304E4734C3D846D0236" } ] }}
 mongos(_ZN5mongo15printStackTraceERSo+0x32) [0xa980a2]
 mongos(_ZN5mongo10logContextEPKc+0xE9) [0xa3a8a9]
 mongos(_ZN5mongo12verifyFailedEPKcS1_j+0xDA) [0xa1e22a]
 mongos(_ZNK5mongo21DBClientShardResolver15chooseWriteHostERKSsPNS_16ConnectionStringE+0x4B1) [0x96a831]
 mongos(_ZN5mongo14BatchWriteExec12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseEPNS_19BatchWriteExecStatsE+0x351) [0x9e0ed1]
 mongos(_ZN5mongo13ClusterWriter10shardWriteERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x30D) [0x8ff15d]
 mongos(_ZN5mongo13ClusterWriter5writeERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x5FD) [0x9005cd]
 mongos(_ZN5mongo15ClusterWriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x2C3) [0x908eb3]
 mongos(_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x311) [0x99de61]
 mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x227) [0x9200b7]
 mongos(_ZN5mongo8Strategy15clientCommandOpERNS_7RequestE+0x1DC) [0x9b09cc]
 mongos(_ZN5mongo7Request7processEi+0x60D) [0x99d40d]
 mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x5A) [0x5e475a]
 mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x339) [0xa46f49]
 libpthread.so.0(+0x7AA1) [0x3326607aa1]
 libc.so.6(clone+0x6D) [0x33262e8aad]

Comment by Jing Wu [ 05/Feb/17 ]

Hi Kal,

The current Mongo version is V3.0.14.
It's very strange: the issue happened again this morning on shard04, a different shard again.
Can you find out the root cause of this missing shard information?

Thanks,

Jing

Comment by Jing Wu [ 04/Feb/17 ]

Kal,

The first time is shard07, this time is shard13.

There are over 100 mongos for this cluster; please see the attached file from one of the mongos logs.

You can see the error messages from 1/20 when shard07 was removed and from 2/2 when shard13 was removed.

Thanks,

Jing

Comment by Kaloian Manassiev [ 03/Feb/17 ]

Hi jing66wu@yahoo.com,

Apologies for the delayed reply.

Is it always the same shard, which is experiencing this problem (TT-RS07)? Have you confirmed with your SA that as part of some maintenance they are not manually modifying the config.shards collection or that you don't have some rogue process doing that?

If you are certain that the collection is not being modified outside of the sharding system, can you please attach the mongos log files from the time when the above assertion started showing up, and also the logs from the mongods which comprise shard TT-RS07?

Thanks in advance.

-Kal.

Comment by Jing Wu [ 03/Feb/17 ]

Can someone take a look again to see why it's happening?

Thanks,

Jing

Comment by Jing Wu [ 03/Feb/17 ]

2017-02-02T23:41:36.715-0500 I NETWORK [Balancer] scoped connection to itsconfigdb-m02.db.aol.com:4307,itsconfigdb-d01.db.aol.com:4307,itsconfigdb-m01.db.aol.com:4307 not being returned to the pool
2017-02-02T23:41:36.715-0500 I SHARDING [Balancer] caught exception while doing balance: Empty host component parsing HostAndPort from ""
2017-02-02T23:41:45.245-0500 W NETWORK [conn725] db exception when initializing on TT-RS20:TT-RS20/tdb-a20.white.xxx.com:4308,tdb-m20.white.xxx.com:4308, current connection state is { state: { conn: "TT-RS20/tdb-a20.white.aol.com:4308,tdb-m20.white.xxx.com:4308", vinfo: "historyDb.History @ 3285|25||55c0dbaf5b591f0b0cfcee5f", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 9 Empty host component parsing HostAndPort from ""
2017-02-02T23:41:45.294-0500 I - [conn725] Assertion failure _name.size() src/mongo/s/shard.h 88
2017-02-02T23:41:45.311-0500 I CONTROL [conn725]
0xa980a2 0xa3a8a9 0xa1e22a 0x96a831 0x9e0ed1 0x8ff15d 0x9005cd 0x908eb3 0x99de61 0x9200b7 0x9b09cc 0x99d40d 0x5e475a 0xa46f49 0x3326607aa1 0x33262e8aad
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"6980A2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"63A8A9","s":"_ZN5mongo10logContextEPKc"},{"b":"400000","o":"61E22A","s":"_ZN5mongo12verifyFailedEPKcS1_j"},{"b":"400000","o":"56A831","s":"_ZNK5mongo21DBClientShardResolver15chooseWriteHostERKSsPNS_16ConnectionStringE"},{"b":"400000","o":"5E0ED1","s":"_ZN5mongo14BatchWriteExec12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseEPNS_19BatchWriteExecStatsE"},{"b":"400000","o":"4FF15D","s":"_ZN5mongo13ClusterWriter10shardWriteERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE"},{"b":"400000","o":"5005CD","s":"_ZN5mongo13ClusterWriter5writeERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE"},{"b":"400000","o":"508EB3","s":"_ZN5mongo15ClusterWriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb"},{"b":"400000","o":"59DE61","s":"_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb"},{"b":"400000","o":"5200B7","s":"_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi"},{"b":"400000","o":"5B09CC","s":"_ZN5mongo8Strategy15clientCommandOpERNS_7RequestE"},{"b":"400000","o":"59D40D","s":"_ZN5mongo7Request7processEi"},{"b":"400000","o":"1E475A","s":"_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE"},{"b":"400000","o":"646F49","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"3326600000","o":"7AA1"},{"b":"3326200000","o":"E8AAD","s":"clone"}],
"processInfo":{ "mongodbVersion" : "3.0.14", "gitVersion" : "08352afcca24bfc145240a0fac9d28b978ab77f3", "uname" : { "sysname" : "Linux", "release" : "2.6.32-642.13.1.el6.x86_64", "version" : "#1 SMP Wed Jan 11 20:56:24 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "E2E4D3C2B1703CEF484CCFE185BDA8AB474CCF52" }, { "b" : "7FFEC626E000", "elfType" : 3, "buildId" : "B7D50C2161B6674BD26DB5BA69B12A619488BBC5" }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "B3BD4C596D72FCBE4607C86FEEC14F47B46D0DCC" }, { "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "7CE5058BE3F2AE6EBC0DF61E790C3FCED1A34C7B" }, { "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "096F60DC43484ECA364206125CED7D563FF9D027" }, { "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "86FE5BC1F46B8F8AA9A7A479FF991900DB93F720" }, { "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "045D39F19533291EBD72D0EE0247F9D49BE2521E" }, { "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "B60EF3FCE5D4D2D8BAD2585D5CAAA1167B35DBFD" }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "9A6E4BDFA184364D81F7DFD789474C3FB8F98A00" }, { "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "F14F3F5B55FC325FDFA7FB6F22F73CA1C6977379" }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "DB04BDF55FE66BADD1081D833C17B4881E0EEDB9" }, { "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "9B852585C66329AA02EFB28497E652A40F538E78" }, { "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "EF3AACAFD6BF71BB861F194C1559153FB0B020E2" }, { "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "152E2C18A7A2145021A8A879A01A82EE134E3946" }, { "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "DDE6774979156442185836150FC0785170F8001F" }, { "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "A23DAFBCE170763BF1E836A8B26113F9CD20E0DA" }, { "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "DDF6449707FD4C49DDE32A293EEE9AC218BFC460" }, { "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "B4576BE308DDCF7BC31F7304E4734C3D846D0236" } ] }}
mongos(_ZN5mongo15printStackTraceERSo+0x32) [0xa980a2]
mongos(_ZN5mongo10logContextEPKc+0xE9) [0xa3a8a9]
mongos(_ZN5mongo12verifyFailedEPKcS1_j+0xDA) [0xa1e22a]
mongos(_ZNK5mongo21DBClientShardResolver15chooseWriteHostERKSsPNS_16ConnectionStringE+0x4B1) [0x96a831]
mongos(_ZN5mongo14BatchWriteExec12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseEPNS_19BatchWriteExecStatsE+0x351) [0x9e0ed1]
mongos(_ZN5mongo13ClusterWriter10shardWriteERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x30D) [0x8ff15d]
mongos(_ZN5mongo13ClusterWriter5writeERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x5FD) [0x9005cd]
mongos(_ZN5mongo15ClusterWriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x2C3) [0x908eb3]
mongos(_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x311) [0x99de61]
mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x227) [0x9200b7]
mongos(_ZN5mongo8Strategy15clientCommandOpERNS_7RequestE+0x1DC) [0x9b09cc]
mongos(_ZN5mongo7Request7processEi+0x60D) [0x99d40d]
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x5A) [0x5e475a]
mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x339) [0xa46f49]
libpthread.so.0(+0x7AA1) [0x3326607aa1]
libc.so.6(clone+0x6D) [0x33262e8aad]

Comment by Jing Wu [ 03/Feb/17 ]

I upgraded V2.6.12 to V3.0.14 on Monday.

Today, the same issue happened again; shard13 was missing its host:port without reason:

{ "_id" : "TT-RS12", "host" : "ITT-RS12/tdb-a12.white.aol.com:4308,tdb-m12.white.xxx.com:4308" }
{ "_id" : "TT-RS13", "host" : "TT-RS13/" }
{ "_id" : "TT-RS14", "host" : "ITT-RS14/tdb-a14.white.aol.com:4308,tdb-m13.white.xxx.com:4308" }
Comment by Jing Wu [ 25/Jan/17 ]

Thanks Kal,

I asked the SA to search all the mongos he owns; he did not find any instances of the word "changing" in any of the mongos logs. He did a case-insensitive search.

We weren't aware of any network issue at that time. Each shard has 3 nodes, which are in different availability zones/datacenters. I just looked at the mongod logs; there were no heartbeat failure messages among the 3 nodes in shard07 either.

Jing

Comment by Kaloian Manassiev [ 25/Jan/17 ]

Hi jing66wu@yahoo.com,

There is a component running on each mongos node, which monitors the state of the shards' replica sets and whenever it notices a change, it writes the new connection string to config.shards. It is possible that there was a period during which none of the nodes for TT-RS07 were accessible and that's why it wrote an empty set of nodes.

There is logging around these changes, which should start with this message: changing hosts to <new hosts> from <old hosts>.

Would it be possible to grep through the logs on all your mongos and shards and see if such message appears for TT-RS07? If you find it, do you mind attaching the relevant log file?
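A scan along those lines can be sketched as follows (illustrative JavaScript; the "changing hosts to &lt;new hosts&gt; from &lt;old hosts&gt;" format is taken from this ticket's log excerpts, and the function name is hypothetical). It pulls out every host change and marks the ones where the new connection string has an empty member list, i.e. ends with a bare "SETNAME/":

```javascript
// Match ReplicaSetMonitor host-change log lines:
//   "... changing hosts to <new> from <old>"
const CHANGE_RE = /changing hosts to (\S+) from (\S+)/;

// Scan log lines and flag changes whose new connection string has no members.
function findSuspiciousChanges(logLines) {
  const hits = [];
  for (const line of logLines) {
    const m = CHANGE_RE.exec(line);
    if (!m) continue;
    const [, to, from] = m;
    hits.push({ line, to, from, emptied: to.endsWith("/") });
  }
  return hits;
}

// Sample lines adapted from this ticket's mongos log:
const sampleLines = [
  '2017-01-20T20:31:41.098-0500 [ReplicaSetMonitorWatcher] changing hosts to TT-RS07/ from TT-RS07/tdb-a07.white.xxx.com:4308,tdb-m07.white.xxx.com:4308',
  '2017-02-02T23:41:26.746-0500 I NETWORK [ReplicaSetMonitorWatcher] changing hosts to TT-RS13/ from TT-RS13/tdb-a13.white.xxx.com:4308,tdb-m13.white.xxx.com:4308',
];
console.log(findSuspiciousChanges(sampleLines).map((h) => h.to)); // → [ 'TT-RS07/', 'TT-RS13/' ]
```

A plain grep for "changing hosts to" across all mongos and shard logs achieves the same first step; the sketch only adds the empty-host flagging.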

Thanks in advance.

Best regards,
-Kal.

Comment by Jing Wu [ 23/Jan/17 ]

Here are the messages from the mongos log when it happened:

2017-01-20T20:31:43.980-0500 [Balancer] Assertion: 13110:HostAndPort: host is empty
2017-01-20T20:31:43.989-0500 [Balancer] 0xb17956 0xad1d58 0xab8e6b 0x73fe63 0x73e978 0xa29a3a 0xa2e164 0x96df19 0xabb355 0xb60be4 0x39790079d1 0x3978ce88fd
/opt/local/pkg/mongodb_current/bin/mongos(_ZN5mongo15printStackTraceERSo+0x26) [0xb17956]
/opt/local/pkg/mongodb_current/bin/mongos(_ZN5mongo10logContextEPKc+0x188) [0xad1d58]
/opt/local/pkg/mongodb_current/bin/mongos(_ZN5mongo11msgassertedEiPKc+0xcb) [0xab8e6b]
/opt/local/pkg/mongodb_current/bin/mongos(_ZN5mongo11HostAndPortC1ERKSs+0xe3) [0x73fe63]
/opt/local/pkg/mongodb_current/bin/mongos(_ZN5mongo16ConnectionString12_fillServersESs+0x2b8) [0x73e978]
/opt/local/pkg/mongodb_current/bin/mongos(_ZN5mongo5Shard8_setAddrERKSs+0x9a) [0xa29a3a]
/opt/local/pkg/mongodb_current/bin/mongos(_ZN5mongo15StaticShardInfo6reloadEv+0x594) [0xa2e164]
/opt/local/pkg/mongodb_current/bin/mongos(_ZN5mongo8Balancer3runEv+0x539) [0x96df19]
/opt/local/pkg/mongodb_current/bin/mongos(_ZN5mongo13BackgroundJob7jobBodyEv+0x105) [0xabb355]
/opt/local/pkg/mongodb_current/bin/mongos() [0xb60be4]
/lib64/libpthread.so.0() [0x39790079d1]
/lib64/libc.so.6(clone+0x6d) [0x3978ce88fd]
2017-01-20T20:31:43.989-0500 [Balancer] scoped connection to tdbconfig-m02.db.xxx.com:4307,tdbconfig-d01.db.xxx.com:4307,tdbconfig-m01.db.xxx.com:4307 not being returned to the pool

Generated at Thu Feb 08 04:16:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.