[SERVER-27793] 'exception while parsing isMaster reply: BSONElement: bad type 101' causes empty shard connection string to be written to config.shards Created: 23/Jan/17 Updated: 27/Oct/23 Resolved: 10/Mar/17 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | MMAPv1 |
| Affects Version/s: | 2.6.12, 3.0.14 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jing Wu | Assignee: | Kaloian Manassiev |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Operating System: | ALL |
| Participants: |
| Description |
I have a 22-shard cluster that has been running for 2 years. Without any apparent reason, one of the shards was missing its host:port info, which caused application read failures. Can you see if this is a known issue? Mongo is running on V2.6.12.

mongos> db.shards.find();
{ "_id" : "TT-RS01", "host" : "TT-RS01/tdb-a01.white.xxx.com:4308,tdb-m01.white.xxx.com:4308" }
{ "_id" : "TT-RS02", "host" : "TT-RS02/tdb-a02.white.xxx.com:4308,tdb-m02.white.xxx.com:4308" }
{ "_id" : "TT-RS03", "host" : "TT-RS03/tdb-a03.white.xxx.com:4308,tdb-m03.white.xxx.com:4308" }
{ "_id" : "TT-RS05", "host" : "TT-RS05/tdb-a05.white.xxx.com:4308,tdb-m05.white.xxx.com:4308" }
{ "_id" : "TT-RS06", "host" : "TT-RS06/tdb-a06.white.xxx.com:4308,tdb-m06.white.xxx.com:4308" }
{ "_id" : "TT-RS04", "host" : "TT-RS04/tdb-a04.white.xxx.com:4308,tdb-m04.white.xxx.com:4308" }
{ "_id" : "TT-RS07", "host" : "TT-RS07/" }
{ "_id" : "TT-RS08", "host" : "TT-RS08/tdb-a08.white.xxx.com:4308,tdb-m08.white.xxx.com:4308" }
{ "_id" : "TT-RS09", "host" : "TT-RS09/tdb-a09.white.xxx.com:4308,tdb-m09.white.xxx.com:4308" }
{ "_id" : "TT-RS10", "host" : "TT-RS10/tdb-a10.white.xxx.com:4308,tdb-m10.white.xxx.com:4308" }
{ "_id" : "TT-RS11", "host" : "TT-RS11/tdb-a11.white.xxx.com:4308,tdb-m11.white.xxx.com:4308" }
{ "_id" : "TT-RS12", "host" : "TT-RS12/tdb-a12.white.xxx.com:4308,tdb-m12.white.xxx.com:4308" }
{ "_id" : "TT-RS13", "host" : "TT-RS13/tdb-a13.white.xxx.com:4308,tdb-m13.white.xxx.com:4308" }
{ "_id" : "TT-RS14", "host" : "TT-RS14/tdb-a14.white.xxx.com:4308,tdb-m14.white.xxx.com:4308" }
{ "_id" : "TT-RS15", "host" : "TT-RS15/tdb-a15.white.xxx.com:4308,tdb-m15.white.xxx.com:4308" }
{ "_id" : "TT-RS16", "host" : "TT-RS16/tdb-a16.white.xxx.com:4308,tdb-m16.white.xxx.com:4308" }
{ "_id" : "TT-RS17", "host" : "TT-RS17/tdb-a17.white.xxx.com:4308,tdb-m17.white.xxx.com:4308" }
{ "_id" : "TT-RS18", "host" : "TT-RS18/tdb-a18.white.xxx.com:4308,tdb-m18.white.xxx.com:4308" }
{ "_id" : "TT-RS19", "host" : "TT-RS19/tdb-a19.white.xxx.com:4308,tdb-m19.white.xxx.com:4308" }
{ "_id" : "TT-RS20", "host" : "TT-RS20/tdb-a20.white.xxx.com:4308,tdb-m20.white.xxx.com:4308" }
Type "it" for more

I did manually update the shard07 entry:
, {"_id": "TT-RS07", "host": "TT-RS07/tdb-a07.white.xxx.com:4308,tdb-m07.white.xxx.com:4308"});
Type "it" for more
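For readability, here is a hedged reconstruction of the manual fix described above; the original command was cut off in the paste, and this assumes it was a db.shards.update() replacement run against the config database from a mongos (host names taken from the listing above). Manually editing config metadata like this is normally discouraged and is shown here only to make the truncated fragment readable.

mongos> use config
mongos> db.shards.update({ "_id" : "TT-RS07" }, { "_id" : "TT-RS07", "host" : "TT-RS07/tdb-a07.white.xxx.com:4308,tdb-m07.white.xxx.com:4308" });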
| Comments |
| Comment by Kelsey Schubert [ 10/Mar/17 ] |
As per Kaloian's last comment, I'm resolving this ticket since the issue appears to be caused by a hardware defect. Please let us know if you see the error again after removing a32 from your cluster. Kind regards,
| Comment by Kaloian Manassiev [ 09/Feb/17 ] |
The error messages will look the same, but I am confident that if a32 is out of the picture you will not experience it anymore.
| Comment by Jing Wu [ 09/Feb/17 ] |
Kal, We are setting up a monitoring job to watch mongos.log to capture this issue. Since I upgraded the servers to V3.2.12, I am not sure if the error message will change. Can you let me know what the mongos error messages will look like after the config.shards collection is updated with an empty host:port? Thanks, Jing
| Comment by Kaloian Manassiev [ 08/Feb/17 ] |
Thanks - this might indicate some hardware problem with a32. Please see if the problem continues happening with this mongos host taken offline. In the meantime I have filed -Kal.
| Comment by Jing Wu [ 08/Feb/17 ] |
Kal, The 3 occurrences all happened on the mongos on a32. The 2nd and 3rd were not the isMaster() issue, but did complain about bad data; see the attached mongos_a32 log. Thanks, Jing
| Comment by Kaloian Manassiev [ 08/Feb/17 ] |
Yes, you can directly upgrade to 3.2 and continue to use the legacy 3-config-server setup with MMAPv1; there is no requirement to change the storage engine to WT. Switching the config servers to WT is only required when changing the config servers to use a replica set (which we highly recommend you do, but it is not required).
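As an editorial aside (not part of the original comment): once a node is on 3.0 or later, you can confirm which storage engine it is actually running from the mongo shell via the standard serverStatus output, for example:

> db.serverStatus().storageEngine.name
mmapv1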
That was another suspicion I had - that due to bad hardware, memory contents somehow get corrupted. Is it always the same mongos which is experiencing the "exception while parsing isMaster reply" error? Best regards,
| Comment by Jing Wu [ 07/Feb/17 ] |
Kal, Regarding the upgrade to V3.2, can the config db still use the MMAPv1 engine while the mongod uses the WT engine? I plan to upgrade the Mongo cluster to V3.2.12 with the MMAPv1 engine instead of the WT engine. Since the a32 host is the problem host, the SA has already taken it out of production. Will let you know if it happens again. Jing
| Comment by Kaloian Manassiev [ 07/Feb/17 ] |
There is no significant difference between sharding in 2.6 and 3.0. Version 3.2 is where we did more significant refactoring and cleanup of the sharding code. I suspect that this upgrade by itself should take care of the issue you are experiencing, but even if it doesn't, like I mentioned it contains additional checks that catch problems which are extremely hard to deduce from looking at logs. If you would like us to continue investigating this, you will have to upgrade to at least 3.2.12. -Kal.
| Comment by Jing Wu [ 07/Feb/17 ] |
Kal, The issue on 1/20 was on V2.6.12, but the 2nd and 3rd times it happened, the mongo clusters and mongos were all on V3.0.14. Thanks, Jing
| Comment by Kaloian Manassiev [ 07/Feb/17 ] |
Thanks for providing the latest logs. The information they contain is more useful, and from there I can see that there was a problem parsing the isMaster response from the shard primary, and that's why it was interpreted as if there are no hosts:
This indicates message content corruption for some reason, but I am unable to pinpoint the cause from just the logs, nor am I aware of cases where in-flight messages would get corrupted. Like I mentioned before, there have been significant changes in the replica set monitoring starting in version 3.2, and there are also tighter checks, which I am hopeful will pinpoint the place where data gets corrupted earlier so we have something better to work with. I urge you to upgrade to at least that version and see whether the problem still reproduces. The version of mongos which you are running is still 2.6.12, and it is no longer supported:
Best regards,
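For context (an editorial sketch, not part of the original comment): the reply being parsed here is the standard isMaster response that every replica-set member returns, and it is the "hosts" array in that reply that the mongos replica set monitor uses to build the shard's connection string, so a reply that fails to parse contributes no hosts at all. A healthy reply looks roughly like the following, with most fields omitted and this ticket's host names used purely for illustration:

TT-RS07:PRIMARY> db.runCommand({ isMaster: 1 })
{
        "setName" : "TT-RS07",
        "ismaster" : true,
        "hosts" : [
                "tdb-a07.white.xxx.com:4308",
                "tdb-m07.white.xxx.com:4308"
        ],
        "ok" : 1
}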
| Comment by Jing Wu [ 07/Feb/17 ] |
Kal, Can you look at the uploaded logs and let me know what you find? Thanks, Jing
| Comment by Jing Wu [ 07/Feb/17 ] |
Kal, all those updates happened only on the a32 host; the rest of the mongos hosts have no messages matching "changing hosts to" at all. I just looked at the mongos log. If it detected an issue where a certain shard was unavailable and updated the metadata, why didn't it update the value back? Please see the attached mongos log. Thanks, Jing
| Comment by Jing Wu [ 07/Feb/17 ] |
Kal, I asked the SA to check the mongos logs before. I just searched myself today and did find the metadata update messages in one of the mongos logs:
2017-01-20T20:31:41.098-0500 [ReplicaSetMonitorWatcher] changing hosts to TT-RS07/ from TT-RS07/tdb-a07.white.xxx.com:4308,tdb-m07.white.xxx.com:4308
Thanks, Jing
| Comment by Jing Wu [ 06/Feb/17 ] |
Kal, The logging for error messages has changed for V3.0.14. What messages should I search for to find the metadata updates? Will the metadata update messages show up in one of the mongos logs or in all of them? I am in the process of upgrading to V3.2; I have to go to V3.0.14 first, then convert to the WT engine, then V3.2. It takes time for a production system. Thanks, Jing
| Comment by Kaloian Manassiev [ 06/Feb/17 ] |
I am unable to pinpoint a specific piece of code or see a bug which would cause a replica set to return an empty list of nodes. Also, I do not see any shard metadata updates happening in the log which you attached. In MongoDB version 3.2 we significantly improved the replica set monitoring code, fixed race conditions, and added better diagnostics. Therefore, I strongly recommend that you upgrade to the latest version of MongoDB 3.2. Would you please upgrade to MongoDB 3.2.12 and confirm whether this issue is resolved? Best regards,
| Comment by Jing Wu [ 05/Feb/17 ] |
| Comment by Jing Wu [ 05/Feb/17 ] |
Hi Kal, Current mongo version is V3.0.14. Thanks, Jing
| Comment by Jing Wu [ 04/Feb/17 ] |
Kal, The first time it was shard07, this time it is shard13. There are over 100 mongos for this cluster; please see the attached file from one of the mongos logs. You can see the error msg from 1/20 when shard07 was removed and from 2/2 when shard13 was removed. Thanks, Jing
| Comment by Kaloian Manassiev [ 03/Feb/17 ] |
Apologies for the delayed reply. Is it always the same shard which is experiencing this problem (TT-RS07)? Have you confirmed with your SA that as part of some maintenance they are not manually modifying the config.shards collection, or that you don't have some rogue process doing that? If you are certain that the collection is not being modified outside of the sharding system, can you please attach the mongos log files from the time when the above assertion started showing up and also the logs from the mongods which comprise shard TT-RS07? Thanks in advance. -Kal.
| Comment by Jing Wu [ 03/Feb/17 ] |
Can someone take a look again to see why it's happening? Thanks, Jing
| Comment by Jing Wu [ 03/Feb/17 ] |
2017-02-02T23:41:36.715-0500 I NETWORK [Balancer] scoped connection to itsconfigdb-m02.db.aol.com:4307,itsconfigdb-d01.db.aol.com:4307,itsconfigdb-m01.db.aol.com:4307 not being returned to the pool , retryNext: false, init: false, finish: false, errored: false } :: caused by :: 9 Empty host component parsing HostAndPort from "" , {"b":"400000","o":"63A8A9","s":"_ZN5mongo10logContextEPKc"}, {"b":"400000","o":"61E22A","s":"_ZN5mongo12verifyFailedEPKcS1_j"}, {"b":"400000","o":"56A831","s":"_ZNK5mongo21DBClientShardResolver15chooseWriteHostERKSsPNS_16ConnectionStringE"}, {"b":"400000","o":"5E0ED1","s":"_ZN5mongo14BatchWriteExec12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseEPNS_19BatchWriteExecStatsE"}, {"b":"400000","o":"4FF15D","s":"_ZN5mongo13ClusterWriter10shardWriteERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE"}, {"b":"400000","o":"5005CD","s":"_ZN5mongo13ClusterWriter5writeERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE"}, {"b":"400000","o":"508EB3","s":"_ZN5mongo15ClusterWriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb"}, {"b":"400000","o":"59DE61","s":"_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb"}, {"b":"400000","o":"5200B7","s":"_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi"}, {"b":"400000","o":"5B09CC","s":"_ZN5mongo8Strategy15clientCommandOpERNS_7RequestE"}, {"b":"400000","o":"59D40D","s":"_ZN5mongo7Request7processEi"}, {"b":"400000","o":"1E475A","s":"_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE"}, {"b":"400000","o":"646F49","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"}, {"b":"3326600000","o":"7AA1"}, {"b":"3326200000","o":"E8AAD","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.0.14", "gitVersion" : "08352afcca24bfc145240a0fac9d28b978ab77f3", "uname" : { "sysname" : "Linux", "release" : "2.6.32-642.13.1.el6.x86_64", "version" : "#1 SMP Wed Jan 11 20:56:24 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "E2E4D3C2B1703CEF484CCFE185BDA8AB474CCF52" }, { "b" : "7FFEC626E000", "elfType" : 3, "buildId" : "B7D50C2161B6674BD26DB5BA69B12A619488BBC5" }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "B3BD4C596D72FCBE4607C86FEEC14F47B46D0DCC" }, { "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "7CE5058BE3F2AE6EBC0DF61E790C3FCED1A34C7B" }, { "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "096F60DC43484ECA364206125CED7D563FF9D027" }, { "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "86FE5BC1F46B8F8AA9A7A479FF991900DB93F720" }, { "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "045D39F19533291EBD72D0EE0247F9D49BE2521E" }, { "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "B60EF3FCE5D4D2D8BAD2585D5CAAA1167B35DBFD" }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "9A6E4BDFA184364D81F7DFD789474C3FB8F98A00" }, { "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "F14F3F5B55FC325FDFA7FB6F22F73CA1C6977379" }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "DB04BDF55FE66BADD1081D833C17B4881E0EEDB9" }, { "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "9B852585C66329AA02EFB28497E652A40F538E78" }, { "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "EF3AACAFD6BF71BB861F194C1559153FB0B020E2" }, { "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" 
: "152E2C18A7A2145021A8A879A01A82EE134E3946" }, { "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "DDE6774979156442185836150FC0785170F8001F" }, { "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "A23DAFBCE170763BF1E836A8B26113F9CD20E0DA" }, { "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "DDF6449707FD4C49DDE32A293EEE9AC218BFC460" }, { "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "B4576BE308DDCF7BC31F7304E4734C3D846D0236" } ] }} | ||||||||||||||||||||||||
| Comment by Jing Wu [ 03/Feb/17 ] |
I upgraded V2.6.12 to V3.0.14 on Monday. Today the same issue happened again; shard13 was missing its host:port without reason:
{ "_id" : "TT-RS12", "host" : "ITT-RS12/tdb-a12.white.aol.com:4308,tdb-m12.white.xxx.com:4308" }
{ "_id" : "TT-RS13", "host" : "TT-RS13/" }
{ "_id" : "TT-RS14", "host" : "ITT-RS14/tdb-a14.white.aol.com:4308,tdb-m13.white.xxx.com:4308" }
| Comment by Jing Wu [ 25/Jan/17 ] |
Thanks Kal, I asked the SA to search all the mongos he owns; he did not find any instances of the word "changing" in any of the mongos logs. We weren't aware of any network issue at that time. Each shard has 3 nodes which are in different availability zones/datacenters. I just looked at the mongod logs; there were no heartbeat failure messages among the 3 nodes in shard07 either. Jing
| Comment by Kaloian Manassiev [ 25/Jan/17 ] |
There is a component running on each mongos node which monitors the state of the shards' replica sets, and whenever it notices a change, it writes the new connection string to config.shards. It is possible that there was a period during which none of the nodes for TT-RS07 were accessible, and that's why it wrote an empty set of nodes. There is logging around these changes, which should start with this message: changing hosts to <new hosts> from <old hosts>. Would it be possible to grep through the logs on all your mongos and shards and see if such a message appears for TT-RS07? If you find it, do you mind attaching the relevant log file? Thanks in advance. Best regards,
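As an editorial aside (not part of the original comment), a quick way to spot shard entries whose connection string has lost its host list is to query config.shards directly from a mongos; this is only an illustrative check under the assumption that the empty entries always end in a bare "/":

mongos> use config
mongos> db.shards.find({ host: /\/$/ })   // matches connection strings like "TT-RS07/" with nothing after the slash
{ "_id" : "TT-RS07", "host" : "TT-RS07/" }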
| Comment by Jing Wu [ 23/Jan/17 ] |
Here are the messages from the mongos log when it happened:
2017-01-20T20:31:43.980-0500 [Balancer] Assertion: 13110:HostAndPort: host is empty