[SERVER-27864] SlaveOK reads against SCCC config server fail with error "all servers down/unreachable when querying" Created: 31/Jan/17  Updated: 06/Dec/17  Resolved: 14/Mar/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.10
Fix Version/s: 3.2.13

Type: Bug Priority: Major - P3
Reporter: Yoni Douek Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-25445 Write command explain on mirror confi... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2017-03-27
Participants:
Case:

 Description   

Performing 'SlaveOK' reads against an SCCC config server fails with the following error:

mongos> use config;
switched to db config
mongos> db.getMongo().setReadPref('secondaryPreferred')
mongos> db.chunks.count()
2017-02-03T14:26:42.603-0500 E QUERY    [thread1] Error: count failed: {
        "code" : 6,
        "ok" : 0,
        "errmsg" : "all servers down/unreachable when querying: kaloianmdesktop:20002,kaloianmdesktop:20003,kaloianmdesktop:20004"
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
DBQuery.prototype.count@src/mongo/shell/query.js:383:11
DBCollection.prototype.count@src/mongo/shell/collection.js:1700:12
@(shell):1:1

The reason is explained in this comment.

These are the errors which show up in the log:

2017-01-31T07:28:40.004+0000 I NETWORK  [conn4] query on admin.$cmd: { query: "1", help: 1 } failed to: in.dbcfg1.mydomain.com:27019 (10.0.97.139) exception: "query" had the wrong type. Expected Object, found String
2017-01-31T07:28:40.004+0000 I NETWORK  [conn4] query on admin.$cmd: { query: "1", help: 1 } failed to: in.dbcfg2.mydomain.com:27019 (10.0.97.140) exception: "query" had the wrong type. Expected Object, found String
2017-01-31T07:28:40.004+0000 I NETWORK  [conn4] query on admin.$cmd: { query: "1", help: 1 } failed to: in.dbcfg3.mydomain.com:27019 (10.0.97.141) exception: "query" had the wrong type. Expected Object, found String
2017-01-31T07:28:40.004+0000 W NETWORK  [conn4] db exception when initializing on config, current connection state is { state: { conn: "SyncClusterConnection  [in.dbcfg1.mydomain.com:27019 (10.0.97.139),in.dbcfg2.mydomain.com:27019 (10.0.97.140),in.dbcfg3.mydomain.com:27019 (10.0.97.141)]", vinfo: "config:in.dbcfg1.mydomain.com:27019,in.dbcfg2.mydomain.com:27019,in.dbcfg3.mydomain.com:27019", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 6 all servers down/unreachable when querying: in.dbcfg1.mydomain.com:27019,in.dbcfg2.mydomain.com:27019,in.dbcfg3.mydomain.com:27019
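The malformed `{ query: "1", help: 1 }` command in these logs is consistent with the fix commit's description ("Handle $cmd commands embedded in the query field properly in SCCC"): the legacy help probe was built as `{ <firstFieldName>: "1", help: 1 }`, so a slaveOK command wrapped in a `query` envelope produced a probe whose first field was literally `query`. The sketch below is a simplified reconstruction of that failure mode; the function names are illustrative, not the actual server internals.

```javascript
// Hypothetical sketch of the SCCC help-probe construction (names are
// illustrative; the mechanism is inferred from the log lines above and
// the commit message of 73786ec). The legacy code built its probe as
// { <commandName>: "1", help: 1 }, taking the command name from the
// first field of the command object.
function buildHelpProbe(cmdObj) {
  const commandName = Object.keys(cmdObj)[0];
  return { [commandName]: "1", help: 1 };
}

// A plain command produces a sensible probe, e.g. { count: "1", help: 1 }:
const plain = buildHelpProbe({ count: "chunks" });

// But a slaveOK read arrives wrapped in a "query" envelope, so the first
// field is "query" and the probe becomes { query: "1", help: 1 } -- the
// exact malformed command in the logs ("query" must be an Object):
const wrapped = buildHelpProbe({
  query: { count: "chunks" },
  $readPreference: { mode: "secondaryPreferred" },
});

// The fix unwraps the envelope before probing (sketch):
function unwrapCommand(cmdObj) {
  if (cmdObj.query !== null && typeof cmdObj.query === "object") return cmdObj.query;
  if (cmdObj.$query !== null && typeof cmdObj.$query === "object") return cmdObj.$query;
  return cmdObj;
}
const fixed = buildHelpProbe(unwrapCommand({
  query: { count: "chunks" },
  $readPreference: { mode: "secondaryPreferred" },
}));
```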



 Comments   
Comment by Githook User [ 14/Mar/17 ]

Author:

{u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}

Message: SERVER-27864 Handle $cmd commands embedded in the query field properly in SCCC
Branch: v3.2
https://github.com/mongodb/mongo/commit/73786ec28461d4ae0785859bea014f10caf0e159

Comment by Kaloian Manassiev [ 15/Feb/17 ]

Hi glajchs,

The deprecation of SCCC in 3.2 is mentioned in the release notes for sharding.

However, the bug which prevents SlaveOK reads against the config/admin databases appears to have been introduced in 3.2: I just checked, and it is not reproducible in 3.0. I am going to reopen this ticket so we can evaluate whether there is a simpler fix than the one for SERVER-25445.

Thank you for bringing this to our attention and sorry for the inconvenience.

Best regards,
-Kal.

Comment by Scott Glajch [ 14/Feb/17 ]

This seems like a significant gap in functionality when running 3.2 with SCCC. Since the 3.0 -> 3.2 upgrade procedure has you running in SCCC mode throughout, with an optional upgrade to CSRS afterward, this is very relevant information for the upgrade docs, which currently omit it.
Nowhere in the 3.0 -> 3.2 upgrade docs does it mention that SCCC is deprecated, or that functionality will be missing until you upgrade to CSRS.

We are in the middle of upgrading to 3.2 and have just noticed that the parts of our software that query the config/admin databases are broken, which in turn breaks further code paths of ours.
Please add some warnings to the 3.2 upgrade documentation and be very explicit about what will not function without the CSRS upgrade.

Or fix this code path, but it sounds like that might be off the table due to refactoring.

Comment by Kaloian Manassiev [ 03/Feb/17 ]

Thanks yonido for figuring out the command. It appears that slaveOK reads against the SCCC config server also lead to an incorrect help command being constructed.

We looked into the possibility of fixing it by introducing extra parsing logic exclusively for SCCC, but the change is not trivial and there is a risk of missing some corner cases. Given that this has relatively low impact, and because SCCC is deprecated in 3.2 and completely removed in 3.4, we won't be fixing it.

Once again I would like to urge you to upgrade to CSRS, which treats shards and the config server uniformly; with it you will not experience this error.

Best regards,
-Kal.

Comment by Yoni Douek [ 02/Feb/17 ]

Thanks for the input Kal, which helped us find the root cause.
We are using InfluxDB's Telegraf to gather MongoDB stats. It issues a count query to count the number of jumbo chunks, using the secondary read preference - and this results in the error.

So please note that this is not limited to "explain" only; you can easily reproduce it by running:

db.getMongo().setReadPref('secondaryPreferred')
db.chunks.count()

Comment by Kaloian Manassiev [ 01/Feb/17 ]

To give you some context: the code path which results in this error is specific to SCCC and is not present in CSRS. It runs "help" for each command it hasn't seen before and, based on the "lockType" field returned by the help command, concludes whether the command is a write or a read. If a command is deemed a "write" it must be sent to all 3 nodes; if it is a "read", one node is sufficient. For example:

> db.adminCommand({ createIndexes: 'TestDB', help: 1 });
{
	"help" : "help for: createIndexes no help defined",
	"lockType" : 0,
	"ok" : 1
}

As you can imagine, this is very error-prone, and from SERVER-25445 it is evident that sometimes the help command is not constructed correctly, causing a confusing message to be returned. Our understanding was that only the "explain" path against a config server experiences it, which is not severe enough to backport a fix for 3.2. This problem will not exist after upgrading to CSRS, because the code path will simply not be executed.
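The classification described above can be sketched as follows. This is a simplified simulation, not the server's code: the function names are hypothetical, and the convention that a positive `lockType` means "write" is an assumption for illustration, not taken from the server source.

```javascript
// Illustrative sketch of the SCCC read/write classification described
// above (hypothetical names; "lockType > 0 means write" is an assumed
// convention). SCCC ran { <cmd>: "1", help: 1 } once per unknown
// command and cached the verdict.
const lockTypeCache = new Map();

function isWriteCommand(commandName, runHelp) {
  if (!lockTypeCache.has(commandName)) {
    const reply = runHelp(commandName); // e.g. { help: "...", lockType: 0, ok: 1 }
    lockTypeCache.set(commandName, reply.lockType > 0);
  }
  return lockTypeCache.get(commandName);
}

// A "write" must be sent to all three SCCC members; a "read" to one.
function targetsFor(commandName, members, runHelp) {
  return isWriteCommand(commandName, runHelp) ? members : [members[0]];
}

const members = ["cfg1:27019", "cfg2:27019", "cfg3:27019"];
// Stand-in for the real help round-trip:
const fakeHelp = (name) => ({ lockType: name === "insert" ? 1 : 0, ok: 1 });

const readTargets = targetsFor("count", members, fakeHelp);   // one node
const writeTargets = targetsFor("insert", members, fakeHelp); // all three
```

A probe whose reply lacks (or mangles) `lockType`, or a probe that never gets a valid reply at all, leaves the connection unable to classify the command, which is how the confusing "all servers down/unreachable" error surfaces.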

> Our mongos became unavailable spontaneously. In addition, the error in SERVER-25445 seems different.

Something must have changed on your client's side, which is now sending a different command than it used to. The "help" command I am describing above is constructed purely from the command input; there is no persistent state associated with how it is constructed.

Would it be possible to temporarily bump the sharding verbosity level to 3 on one of the affected mongos hosts using this command and upload the log file again?

db.adminCommand({ setParameter: 1, logComponentVerbosity: { network: 3 } });

Thanks in advance.

-Kal.

Comment by Yoni Douek [ 01/Feb/17 ]

Thanks for the reply. However, this is far from reassuring:

1. We did not run any explain() command. Our mongos became unavailable spontaneously. In addition, the error in SERVER-25445 seems different.

2. "Decided not to fix" - don't you think that an issue causing an irreversible, spontaneous cluster crash is worth fixing? Your docs mention that mirrored config servers are deprecated in 3.4 - but what about clusters running 3.2, which is still officially supported? This approach is unacceptable.

3. Whether we decide to upgrade to CSRS or not (which requires thorough testing on our end), we would like to know:

  • How come this state is irreversible? Where is the data that causes this error stored for a specific mongos, and how can we modify it?
  • Upgrading to CSRS may prevent this issue, but we would first want to recover the mongos that are in the erroneous state. What should we do to get them working again? Upgrading to CSRS while in this zombie state seems like a wrong move.

Comment by Kaloian Manassiev [ 31/Jan/17 ]

Hi yonido,

The error that you are seeing happens when you run explain on a collection in the config or admin databases, which reside on the config server. It can only happen when you are running the legacy SCCC config server setup.

It has been reported previously (SERVER-25445), but because SCCC is deprecated, and because it only happens for explain on the config/admin databases, we decided not to fix it.

To resolve this issue, we recommend upgrading to CSRS.

Best regards,
-Kal.

Comment by Yoni Douek [ 31/Jan/17 ]

Thanks for helping so quickly.

No mongodump whatsoever.

Logs attached. It started happening at 2017-01-29T12:00.

Not sure if it's related, but at that time we had a process running which tries to split jumbo chunks (unfortunately MongoDB's internal mechanisms don't do this reliably, so we have to do it ourselves by calling splitChunk). This has been running nightly for years with no issues, but maybe it is related somehow.

Again - the mongos server is still available in this state if you need anything from it.

Comment by Ramon Fernandez Marina [ 31/Jan/17 ]

Hi yonido, sorry this is happening on your deployment. As you've already found, this behavior was internally reported once before in TOOLS-1010, but we were not able to reproduce it. I guess the first question is whether you were using mongodump at the time, and if so, which version.

Can you please also upload the logs for one of the affected mongos and the config servers? I've created a private, secure upload portal so your logs are not public.

Thanks,
Ramón.

Generated at Thu Feb 08 04:16:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.