[SERVER-27864] SlaveOK reads against SCCC config server fail with error "all servers down/unreachable when querying" Created: 31/Jan/17 Updated: 06/Dec/17 Resolved: 14/Mar/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.2.10 |
| Fix Version/s: | 3.2.13 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Yoni Douek | Assignee: | Randolph Tan |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Sharding 2017-03-27 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Performing 'SlaveOK' reads against an SCCC config server fails with an "all servers down/unreachable when querying" error:
The reason is explained in this comment. These are the errors which show up in the log:
|
| Comments |
| Comment by Githook User [ 14/Mar/17 ] | |||||||
|
Author: Randolph Tan (renctan) <randolph@10gen.com>
Message:
| Comment by Kaloian Manassiev [ 15/Feb/17 ] | |||||||
|
Hi glajchs, The deprecation of SCCC in 3.2 is in the release notes for sharding. However, the bug which prevents SlaveOK reads against the config/admin databases seems to have been introduced in 3.2, because I just checked and it is not reproducible in 3.0. I am going to reopen this ticket so we can evaluate whether there is a simpler fix than that for
Thank you for bringing this to our attention, and sorry for the inconvenience. Best regards,
| Comment by Scott Glajch [ 14/Feb/17 ] | |||||||
|
This seems like a big gap in functionality when running 3.2 with SCCC, and since the 3.0 -> 3.2 upgrade procedure has you running in SCCC mode for all of it, and then optionally upgrading to CSRS afterwards, this is a very relevant thing to put into the upgrade docs, which it isn't. We are in the middle of upgrading to 3.2 and just now noticed that the parts of our software that query the config/admin databases are broken, which breaks further code paths we have. Or fix this code path, but it sounds like that might be off the table due to refactoring.
| Comment by Kaloian Manassiev [ 03/Feb/17 ] | |||||||
|
Thanks yonido for figuring out the command. It appears that slaveOK reads against the SCCC config server also cause an incorrect help command to be constructed. We looked into the possibility of fixing it by introducing extra parsing logic exclusively for SCCC, but the change is not trivial and there is a risk of missing some corner cases. Given that this has relatively low impact, and because SCCC is deprecated in 3.2 and completely removed in 3.4, we won't be fixing it. Once again I would like to urge you to upgrade to CSRS, which treats shards and the config server uniformly, so you will not experience this error. Best regards,
| Comment by Yoni Douek [ 02/Feb/17 ] | |||||||
|
Thanks for the input Kal, which helped us find the root cause. Please note that this is not limited to "explain" only; you can easily reproduce it by running this:
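(The reporter's original snippet is not preserved in this export; the following is only a hedged sketch of the kind of slaveOK read against the config database discussed in this thread, using an illustrative collection.)

```javascript
// Hedged sketch only -- not the reporter's original command. A slaveOK read
// against the config database through a mongos backed by SCCC config servers;
// on 3.2 this kind of read can fail with
// "all servers down/unreachable when querying".
db.getMongo().setSlaveOk();                  // mark this shell connection as slaveOK
var configDB = db.getSiblingDB("config");    // cluster metadata lives on the config servers
printjson(configDB.chunks.find().limit(1).toArray());
```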
| |||||||
| Comment by Kaloian Manassiev [ 01/Feb/17 ] | |||||||
|
To give you some context, the code path which results in this error is specific to SCCC and is not present in CSRS. It runs "help" for each command it hasn't heard about and, based on the "lockType" field returned by the help command, decides whether the command is a write or a read. If a command is deemed a "write" it must be sent to all 3 nodes; if it is a "read", only one is sufficient. For example:
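(The inline example is not preserved in this export; below is a hedged illustration of such a "help" probe from the mongo shell. The exact response shape and lockType values are assumptions.)

```javascript
// Hedged illustration of the "help" probe described above; the response
// fields and lockType values shown are assumptions, not the original example.
db.getSiblingDB("admin").runCommand({ serverStatus: 1, help: true });
// A response along the lines of:
//   { "help" : "help for: serverStatus ...", "lockType" : 0, "ok" : 1 }
// The SCCC code path inspects "lockType" to classify the command: a value it
// maps to "read" means a single config server is queried, while a value it
// maps to "write" means the command is sent to all three config servers.
```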
As you can imagine, this is very error prone and from
Something must have changed on your client's side, which is now sending a different command than it used to. The "help" command I am describing above is constructed purely from the command input, and there is no persistent state associated with how it is constructed. Would it be possible to temporarily bump the sharding verbosity level to 3 on one of the affected mongos hosts using this command and upload the log file again?
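(The linked command is not preserved in this export; as an assumption about what was referenced, a standard way to raise the sharding log verbosity on a mongos is:)

```javascript
// Hedged sketch -- the exact command referenced above is not preserved here.
// Raise the sharding component's log verbosity to 3 on the affected mongos:
db.adminCommand({
    setParameter: 1,
    logComponentVerbosity: { sharding: { verbosity: 3 } }
});
// Set it back to -1 afterwards so it inherits the global default verbosity again:
db.adminCommand({
    setParameter: 1,
    logComponentVerbosity: { sharding: { verbosity: -1 } }
});
```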
Thanks in advance. -Kal. | |||||||
| Comment by Yoni Douek [ 01/Feb/17 ] | |||||||
|
Thanks for the reply. However, this is far from reassuring:
1. We did not run any explain() command. Our mongos became unavailable spontaneously. In addition, the error in
2. "Decided not to fix" - don't you think an irreversible, spontaneous cluster crash is worth fixing? Your docs mention that mirrored config servers are deprecated in 3.4 - but what about clusters running 3.2, which is still officially supported? This approach is unacceptable.
3. Whether or not we decide to upgrade to CSRS (which requires thorough testing on our end), we would like to know:
| |||||||
| Comment by Kaloian Manassiev [ 31/Jan/17 ] | |||||||
|
Hi yonido, The error that you are seeing happens when you run explain on a collection in the config or admin databases, which reside on the config server. It can only happen when you are running the legacy SCCC config server setup. It has been reported previously ( To resolve this issue, we recommend upgrading to CSRS. Best regards, | |||||||
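(As a hedged illustration of the pattern described above -- not taken from the ticket itself, and with an illustrative collection name:)

```javascript
// Hedged illustration of the failing pattern described above: an explain()
// on a config-database collection issued through a mongos that uses the
// legacy SCCC config server setup. The collection chosen here is illustrative.
printjson(db.getSiblingDB("config").collections.find().explain());
```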
| Comment by Yoni Douek [ 31/Jan/17 ] | |||||||
|
Thanks for helping so quickly. No mongodump whatsoever. Logs attached. It started happening at 2017-01-29T12:00. Not sure it's related, but around that time we have a process which tries to split jumbo chunks (unfortunately the inner mechanisms of MongoDB don't do this reliably, so we have to do it ourselves by calling splitChunk, sketched below). This has been running nightly for years with no issues, but maybe it is related somehow. Again - the mongos server is still available in this state if you need anything from it.
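(A hedged sketch of manually splitting a chunk from the mongo shell; the namespace and split point are hypothetical, and the reporter's job may call the lower-level splitChunk command directly rather than the shell helper shown here.)

```javascript
// Hedged sketch with a hypothetical namespace and shard-key value; from the
// shell, the usual way to split a chunk at a given shard-key value is sh.splitAt():
sh.splitAt("mydb.mycoll", { myShardKey: 5000 });
```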
| Comment by Ramon Fernandez Marina [ 31/Jan/17 ] | |||||||
|
Hi yonido, sorry this is happening on your deployment. As you've already found, this behavior was internally reported once before in
Can you please also upload the logs from one of the affected mongos and the config servers? I've created a private, secure upload portal so your logs are not public. Thanks,