[SERVER-23192] mongos and shards will become unusable if contact is lost with all CSRS config server nodes for more than 30 consecutive failed attempts to contact Created: 16/Mar/16 Updated: 20/Sep/18 Resolved: 01/Aug/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.2.4 |
| Fix Version/s: | 3.3.11 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Misha Tyulenev |
| Resolution: | Done | Votes: | 5 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Minor Change |
| Operating System: | ALL |
| Steps To Reproduce: | See the comment in jstests/sharding/startup_with_all_configs_down.js. |
| Sprint: | Sharding 16 (06/24/16), Sharding 18 (08/05/16) |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 0 |
| Description |
|
Issue Status as of Oct 07, 2016
ISSUE DESCRIPTION AND IMPACT
DIAGNOSIS AND AFFECTED VERSIONS
Operations that require access to config server metadata will begin failing with the following error:
REMEDIATION AND WORKAROUNDS
This issue has been fixed in MongoDB 3.4.0 and 3.2.10:
On versions prior to MongoDB 3.2.10, this issue can be avoided by executing the following command at runtime on all mongos and mongod nodes:
Please note that this parameter does not persist and must be set each time the node restarts.
Original description
If mongos loses network contact with all nodes from the CSRS config server set (both primary and secondaries), the replica set monitor will deem this set 'unusable' and will stop monitoring it. From this point onward, all operations that need to access config server metadata will begin failing with the following error:
This includes the refresh of the list of shards, which needs to be read from the config server metadata. There is currently no procedure to restart or retry monitoring of the config server set, so the only recourse is to restart mongos. |
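The failure mode described above can be illustrated with a small toy model. This is only a sketch, not MongoDB's actual replica set monitor: the class name `BoundedMonitor`, the `simulate` helper, and the exact boundary (giving up once the failure count exceeds 30) are assumptions made for illustration, based on the "more than 30 consecutive failed attempts" wording in the title.

```python
class BoundedMonitor:
    """Toy model of a monitor that stops checking after too many failures.

    Hypothetical illustration only; names and the exact threshold
    semantics are assumptions, not MongoDB internals.
    """

    def __init__(self, max_failed_checks=30):
        self.max_failed_checks = max_failed_checks
        self.failed_checks = 0
        self.monitoring = True

    def check(self, reachable):
        """Run one monitoring round; return True if the set looks usable."""
        if not self.monitoring:
            # The monitor has given up, so a later recovery is never noticed.
            return False
        if reachable:
            self.failed_checks = 0
            return True
        self.failed_checks += 1
        if self.failed_checks > self.max_failed_checks:
            # Deem the set unusable and stop monitoring it for good.
            self.monitoring = False
        return False


def simulate(outage_rounds, max_failed_checks=30):
    """Fail `outage_rounds` checks in a row, then bring the set back.

    Returns True if the monitor notices the recovery.
    """
    monitor = BoundedMonitor(max_failed_checks)
    for _ in range(outage_rounds):
        monitor.check(reachable=False)
    return monitor.check(reachable=True)
```

In this model a short outage is survivable, but once the failure count passes the limit the monitor never recovers without being recreated, which corresponds to restarting the process.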
| Comments |
| Comment by Kay Kim (Inactive) [ 19/Jul/17 ] | |||
|
Thanks much. Will finally doc this ticket all 9000 years after the fact | |||
| Comment by Misha Tyulenev [ 19/Jul/17 ] | |||
|
kay.kim IMO "config databases" can be misinterpreted; it may be better to say "CSRS config server nodes" | |||
| Comment by Misha Tyulenev [ 09/Dec/16 ] | |||
|
akira.kurogane, Thanks for the update. | |||
| Comment by Spencer Brody (Inactive) [ 26/Aug/16 ] | |||
|
We decided that backporting this to 3.2 is not feasible due to how much the code has changed since then, and decided to do | |||
| Comment by Spencer Brody (Inactive) [ 02/Aug/16 ] | |||
|
Adding to "Needs Triage" specifically to discuss if we need to do a separate, smaller-scope fix for 3.2 | |||
| Comment by Misha Tyulenev [ 01/Aug/16 ] | |||
|
The change has removed the replMonitorMaxFailedChecks parameter as it's not applicable anymore. The fix makes replica state monitoring unbounded. | |||
| Comment by Githook User [ 01/Aug/16 ] | |||
|
Author: Misha Tyulenev (username: mikety, email: misha@mongodb.com) Message: | |||
| Comment by Githook User [ 25/Jul/16 ] | |||
|
Author: Spencer T Brody (username: stbrody, email: spencer@mongodb.com) Message: | |||
| Comment by Kaloian Manassiev [ 15/Jul/16 ] | |||
|
This is correct. Thank you vlad.vintila@hootsuite.com for pointing out the incomplete title. We have updated the title to be more descriptive. | |||
| Comment by Vlad Vintila [ 15/Jul/16 ] | |||
|
So just to reiterate here, mongos AND mongod have this issue, and they all need to be restarted. The title should be updated. We've had all config servers down for more than 30s, and restarting mongos resulted in writes being accepted, but reads were still giving the error. Reads started working once we restarted ALL mongod servers (shards) of our affected cluster. | |||
| Comment by riccardo salzer [ 04/May/16 ] | |||
|
Today we had a connection problem between two datacenters, and one shard, which had no config server in the same datacenter, became unavailable for read and write requests.
mongos error message during insert:
Even after the connection came back 30 minutes later, mongod still considered the config servers unavailable. | |||
| Comment by Githook User [ 26/Apr/16 ] | |||
|
Author: Kaloian Manassiev (username: kaloianm, email: kaloian.manassiev@mongodb.com) Message: While the config server is down, the shards' replica set monitor might | |||
| Comment by Kaloian Manassiev [ 17/Mar/16 ] | |||
|
Yes, this would be ideal, but just FYI - we won't be able to do it until we remove DBClientRS. Otherwise we will do duplicate monitoring. | |||
| Comment by Spencer Brody (Inactive) [ 16/Mar/16 ] | |||
|
I feel like the long-term fix is to tie the lifetime of the replica set monitor to the lifetime of the Shard existing in the ShardRegistry. Something to consider for the ShardRegistry refactoring. FYI misha.tyulenev | |||
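As a rough illustration of tying the monitor's lifetime to the shard entry, the sketch below has a registry own one monitor per shard, creating it when the shard is added and stopping it when the shard is removed. The classes here are hypothetical stand-ins written for this note; they share names with the concepts in the comment but do not reflect the real ShardRegistry or replica set monitor code.

```python
class Monitor:
    """Hypothetical stand-in for a replica set monitor."""

    def __init__(self, set_name):
        self.set_name = set_name
        self.running = True

    def stop(self):
        self.running = False


class ShardRegistry:
    """Toy registry that owns exactly one monitor per registered shard."""

    def __init__(self):
        self._monitors = {}

    def add_shard(self, shard_id, set_name):
        # The monitor is created together with the shard entry.
        if shard_id not in self._monitors:
            self._monitors[shard_id] = Monitor(set_name)
        return self._monitors[shard_id]

    def remove_shard(self, shard_id):
        # Removing the shard is the only way its monitor stops.
        monitor = self._monitors.pop(shard_id, None)
        if monitor is not None:
            monitor.stop()

    def is_monitored(self, shard_id):
        monitor = self._monitors.get(shard_id)
        return monitor is not None and monitor.running
```

Under this design there is no failure counter at all: a set is monitored for exactly as long as the registry knows about it.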
| Comment by Kaloian Manassiev [ 16/Mar/16 ] | |||
|
One potential solution would be to extend the replica set monitor so it accepts a configurable upper bound on when to stop monitoring replica sets, and to special-case the CSRS replica set so we never give up on monitoring it. A similar problem exists for the shard hosts. However, since we periodically (every 10 seconds) refresh the ShardRegistry, these will eventually start being monitored again. |
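The proposal in this comment can be sketched as follows. This is a minimal model under stated assumptions, not the implemented fix: `SetMonitor`, `refresh_shard_monitors`, and the use of `None` to mean "never give up" are hypothetical names and conventions chosen for illustration.

```python
from typing import Dict, Iterable, Optional


class SetMonitor:
    """Toy monitor with a configurable failure bound (hypothetical)."""

    def __init__(self, set_name: str, max_failed_checks: Optional[int]) -> None:
        self.set_name = set_name
        # None is an assumed convention meaning "never give up" (CSRS case).
        self.max_failed_checks = max_failed_checks
        self.failed_checks = 0
        self.active = True

    def record_check(self, reachable: bool) -> None:
        if not self.active:
            return
        if reachable:
            self.failed_checks = 0
            return
        self.failed_checks += 1
        limit = self.max_failed_checks
        if limit is not None and self.failed_checks > limit:
            self.active = False


def refresh_shard_monitors(
    monitors: Dict[str, SetMonitor], shard_sets: Iterable[str], limit: int
) -> None:
    """Model the periodic registry refresh for shard hosts.

    Any shard set whose monitor is missing or has given up gets a new one.
    """
    for name in shard_sets:
        current = monitors.get(name)
        if current is None or not current.active:
            monitors[name] = SetMonitor(name, limit)
```

The config server set would be created with `SetMonitor("csrs", None)`, while shard sets keep a finite bound and rely on the periodic refresh to be picked up again.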