[SERVER-22620] Improve mongos handling of a very stale secondary config server Created: 16/Feb/16 Updated: 03/Jan/18 Resolved: 09/Aug/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Sharding |
| Affects Version/s: | 3.2.1 |
| Fix Version/s: | 3.3.11 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dmitry Ryabtsev | Assignee: | Misha Tyulenev |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | |
| Sprint: | Sharding 18 (08/05/16), Sharding 2016-08-29 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
This ticket is to improve sharding's handling of a very stale secondary config server (although it would apply to shards as well). The proposed solution is for the isMaster response to include the latest optime the node has replicated, so that the replica set monitor, in addition to selecting 'nearer' hosts, will also prefer those with the most recent optimes.

The same problem appears with fsync-locked secondaries: mongos is unable to work properly if one of the config server replica set's secondaries is locked with db.fsyncLock(). Running write concern / read concern operations directly against the replica set while a secondary is locked that way works fine, so the problem appears to lie in mongos itself. |
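To illustrate the client-side half of the proposal, below is a minimal, standalone C++ sketch of staleness-aware host selection. It is not the actual ReplicaSetMonitor code; the HostDescription fields, the latency window, and the maxStalenessSecs threshold are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical view of a replica set member as tracked by a monitor.
struct HostDescription {
    std::string hostAndPort;
    bool isUp;
    int64_t pingMicros;      // measured round-trip latency
    int64_t lastOpTimeSecs;  // latest optime the member reports having replicated
};

// Select members that are both "near" (within a latency window of the fastest
// member) and "fresh" (within maxStalenessSecs of the most up-to-date member).
std::vector<std::string> selectNearAndFreshHosts(const std::vector<HostDescription>& hosts,
                                                 int64_t latencyWindowMicros,
                                                 int64_t maxStalenessSecs) {
    int64_t newestOpTime = 0;
    int64_t bestPing = std::numeric_limits<int64_t>::max();
    for (const auto& h : hosts) {
        if (!h.isUp)
            continue;
        newestOpTime = std::max(newestOpTime, h.lastOpTimeSecs);
        bestPing = std::min(bestPing, h.pingMicros);
    }

    std::vector<std::string> selected;
    for (const auto& h : hosts) {
        if (!h.isUp)
            continue;
        const bool nearEnough = (h.pingMicros - bestPing) <= latencyWindowMicros;
        const bool freshEnough = (newestOpTime - h.lastOpTimeSecs) <= maxStalenessSecs;
        if (nearEnough && freshEnough)
            selected.push_back(h.hostAndPort);
    }
    return selected;
}
```

With a filter like this, an fsync-locked or otherwise stalled secondary whose optime stops advancing drops out of the candidate set instead of being chosen purely for its low ping time.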
| Comments |
| Comment by Ramon Fernandez Marina [ 14/Apr/17 ] |
|
venkata.surapaneni@elastica.co, this ticket has not been considered for backporting to v3.2. If this is an issue for you I'd suggest you consider an upgrade to MongoDB 3.4, which does contain a fix for this problem. |
| Comment by VenkataRamaRao Surapaneni [ 13/Apr/17 ] |
|
Is this issue fixed in the 3.2.12 release? |
| Comment by Pooja Gupta (Inactive) [ 03/Apr/17 ] |
|
misha.tyulenev, I believe this fix has been included in MongoDB 3.4. Has it been backported to version 3.2 as well? |
| Comment by Githook User [ 09/Aug/16 ] |
|
Author: Misha Tyulenev (mikety) <misha@mongodb.com>
Message: |
| Comment by Kaloian Manassiev [ 18/Feb/16 ] |
|
I am re-purposing this ticket to cover our handling of very stale secondary config servers (although it would apply to shards as well). The proposed solution is for the isMaster response to include the latest optime the node has replicated, so that the replica set monitor, in addition to selecting 'nearer' hosts, will also prefer those with the most recent optimes. |
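A rough sketch of the server-side half of this proposal is shown below. The field names and JSON shape (a lastWrite sub-document with opTime and lastWriteDate) and the set name are assumptions for illustration, not the exact wire format that shipped.

```cpp
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical, simplified isMaster reply carrying the node's latest applied optime.
struct IsMasterReply {
    bool ismaster;
    bool secondary;
    std::string setName;
    int64_t lastWriteOpTimeSecs;   // latest optime this node has replicated
    int64_t lastWriteDateMillis;   // wall-clock time of that write
};

std::string toJson(const IsMasterReply& r) {
    std::ostringstream os;
    os << "{ \"ismaster\": " << (r.ismaster ? "true" : "false")
       << ", \"secondary\": " << (r.secondary ? "true" : "false")
       << ", \"setName\": \"" << r.setName << "\""
       << ", \"lastWrite\": { \"opTime\": " << r.lastWriteOpTimeSecs
       << ", \"lastWriteDate\": " << r.lastWriteDateMillis << " } }";
    return os.str();
}

int main() {
    // A stale config server secondary advertises an old optime; a monitor that
    // compares this value across members can deprioritize the node.
    IsMasterReply staleSecondary{false, true, "csReplSet", 1455580800, 1455580800000};
    std::cout << toJson(staleSecondary) << "\n";
    return 0;
}
```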
| Comment by Kaloian Manassiev [ 16/Feb/16 ] |
|
At the very least we should make the ShardRegistry mark hosts where operations time out as faulty. This will ensure that on a second attempt of the operation, the fsync-locked host will not be contacted again. However, what Matt suggests is a better solution, even though it's more involved: we can return FailedToSatisfyReadPreference if the read concern cannot be satisfied, just before we begin waiting in replication_coordinator_impl.cpp. |
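A standalone sketch of the "mark timed-out hosts as faulty" idea follows; the class name and quarantine interval are assumptions for illustration and do not mirror the real ShardRegistry API.

```cpp
#include <chrono>
#include <map>
#include <string>

// Illustrative tracker: hosts whose operations timed out are quarantined so the
// next retry does not contact the same fsync-locked or unresponsive member again.
class FaultyHostTracker {
public:
    using Clock = std::chrono::steady_clock;

    // Record a host that failed with an operation timeout.
    void markFaulty(const std::string& hostAndPort) {
        _faultyUntil[hostAndPort] = Clock::now() + _quarantine;
    }

    // A quarantined host becomes eligible again after the interval expires.
    bool isEligible(const std::string& hostAndPort) {
        auto it = _faultyUntil.find(hostAndPort);
        if (it == _faultyUntil.end())
            return true;
        if (Clock::now() >= it->second) {
            _faultyUntil.erase(it);
            return true;
        }
        return false;
    }

private:
    const std::chrono::seconds _quarantine{30};
    std::map<std::string, Clock::time_point> _faultyUntil;
};
```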
| Comment by Matt Dannenberg [ 16/Feb/16 ] |
|
I believe what is happening here is that mongos is contacting the locked secondary, and that node is unable to satisfy the read concern and respond. Ideally mongos (or any other driver) would know the secondary is fsync-locked and not attempt to contact it. Maybe return an error (NotMasterOrSecondary? a new fsync-specific one?) if the node is queried with readConcern: majority while fsync-locked. Or maybe move into a new replSet state indicating the node is fsync-locked and should not be contacted in the first place. |
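For illustration only, here is a sketch of the up-front rejection described above; the error code chosen and the check's placement are assumptions, not the behavior that was ultimately implemented.

```cpp
#include <string>

// Illustrative error handling, loosely modeled on a Status/error-code pattern.
enum class ErrorCode { OK, NotMasterOrSecondary };

struct Status {
    ErrorCode code;
    std::string reason;
    bool isOK() const { return code == ErrorCode::OK; }
};

// Fail fast instead of blocking until the client times out: an fsync-locked node
// stops applying replication, so a readConcern "majority" read that must wait for
// a newer optime could otherwise block indefinitely.
Status checkCanServeMajorityRead(bool fsyncLocked, bool readConcernIsMajority) {
    if (fsyncLocked && readConcernIsMajority) {
        return {ErrorCode::NotMasterOrSecondary,
                "node is fsync-locked and cannot satisfy readConcern: majority"};
    }
    return {ErrorCode::OK, ""};
}
```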