[SERVER-26859] AsyncResultsMerger replica set retargeting may block the ASIO callback threads Created: 01/Nov/16 Updated: 11/Apr/17 Resolved: 08/Nov/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.2.10, 3.4.0-rc2 |
| Fix Version/s: | 3.2.11, 3.4.0-rc3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Randolph Tan |
| Resolution: | Done | Votes: | 0 |
| Labels: | code-and-test |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Sprint: | Sharding 2016-11-21 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
The AsyncResultsMerger performs retargeting when a network error or a replication NotMaster error occurs during initial cursor establishment. This retargeting is blocking and may run on an ASIO callback thread, which prevents that thread from processing other events, such as finishing connection establishment. As a result, connections unrelated to the request that triggered retargeting can be wrongly labeled as timed out, and requests ultimately fail with the error "ExceededTimeLimit: Operation timed out". The problem is exacerbated because ASIO discards the entire connection pool for a host that has timed-out connections, which forces new connections to be opened. |
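The effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the ASIO or AsyncResultsMerger code: the function name `timed_out_events`, the event names, and all durations are hypothetical. It simulates a single callback thread that handles events in order, so one long blocking handler delays every event queued behind it.

```python
def timed_out_events(events):
    """Simulate one callback thread handling events strictly in order.

    Each event is (name, handler_cost_ms, deadline_ms). An event whose
    handler only starts after its deadline is reported as timed out.
    All names and numbers here are illustrative assumptions.
    """
    now_ms = 0
    late = []
    for name, handler_cost_ms, deadline_ms in events:
        if deadline_ms is not None and now_ms > deadline_ms:
            late.append(name)
        now_ms += handler_cost_ms  # the thread is busy for this long
    return late


# Blocking retargeting inline: the unrelated handshake misses its deadline.
blocking = [
    ("retarget request A", 500, None),
    ("finish handshake for connection B", 1, 100),
]
# Only signalling another thread: the handshake is handled in time.
non_blocking = [
    ("signal merger for request A", 1, None),
    ("finish handshake for connection B", 1, 100),
]

late_with_blocking = timed_out_events(blocking)
late_without_blocking = timed_out_events(non_blocking)
```

In this model, connection B is healthy in both runs. It is reported as timed out only because the callback thread was occupied by the retargeting for request A.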
| Comments |
| Comment by Githook User [ 08/Nov/16 ] |
|
Author: Randolph Tan (renctan, randolph@10gen.com)
Message: When the handleResponse callback encounters a retriable error, signal the merger thread to retry instead of rescheduling inline, since rescheduling involves re-evaluating the target host, which is a blocking operation. (cherry picked from commit 5b2134f4ae4ea2d70b0ce89041fd11fd7810e40d) Conflicts: |
| Comment by Githook User [ 08/Nov/16 ] |
|
Author: Randolph Tan (renctan, randolph@10gen.com)
Message: When the handleResponse callback encounters a retriable error, signal the merger thread to retry instead of rescheduling inline, since rescheduling involves re-evaluating the target host, which is a blocking operation. |
| Comment by Jon Hyman [ 06/Nov/16 ] |
|
Once this is backported, can you please release 3.2.11 asap? We're stuck dealing with segfaults ( |
| Comment by Kaloian Manassiev [ 01/Nov/16 ] |
|
Making the ReplicaSetMonitor asynchronous is a significant task, and performing this resolution on a separate thread could spawn an unbounded number of threads in the system. The least disruptive change would be to instead signal the AsyncResultsMerger's work-available event without returning any results, and have the user thread perform the blocking search for a host matching the read preference.
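
The approach proposed above can be sketched roughly as follows. This is a simplified illustration, not the actual server change: `MergerSketch`, `RetriableError`, `handle_response`, `next_ready`, `slow_find_host` and the host string are all hypothetical names chosen for the example. The callback only records that a retry is needed and signals the event. The blocking host lookup happens later, on the thread that consumes the results.

```python
import threading
import time


class RetriableError(Exception):
    """Hypothetical stand-in for a network or NotMaster error."""


class MergerSketch:
    def __init__(self, find_host):
        self._find_host = find_host      # blocking host lookup (assumed)
        self._lock = threading.Lock()
        self._event = threading.Event()  # models the "work available" event
        self._needs_retry = False
        self._results = []
        self.scheduled = []              # hosts the request was re-sent to
        self.retargeted_on = None        # name of the thread that retargeted

    def handle_response(self, response):
        """Runs on the network callback thread, so it must not block."""
        with self._lock:
            if isinstance(response, RetriableError):
                self._needs_retry = True  # no results, just a retry flag
            else:
                self._results.append(response)
        self._event.set()

    def next_ready(self, timeout=1.0):
        """Runs on the user thread, where blocking work is acceptable."""
        if not self._event.wait(timeout):
            return None
        with self._lock:
            self._event.clear()
            retry, self._needs_retry = self._needs_retry, False
            results, self._results = self._results, []
        if retry:
            self.retargeted_on = threading.current_thread().name
            self.scheduled.append(self._find_host())  # blocking call
        return results


def slow_find_host():
    time.sleep(0.05)  # pretend the read preference search blocks
    return "shard0-secondary:27017"


merger = MergerSketch(slow_find_host)
callback = threading.Thread(
    target=merger.handle_response,
    args=(RetriableError("NotMaster"),),
    name="asio-callback",
)
callback.start()
callback.join()
ready = merger.next_ready()
```

Because the retry runs inside `next_ready`, no extra thread is created per retargeting attempt, which avoids the unbounded thread growth mentioned above.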