[SERVER-21854] NetworkInterface ASIO skips timing out requests on OSX Created: 10/Dec/15 Updated: 08/Jan/24 Resolved: 14/Dec/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking |
| Affects Version/s: | 3.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Samantha Ritter (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
|||
| Backwards Compatibility: | Fully Compatible | |||
| Operating System: | ALL | |||
| Steps To Reproduce: | On an OSX build (doesn't matter if debug or release):
|
|||
| Sprint: | Platforms 10 (02/19/16), Platforms 12 (04/01/16), Platforms 13 (04/22/16), Platforms 14 (05/13/16), Platforms 15 (06/03/16), Platforms 16 (06/24/16), Platforms 17 (07/15/16), Platforms 18 (08/05/16), Platforms 2016-08-26, Platforms 2016-09-19, Platforms 2016-10-10, Platforms 2016-11-21, Platforms 2017-01-23 | |||
| Participants: |
| Description |
|
This bug only seems to be reproducible on OSX and is most easily visible when running the 'blackhole' test, where mongobridge drops all requests against the config servers. From inspecting the logs, it seems that occasionally network requests with timeout do not get timed out and just sit there. I am attaching the complete logs, but here is the relevant snippets (with verbosity 3 and some extra tracing of the request ids). From the snippet below it can be seen that request 151 never times out, even though it has an expiration time of 30 seconds after the request was scheduled (2015-12-10T16:29:59.426-0500):
|
| Comments |
| Comment by Samantha Ritter (Inactive) [ 14/Dec/16 ] | ||||
|
Confirmed that this was not a reproducible bug, thanks Kal. | ||||
| Comment by Kaloian Manassiev [ 14/Dec/16 ] | ||||
|
I also confirm that I have not been able to reproduce it using the repro steps above, so I am fine with closing it as 'Gone away'. | ||||
| Comment by Samantha Ritter (Inactive) [ 13/Dec/16 ] | ||||
|
I've been attempting to reproduce the error reported in this ticket, and I cannot. There are two "phases" during which a request run through startCommand can time out. In phase 1, "connect," we open and authenticate a connection on which to run the command. In phase 2, we run the actual thing passed into startCommand. From the logs, it is clear that we never entered phase 2 on operation 151, the one we claim didn't time out. The logging line that was added reports the original request id, which we do not have in phase 1. When we are in the connection pool, we run with a dummy AsyncOp that doesn't have a RemoteCommandRequest set on it. There would be no way of logging the request Id or the command from within phase 1, because we simply do not have access to it. So, the fact that you are logging "Operation timed out: RemoteCommand 148" tells me that you must be logging this from code in phase 2 (and I'm guessing that you put that log line into the callback for the timer that we call asyncWait on from within startCommand?) The logging that kaloian.manassiev added to trace this wasn't sufficient. In short, just because we haven't seen "Operation timed out: RemoteCommand 151" doesn't mean that request 151 didn't time out; it means it timed out in phase 1, and we didn't log there. Here is a more in-depth description, with supporting log lines: Request is passed into startCommand (it's a "find databases" command)
phase 1: connecting
phase 2: running (we don't get here)
I have also tried to reproduce this bug in other ways. We maintain two sets of operations, _inGetConnection (for the ones connecting in phase 1) and _inProgress (for the ones that have connected and gotten to phase 2). Working on 3.2, I added some code to the NetworkInterface's shutdown method to wait for those two queues to drain before shutting down. If we failed to time something out, its request would be waiting in one of those queues, and the server should have hung. After running this test over 3000 times against my modified server, I have not seen any evidence of tasks lingering in either queue and failing to time out. Since there isn't enough evidence here to confirm that this ever occurred, and I was not able to see it occur in many runs with a modified server, I am going to close this ticket as "gone away" | ||||
| Comment by Githook User [ 11/Dec/15 ] | ||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: Having identifier associated with each request allows us to trace requests This change doesn't fix the bug, but makes investigation easier. | ||||
| Comment by Githook User [ 11/Dec/15 ] | ||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: Having identifier associated with each request allows us to trace requests This change doesn't fix the bug, but makes investigation easier. |