[SERVER-21854] NetworkInterface ASIO skips timing out requests on OSX Created: 10/Dec/15  Updated: 08/Jan/24  Resolved: 14/Dec/16

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 3.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Samantha Ritter (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File tests.log.gz    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

On an OSX build (doesn't matter if debug or release):

scons --dbg=on --opt=off --allocator=system -j8 --osx-version-min=10.7 --libc++ --mute mongod mongos mongo mongobridge
 
python buildscripts/resmoke.py --repeat=50 --log=file --executor=sharding --storageEngine=wiredTiger jstests/sharding/all_config_servers_blackholed_from_mongos.js

Sprint: Platforms 10 (02/19/16), Platforms 12 (04/01/16), Platforms 13 (04/22/16), Platforms 14 (05/13/16), Platforms 15 (06/03/16), Platforms 16 (06/24/16), Platforms 17 (07/15/16), Platforms 18 (08/05/16), Platforms 2016-08-26, Platforms 2016-09-19, Platforms 2016-10-10, Platforms 2016-11-21, Platforms 2017-01-23
Participants:

Description

This bug seems to be reproducible only on OSX and is most easily visible when running the 'blackhole' test, where mongobridge drops all requests against the config servers.

From inspecting the logs, it appears that network requests with a timeout occasionally never time out and just sit there.

I am attaching the complete logs, but here are the relevant snippets (with verbosity 3 and some extra tracing of the request ids). The snippet below shows that request 151 never times out, even though its expiration time is 30 seconds after the request was scheduled (2015-12-10T16:29:59.426-0500):

[js_test:all_config_servers_blackholed_from_mongos] 2015-12-10T16:29:29.426-0500 s20020| 2015-12-10T16:29:29.425-0500 D SHARDING [conn2] command: NonExistentDB.$cmd { insert: "TestColl", documents: [ { _id: 0.0, value: "This value will never be inserted" } ], ordered: true } ntoreturn: 1 options: 0
[js_test:all_config_servers_blackholed_from_mongos] 2015-12-10T16:29:29.426-0500 s20020| 2015-12-10T16:29:29.426-0500 D EXECUTOR [conn2] Scheduling remote command request: RemoteCommand 151 -- target:kaloianm-mac.local:20016 db:config expDate:2015-12-10T16:29:59.426-0500 cmd:{ find: "databases", filter: { _id: "NonExistentDB" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1449782939000|15, t: 1 } }, limit: 1, maxTimeMS: 30000 }
[js_test:all_config_servers_blackholed_from_mongos] 2015-12-10T16:29:29.426-0500 s20020| 2015-12-10T16:29:29.426-0500 D ASIO     [conn2] startCommand: RemoteCommand 151 -- target:kaloianm-mac.local:20016 db:config expDate:2015-12-10T16:29:59.426-0500 cmd:{ find: "databases", filter: { _id: "NonExistentDB" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1449782939000|15, t: 1 } }, limit: 1, maxTimeMS: 30000 }
[js_test:all_config_servers_blackholed_from_mongos] 2015-12-10T16:29:29.426-0500 s20020| 2015-12-10T16:29:29.426-0500 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Connecting to kaloianm-mac.local:20016
[js_test:all_config_servers_blackholed_from_mongos] 2015-12-10T16:29:29.427-0500 b20016| 2015-12-10T16:29:29.427-0500 I NETWORK  [main] connection accepted from 10.4.126.94:65167 #15 (1 connection now open)
[js_test:all_config_servers_blackholed_from_mongos] 2015-12-10T16:29:29.428-0500 s20020| 2015-12-10T16:29:29.427-0500 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Starting asynchronous command 152 on host kaloianm-mac.local:20016
[js_test:all_config_servers_blackholed_from_mongos] 2015-12-10T16:29:29.428-0500 c20019| 2015-12-10T16:29:29.428-0500 I NETWORK  [initandlisten] connection accepted from 10.4.126.94:65168 #16 (13 connections now open)
DISCARDED --> [js_test:all_config_servers_blackholed_from_mongos] 2015-12-10T16:29:29.428-0500 b20016| 2015-12-10T16:29:29.428-0500 I BRIDGE   [thread6] Discarding "isMaster" command with arguments { isMaster: 1, hostInfo: "kaloianm-mac.local:20020" } from kaloianm-mac.local:20020
...
DIFFERENT TIMED OUT REQUEST --> [js_test:all_config_servers_blackholed_from_mongos] 2015-12-10T16:29:58.845-0500 s20020| 2015-12-10T16:29:58.845-0500 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Operation timed out: RemoteCommand 149 -- target:kaloianm-mac.local:20016 db:admin expDate:2015-12-10T16:29:58.838-0500 cmd:{ _getUserCacheGeneration: 1, maxTimeMS: 30000 }



Comments
Comment by Samantha Ritter (Inactive) [ 14/Dec/16 ]

Confirmed that this was not a reproducible bug, thanks Kal.

Comment by Kaloian Manassiev [ 14/Dec/16 ]

I also confirm that I have not been able to reproduce it using the repro steps above, so I am fine with closing it as 'Gone away'.

Comment by Samantha Ritter (Inactive) [ 13/Dec/16 ]

I've been attempting to reproduce the error reported in this ticket, and I cannot.

There are two "phases" during which a request run through startCommand can time out. In phase 1, "connecting," we open and authenticate a connection on which to run the command. In phase 2, "running," we execute the actual command passed into startCommand. From the logs, it is clear that we never entered phase 2 for operation 151, the one we claim didn't time out.

The logging line that was added reports the original request id, which we do not have in phase 1. While we are in the connection pool, we run with a dummy AsyncOp that doesn't have a RemoteCommandRequest set on it. There is no way to log the request id or the command from within phase 1, because we simply do not have access to them. So, the fact that you are logging "Operation timed out: RemoteCommand 148" tells me that you must be logging this from code in phase 2 (I'm guessing you put that log line into the callback for the timer that we call asyncWait on from within startCommand?)

The logging that kaloian.manassiev added to trace this wasn't sufficient. In short, just because we never saw "Operation timed out: RemoteCommand 151" doesn't mean that request 151 didn't time out; it means it timed out in phase 1, where we didn't log.

Here is a more in-depth description, with supporting log lines:

The request is passed into startCommand (it's a "find databases" command):

s20020| 2015-12-10T16:29:29.426-0500 D ASIO     [conn2] startCommand: RemoteCommand 151 -- target:kaloianm-mac.local:20016 db:config expDate:2015-12-10T16:29:59.426-0500 cmd:{ find: "databases", filter: { _id: "NonExistentDB" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1449782939000|15, t: 1 } }, limit: 1, maxTimeMS: 30000 }

phase 1: connecting

  • operation is stashed in _inGetConnection
  • connection pool opens a new connection

    s20020| 2015-12-10T16:29:29.426-0500 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Connecting to kaloianm-mac.local:20016
    

  • we send an 'ismaster' request as part of our auth process

    s20020| 2015-12-10T16:29:29.427-0500 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Starting asynchronous command 152 on host kaloianm-mac.local:20016
    

  • this ismaster is discarded by the remote host

    b20016| 2015-12-10T16:29:29.428-0500 I BRIDGE   [thread6] Discarding "isMaster" command with arguments { isMaster: 1, hostInfo: "kaloianm-mac.local:20020" } from kaloianm-mac.local:20020
    

phase 2: running (we don't get here)

  • start a timer for the new operation (I believe the logging was added here)
  • send and receive from the remote host as needed until complete or timed out
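
To make the phase-1 gap concrete, here is a minimal, self-contained C++ sketch of the flow described above. The names (AsyncOp's layout, armPhase2Timer) are illustrative stand-ins, not the actual NetworkInterfaceASIO code; the point is only that a timeout log line armed in phase 2 can never fire for an operation that dies while still connecting in phase 1:

    // Illustrative sketch only -- not the actual NetworkInterfaceASIO code.
    #include <functional>
    #include <iostream>
    #include <optional>
    #include <string>

    struct RemoteCommandRequest {
        long id;          // the request id added for tracing
        std::string cmd;
    };

    struct AsyncOp {
        // In phase 1 the connection pool runs a dummy AsyncOp: no request is
        // attached yet, so there is no request id available to log.
        std::optional<RemoteCommandRequest> request;
        std::function<void()> onTimeout;  // armed only in phase 2
    };

    // Phase 2 only: arm the timer callback that produces the
    // "Operation timed out: RemoteCommand <id>" line.
    void armPhase2Timer(AsyncOp& op) {
        op.onTimeout = [&op] {
            std::cout << "Operation timed out: RemoteCommand "
                      << op.request->id << "\n";
        };
    }

    int main() {
        AsyncOp connecting;  // phase 1: dummy op, no request attached
        // If `connecting` expires here, nothing is logged: there is no
        // request id to report, which is exactly the gap described above.

        AsyncOp running;     // phase 2: the real request is attached
        running.request = RemoteCommandRequest{151, "find databases"};
        armPhase2Timer(running);
        running.onTimeout();  // only now does the log line appear
    }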

I have also tried to reproduce this bug in other ways. We maintain two sets of operations, _inGetConnection (for the ones connecting in phase 1) and _inProgress (for the ones that have connected and gotten to phase 2). Working on 3.2, I added some code to the NetworkInterface's shutdown method to wait for those two queues to drain before shutting down. If we failed to time something out, its request would be waiting in one of those queues, and the server should have hung. After running this test over 3000 times against my modified server, I have not seen any evidence of tasks lingering in either queue and failing to time out.
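
For reference, a minimal sketch of that drain-on-shutdown check, with hypothetical member names (the real NetworkInterface keeps considerably more state per operation): shutdown blocks until both queues are empty, so a leaked operation that never times out would surface as a visible hang.

    // Illustrative sketch of the drain check; member names are assumptions.
    #include <condition_variable>
    #include <list>
    #include <mutex>

    class NetworkInterfaceSketch {
    public:
        // Blocks until both queues are empty. A leaked operation (one that
        // never completes and never times out) would make this hang, turning
        // the suspected bug into a server hang at shutdown.
        void shutdown() {
            std::unique_lock<std::mutex> lk(_mutex);
            _drainedCV.wait(lk, [this] {
                return _inGetConnection.empty() && _inProgress.empty();
            });
        }

        // Called whenever an operation completes or times out in either phase.
        void retire(int opId) {
            std::lock_guard<std::mutex> lk(_mutex);
            _inGetConnection.remove(opId);
            _inProgress.remove(opId);
            _drainedCV.notify_all();
        }

    private:
        std::mutex _mutex;
        std::condition_variable _drainedCV;
        std::list<int> _inGetConnection;  // ops still connecting (phase 1)
        std::list<int> _inProgress;       // ops running commands (phase 2)
    };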

Since there isn't enough evidence here to confirm that this ever occurred, and I was not able to see it occur in many runs against a modified server, I am going to close this ticket as "gone away".

Comment by Githook User [ 11/Dec/15 ]

Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>

Message: SERVER-21854 Add request id to RemoteCommandRequest

Having an identifier associated with each request allows us to trace requests
throughout the ASIO network threads.

This change doesn't fix the bug, but makes investigation easier.
Branch: v3.2
https://github.com/mongodb/mongo/commit/e7f86df75bca453b1e6684ced59bf79b388330de
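
For context, a hypothetical sketch of the kind of tracing id the commit message describes; the counter and field names below are illustrative assumptions, not the actual change. A process-wide atomic counter is stamped onto each request, so every log line handling the request can print the same "RemoteCommand <id>" tag across the ASIO network threads:

    // Illustrative only; field and counter names are assumptions.
    #include <atomic>
    #include <string>
    #include <utility>

    struct RemoteCommandRequest {
        RemoteCommandRequest(std::string target, std::string cmd)
            : id(_nextId.fetch_add(1)),
              target(std::move(target)),
              cmdObj(std::move(cmd)) {}

        const long long id;  // shows up in logs as "RemoteCommand <id>"
        std::string target;
        std::string cmdObj;

    private:
        static std::atomic<long long> _nextId;
    };

    std::atomic<long long> RemoteCommandRequest::_nextId{1};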

Comment by Githook User [ 11/Dec/15 ]

Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>

Message: SERVER-21854 Add request id to RemoteCommandRequest

Having an identifier associated with each request allows us to trace requests
throughout the ASIO network threads.

This change doesn't fix the bug, but makes investigation easier.
Branch: master
https://github.com/mongodb/mongo/commit/e57b47d8eaf8267881fe84da827f1f5b7409f341
