[CDRIVER-1956] Topology scanner's SSL handshake is blocking Created: 15/Dec/16 Updated: 15/Nov/18 Resolved: 24/Jan/18 |
|
| Status: | Closed |
| Project: | C Driver |
| Component/s: | tls |
| Affects Version/s: | None |
| Fix Version/s: | 1.10.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | A. Jesse Jiryu Davis | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Description |
|
The topology scanner should fan out to all servers and check them all concurrently using non-blocking I/O. However, our implementation of the TLS handshake blocks while it waits for the initial connection to complete and for the handshake to reach a certain step in the TLS protocol. This means that high-latency replica set members slow down the topology scanner more than expected.
Test: cd to the mongo-c-driver root dir and start a mongod:
Update mongoc_stream_tls_openssl_handshake:
Slow down the network, on Linux:
Or on Mac:
Ignore the warnings about "No ALTQ support in kernel", etc. The shell should now connect, slowly:
Now recompile and run a test:
Listing "localhost:27017" twice lets us see if the topology scanner begins both handshakes concurrently and then both succeed (as expected) or if it begins and completes one handshake, then the other handshake (the bug). In fact, this is what I see with OpenSSL 1.0.1f on Ubuntu 16.04:
There are two symptoms of the blocking handshake. First, we see one handshake begin, block for two seconds, then succeed, before the other begins. Second, we expect the function to print "handshake should retry" but it doesn't. This blocking behavior is even seen if we add this, although it blocks for a shorter duration:
I've made some attempts to fix this like so, with no effect:
There is a reference to a bug with this function and BIO_do_handshake from 15 years ago and another from 9 years ago that I do not believe can apply to OpenSSL 1.0.1. I also tried deleting this line from _mongoc_openssl_ctx_new:
Also no effect on the topology scanner. Return your network to normalcy, on Linux:
Or on Mac:
Update: this is now resolved for the OpenSSL and Windows Secure Channel implementations. The handshake is now parallelized by setting the timeout value to 0 during the handshake. Additionally, I've increased throughput with SChannel by increasing the initial receive buffer size, and perhaps fixed a latent bug in the previous code that could cause a deadlock if the buffer grew large enough to receive multiple SSL blocks. The Apple Secure Transport implementation still handshakes connections serially, but I'm not fixing this at the moment. Mac OS X is used for development of C Driver applications, not for production. |
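To make the intended fan-out concrete, here is a minimal sketch of one non-blocking pass over several in-progress handshakes. It is not the driver's code: `handshake_t`, `step_handshakes` and `MAX_HANDSHAKES` are hypothetical names, and a "handshake" is simulated as a fixed number of bytes that must arrive on a file descriptor. It assumes a POSIX system with `poll`.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_HANDSHAKES 16 /* hypothetical limit, for the sketch only */

/* Hypothetical stand-in for one server's in-progress handshake. */
typedef struct {
   int fd;        /* read side of the connection */
   size_t needed; /* bytes still expected before the handshake completes */
} handshake_t;

/* Make one non-blocking pass over all handshakes: read from every
 * descriptor that already has data, never wait on any single one.
 * Returns the number of handshakes that are still incomplete. */
size_t
step_handshakes (handshake_t *hs, size_t n)
{
   struct pollfd pfds[MAX_HANDSHAKES];
   char buf[64];
   size_t i;
   size_t pending = 0;
   int rc;

   assert (n <= MAX_HANDSHAKES);

   for (i = 0; i < n; i++) {
      /* poll() ignores entries whose fd is negative, so completed
       * handshakes are skipped. */
      pfds[i].fd = hs[i].needed > 0 ? hs[i].fd : -1;
      pfds[i].events = POLLIN;
      pfds[i].revents = 0;
   }

   /* A timeout of 0 means "report what is ready now and return". */
   rc = poll (pfds, (nfds_t) n, 0);

   for (i = 0; i < n; i++) {
      if (rc > 0 && hs[i].needed > 0 && (pfds[i].revents & POLLIN)) {
         size_t want = hs[i].needed < sizeof buf ? hs[i].needed : sizeof buf;
         ssize_t got = read (hs[i].fd, buf, want);
         if (got > 0) {
            hs[i].needed -= (size_t) got;
         }
      }
      if (hs[i].needed > 0) {
         pending++;
      }
   }

   return pending;
}
```

A caller would invoke `step_handshakes` repeatedly from its event loop until it returns 0 or a deadline passes, so a slow server delays only its own handshake and not the others.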
| Comments |
| Comment by Githook User [ 24/Jan/18 ] | |||||||||||||||||||||||||||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'} Message: The driver now parallelizes its initial TLS connections with all hosts This change also avoids what appears to be a potential timeout when Finally, we now pre-allocate 17k receive buffers large enough to handle | |||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 05/Jan/18 ] | |||||||||||||||||||||||||||||||
|
I've fixed OpenSSL. Secure Transport & Secure Channel are next. | |||||||||||||||||||||||||||||||
| Comment by Githook User [ 05/Jan/18 ] | |||||||||||||||||||||||||||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'} Message: | |||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 02/Mar/17 ] | |||||||||||||||||||||||||||||||
|
I've prototyped a solution for OpenSSL but got stuck on adapting it to Apple's Secure Transport. There, if I set the read timeout to 0 and my read callback returns "errSSLWouldBlock" during the SSLHandshake call, the handshake fails with error -9806, "errSSLClosedAbort: The connection closed due to an error." I can't figure out why my approach doesn't work. I've asked Apple for help, but they are slow to answer. I haven't yet attempted Microsoft's Secure Channel. | |||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 15/Dec/16 ] | |||||||||||||||||||||||||||||||
|
The problems start when mongoc_async_cmd_tls_setup passes its timeout (default 10 seconds, the connectTimeoutMS) to mongoc_stream_tls_handshake, which sets mongoc_stream_tls_t's timeout. Then in mongoc_stream_tls_openssl_handshake we call BIO_do_handshake, which calls down through the OpenSSL code until it wants some number of bytes, and calls back to us in mongoc_stream_tls_openssl_bio_read. There, we retrieve the 10-second timeout from the mongoc_stream_tls_t and pass it to mongoc_stream_read, which blocks until it's read the number of bytes OpenSSL wants, or until it times out. OpenSSL continues to call our mongoc_stream_tls_openssl_bio_read until the handshake is complete. If we set the timeout to 0 in mongoc_stream_tls_handshake instead, that makes the topology scanner concurrent. We also need a fix in mongoc_stream_tls_openssl_bio_read:
That change is necessary because somehow the BIO that OpenSSL passes to us is not the BIO we passed to OpenSSL. With this in place, we have the beginning of a truly concurrent topology scanner with TLS. In single-threaded mode, setting the mongoc_stream_tls_t's timeout to 0 might interfere with application operations, since the client reuses the same streams for topology scans and for operations. We'll need to check on that. I'm now pretty sure the same bug applies to all TLS implementations, not just OpenSSL. |
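The effect of the timeout described above can be illustrated with a small standalone sketch. This is not `mongoc_stream_read` or the driver's BIO callback: `read_with_timeout` is a hypothetical helper, and the sketch assumes a POSIX system with `poll`. With a positive timeout the call waits for data, which is the blocking behavior seen during the handshake. With a timeout of 0 it returns at once and reports `EAGAIN` when nothing is ready.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not a libmongoc function.
 * Reads up to len bytes from fd, waiting at most timeout_msec for data.
 * Returns the number of bytes read, 0 on end of file, or -1 with errno
 * set. With timeout_msec == 0 it never waits: if no data is ready it
 * returns -1 with errno == EAGAIN. */
ssize_t
read_with_timeout (int fd, void *buf, size_t len, int timeout_msec)
{
   struct pollfd pfd;
   int ready;

   assert (buf != NULL);

   pfd.fd = fd;
   pfd.events = POLLIN;
   pfd.revents = 0;

   ready = poll (&pfd, 1, timeout_msec);
   if (ready < 0) {
      return -1; /* poll failed, errno is already set */
   }
   if (ready == 0) {
      errno = EAGAIN; /* nothing to read within the allowed time */
      return -1;
   }

   return read (fd, buf, len);
}
```

In a real OpenSSL BIO read callback, the would-block case is conventionally reported by calling `BIO_set_retry_read` on the BIO and returning -1, so that the handshake can be resumed later. That part is left out here because it requires OpenSSL.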