[CDRIVER-1956] Topology scanner's SSL handshake is blocking Created: 15/Dec/16  Updated: 15/Nov/18  Resolved: 24/Jan/18

Status: Closed
Project: C Driver
Component/s: tls
Affects Version/s: None
Fix Version/s: 1.10.0

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by CDRIVER-1409 Test that topology scanner is still a... Closed
Related
is related to CDRIVER-2394 /Topology/invalidate_server/ tests ar... Closed
is related to CDRIVER-2885 Topology scanner's SSL handshake is b... Backlog

 Description   

The topology scanner should fan out to all servers and check them all concurrently using non-blocking I/O. However, our implementation of the TLS handshake operation blocks, waiting for the initial connection to complete and to reach a certain step in the TLS protocol. This means that high-latency replica set members slow down the topology scanner more than expected.

Test:

cd to the mongo-c-driver root dir and start a mongod:

mongod --sslOnNormalPorts --sslPEMKeyFile tests/x509gen/server.pem --sslCAFile     tests/x509gen/ca.pem

Update mongoc_stream_tls_openssl_handshake:

   time_t time_ptr;
   time_ptr = time(NULL); printf ("start handshake %s", ctime (&time_ptr));
   if (BIO_do_handshake (openssl->bio) == 1) {
      time_ptr = time(NULL); printf ("handshake succeeds %s", ctime (&time_ptr));
      if (_mongoc_openssl_check_cert (
             ssl, host, tls->ssl_opts.allow_invalid_hostname)) {
         RETURN (true);
      }
 
      *events = 0;
      bson_set_error (error,
                      MONGOC_ERROR_STREAM,
                      MONGOC_ERROR_STREAM_SOCKET,
                      "TLS handshake failed: Failed certificate verification");
 
      RETURN (false);
   }
 
   if (BIO_should_retry (openssl->bio)) {
      time_ptr = time(NULL); printf ("handshake should retry %s", ctime (&time_ptr));
      *events = BIO_should_read (openssl->bio) ? POLLIN : POLLOUT;
      RETURN (false);
   }

Slow down the network, on Linux:

sudo tc qdisc add dev lo root netem delay 300ms

Or on Mac:

sudo pfctl -E
(cat /etc/pf.conf && echo "dummynet-anchor \"foo\"" && echo "anchor \"foo\"") | sudo pfctl -f -
echo "dummynet in quick proto tcp from any to any port 27017 pipe 1" | sudo pfctl -a foo -f -
sudo dnctl pipe 1 config bw 20000bit/s

Ignore the warnings about "No ALTQ support in kernel", etc.

The shell should now connect, slowly:

mongo --ssl --sslPEMKeyFile tests/x509gen/client.pem --sslCAFile tests/x509gen/ca.pem --host localhost

Now recompile and run a test:

export MONGOC_TEST_URI=mongodb://localhost:27017,localhost:27017 
export MONGOC_TEST_SSL_PEM_FILE=tests/x509gen/client.pem 
export MONGOC_TEST_SSL_CA_FILE=tests/x509gen/ca.pem
./test-libmongoc --no-fork -l /Client/select_server/single

Listing "localhost:27017" twice lets us see if the topology scanner begins both handshakes concurrently and then both succeed (as expected) or if it begins and completes one handshake, then the other handshake (the bug). In fact, this is what I see with OpenSSL 1.0.1f on Ubuntu 16.04:

start handshake Thu Dec 15 02:37:12 2016
handshake succeeds Thu Dec 15 02:37:14 2016
start handshake Thu Dec 15 02:37:16 2016
handshake succeeds Thu Dec 15 02:37:17 2016

There are two symptoms of the blocking handshake. First, we see one handshake begin, block for two seconds, then succeed, before the other begins. Second, we expect the function to print "handshake should retry" but it doesn't.

This blocking behavior is even seen if we add this, although it blocks for a shorter duration:

export MONGOC_TEST_SSL_WEAK_CERT_VALIDATION=on

I've made some attempts to fix this like so, with no effect:

BIO_set_nbio (openssl->bio, 1);

There is a reference to a bug with this function and BIO_do_handshake from 15 years ago and another from 9 years ago that I do not believe can apply to OpenSSL 1.0.1.

I also tried deleting this line from _mongoc_openssl_ctx_new:

SSL_CTX_set_mode (ctx, SSL_MODE_AUTO_RETRY);

Also no effect on the topology scanner.

Return your network to normalcy, on Linux:

sudo tc qdisc del dev lo root

Or on Mac:

sudo dnctl -f flush
sudo pfctl -f /etc/pf.conf

Update now resolved for OpenSSL and Windows Secure Channel implementations. The handshake is now parallelized by setting the timeout value to 0 during handshake. Additionally, I've increased throughput with SChannel by increasing the initial receive buffer size, and perhaps fixed a latent bug in the previous code that could cause a deadlock if the buffer grew large enough to receive multiple SSL blocks.

The Apple Secure Transport implement still handshakes connections serially, but I'm not fixing this at the moment. Mac OS X is used for development of C Driver applications, not for production.



 Comments   
Comment by Githook User [ 24/Jan/18 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: CDRIVER-1956 nonblocking secure channel handshake

The driver now parallelizes its initial TLS connections with all hosts
in the URI, when using native Windows TLS. OpenSSL parallel handshakes
have already been implemented, but native Apple TLS (Secure Transport)
is still blocking.

This change also avoids what appears to be a potential timeout when
reading a small chunk of data from the server after reading a large one:
if the receive buffer had grown large enough to receive an entire reply
but the driver had only read part of it, the driver would subsequently
try to read from the socket instead of reading from the buffer, even
though the server had already sent its whole message.

Finally, we now pre-allocate 17k receive buffers large enough to handle
a whole encrypted block, avoiding small reads and small reallocs.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/0e02389c9b2aed263d50aeb25f6755b1bf08a905

Comment by A. Jesse Jiryu Davis [ 05/Jan/18 ]

I've fixed OpenSSL. Secure Transport & Secure Channel are next.

Comment by Githook User [ 05/Jan/18 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}

Message: CDRIVER-1956 async openssl handshake
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/f431967858e6a2a9278176c9b5bb77e2f02a04ad

Comment by A. Jesse Jiryu Davis [ 02/Mar/17 ]

I've prototyped a solution for OpenSSL but got stuck on adapting it to Apple's Secure Transport. There, if I set the read timeout to 0 and my read callback returns "errSSLWouldBlock" during the SSLHandshake call, the handshake fails with error -9806, "errSSLClosedAbort: The connection closed due to an error." I can't figure out why my approach doesn't work, I've asked Apple for help but they are slow to answer.

I haven't yet attempted Microsoft's Secure Channel.

Comment by A. Jesse Jiryu Davis [ 15/Dec/16 ]

The problems start when mongoc_async_cmd_tls_setup passes its timeout (default 10 seconds, the connectTimeoutMS) to mongoc_stream_tls_handshake, which sets mongoc_stream_tls_t's timeout.

Then in mongoc_stream_tls_openssl_handshake we call BIO_do_handshake, which calls down through the OpenSSL code until it wants some number of bytes, and calls back to us in mongoc_stream_tls_openssl_bio_read. There, we retrieve the 10-second timeout from the mongoc_stream_tls_t and pass it to mongoc_stream_read, which blocks until it's read the number of bytes OpenSSL wants, or until it times out. OpenSSL continues to call our mongoc_stream_tls_openssl_bio_read until the handshake is complete.

If we set the timeout to 0 in mongoc_stream_tls_handshake instead, that makes the topology scanner concurrent. We also need a fix in mongoc_stream_tls_openssl_bio_read:

 int
 mongoc_stream_tls_openssl_bio_read (BIO *b, char *buf, int len)
 {
    mongoc_stream_tls_t *tls;
+   mongoc_stream_tls_openssl_t *openssl;
    int ret;
 
    BSON_ASSERT (b);
    BSON_ASSERT (buf);
    ENTRY;
 
    tls = (mongoc_stream_tls_t *) BIO_get_data (b);
 
    if (!tls) {
       RETURN (-1);
    }
 
+   openssl = (mongoc_stream_tls_openssl_t *) tls->ctx;
+
    errno = 0;
    ret = (int) mongoc_stream_read (
       tls->base_stream, buf, len, 0, tls->timeout_msec);
    BIO_clear_retry_flags (b);
 
    if ((ret <= 0) && MONGOC_ERRNO_IS_AGAIN (errno)) {
-      BIO_set_retry_read (b);
+      BIO_set_retry_read (openssl->bio);
    }
 
    RETURN (ret);
 }

That change is necessary because somehow the BIO that OpenSSL passes to us is not the BIO we passed to OpenSSL. With this in place, we have the beginning of a truly concurrent topology scanner with TLS.

In single-threaded mode, setting the mongoc_stream_tls_t's timeout to 0 might interfere with application operations, since the client reuses the same streams for topology scans and for operations. We'll need to check on that.

I'm now pretty sure the same bug applies to all TLS implementations, not just OpenSSL.

Generated at Wed Feb 07 21:13:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.