CDRIVER-1956 (C Driver)

Topology scanner's SSL handshake is blocking

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 1.10.0
    • Affects Version/s: None
    • Component/s: tls
    • Labels: None

      The topology scanner should fan out to all servers and check them all concurrently using non-blocking I/O. However, our implementation of the TLS handshake operation blocks, waiting for the initial connection to complete and to reach a certain step in the TLS protocol. This means that high-latency replica set members slow down the topology scanner more than expected.
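
      The intended fan-out can be sketched with plain POSIX poll(). This is a minimal illustration, not libmongoc code: scan_all and MAX_NODES are hypothetical names, and "reading one byte" stands in for whatever reply the scanner expects from each server. The point is that a single poll() call covers every pending descriptor, so a slow server does not hold up the others.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_NODES 16 /* hypothetical cap, for illustration only */

/* Hypothetical helper, not part of libmongoc. Waits on all `n` descriptors
 * at once and reads one byte from each as soon as it is readable. Returns
 * how many descriptors delivered a byte before a poll() call timed out. */
size_t
scan_all (const int *fds, size_t n, int timeout_ms)
{
   struct pollfd pfds[MAX_NODES];
   size_t replied = 0;
   size_t pending = n;
   size_t i;

   assert (n <= MAX_NODES);

   for (i = 0; i < n; i++) {
      pfds[i].fd = fds[i];
      pfds[i].events = POLLIN;
      pfds[i].revents = 0;
   }

   while (pending > 0) {
      /* One poll() covers every pending node, so a slow node does not
       * delay the others. */
      if (poll (pfds, (nfds_t) n, timeout_ms) <= 0) {
         break; /* timeout or error */
      }

      for (i = 0; i < n; i++) {
         char byte;

         if (pfds[i].fd < 0 || pfds[i].revents == 0) {
            continue; /* already finished, or nothing happened yet */
         }
         if (read (pfds[i].fd, &byte, 1) == 1) {
            replied++;
         }
         pfds[i].fd = -1; /* poll() ignores negative descriptors */
         pending--;
      }
   }

   return replied;
}
```

      With this shape, the total wait is bounded by the slowest server rather than by the sum of all servers, which is the behavior the blocking handshake defeats.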

      Test:

      cd to the mongo-c-driver root dir and start a mongod:

      mongod --sslOnNormalPorts --sslPEMKeyFile tests/x509gen/server.pem --sslCAFile tests/x509gen/ca.pem
      

      Add timing output to mongoc_stream_tls_openssl_handshake:

         /* Temporary debug output; requires <stdio.h> and <time.h>.
          * ctime() already ends its string with a newline. */
         time_t now;

         now = time (NULL);
         printf ("start handshake %s", ctime (&now));
         if (BIO_do_handshake (openssl->bio) == 1) {
            now = time (NULL);
            printf ("handshake succeeds %s", ctime (&now));
            if (_mongoc_openssl_check_cert (
                   ssl, host, tls->ssl_opts.allow_invalid_hostname)) {
               RETURN (true);
            }

            /* Hard failure: no events to wait for. */
            *events = 0;
            bson_set_error (error,
                            MONGOC_ERROR_STREAM,
                            MONGOC_ERROR_STREAM_SOCKET,
                            "TLS handshake failed: Failed certificate verification");

            RETURN (false);
         }

         if (BIO_should_retry (openssl->bio)) {
            now = time (NULL);
            printf ("handshake should retry %s", ctime (&now));
            /* Tell the caller which event to poll for before retrying. */
            *events = BIO_should_read (openssl->bio) ? POLLIN : POLLOUT;
            RETURN (false);
         }
      

      Slow down the network, on Linux:

      sudo tc qdisc add dev lo root netem delay 300ms
      

      Or on Mac:

      sudo pfctl -E
      (cat /etc/pf.conf && echo "dummynet-anchor \"foo\"" && echo "anchor \"foo\"") | sudo pfctl -f -
      echo "dummynet in quick proto tcp from any to any port 27017 pipe 1" | sudo pfctl -a foo -f -
      sudo dnctl pipe 1 config bw 20000bit/s
      

      Ignore the warnings about "No ALTQ support in kernel", etc.

      The shell should now connect, slowly:

      mongo --ssl --sslPEMKeyFile tests/x509gen/client.pem --sslCAFile tests/x509gen/ca.pem --host localhost
      

      Now recompile and run a test:

      export MONGOC_TEST_URI=mongodb://localhost:27017,localhost:27017 
      export MONGOC_TEST_SSL_PEM_FILE=tests/x509gen/client.pem 
      export MONGOC_TEST_SSL_CA_FILE=tests/x509gen/ca.pem
      ./test-libmongoc --no-fork -l /Client/select_server/single
      

      Listing "localhost:27017" twice shows whether the topology scanner begins both handshakes concurrently and then both succeed (expected), or whether it begins and completes one handshake before starting the other (the bug). This is what I see with OpenSSL 1.0.1f on Ubuntu 16.04:

      start handshake Thu Dec 15 02:37:12 2016
      handshake succeeds Thu Dec 15 02:37:14 2016
      start handshake Thu Dec 15 02:37:16 2016
      handshake succeeds Thu Dec 15 02:37:17 2016
      

      There are two symptoms of the blocking handshake. First, we see one handshake begin, block for two seconds, then succeed, before the other begins. Second, we expect the function to print "handshake should retry" but it doesn't.
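
      The first symptom can be checked mechanically from the printed timestamps. A minimal sketch, assuming each handshake is reduced to a start and an end time in seconds (handshake_span_t and spans_overlap are hypothetical names, not part of libmongoc): two handshakes ran concurrently only if their intervals overlap.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical record of one handshake, taken from the debug output. */
typedef struct {
   time_t start; /* "start handshake" timestamp */
   time_t end;   /* "handshake succeeds" timestamp */
} handshake_span_t;

/* True if the two handshakes were in progress at the same time. */
bool
spans_overlap (handshake_span_t a, handshake_span_t b)
{
   return a.start < b.end && b.start < a.end;
}
```

      For the output above, the spans 02:37:12 to 02:37:14 and 02:37:16 to 02:37:17 do not overlap, which matches the serial behavior described.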

      The handshake still blocks, though for a shorter time, if we also set:

      export MONGOC_TEST_SSL_WEAK_CERT_VALIDATION=on
      

      I tried to fix this by putting the BIO in non-blocking mode, with no effect:

      BIO_set_nbio (openssl->bio, 1);
      

      There are references to a bug involving this function and BIO_do_handshake, one from 15 years ago and another from 9 years ago, but I do not believe either can apply to OpenSSL 1.0.1.

      I also tried deleting this line from _mongoc_openssl_ctx_new:

      SSL_CTX_set_mode (ctx, SSL_MODE_AUTO_RETRY);
      

      Also no effect on the topology scanner.

      Return your network to normal, on Linux:

      sudo tc qdisc del dev lo root
      

      Or on Mac:

      sudo dnctl -f flush
      sudo pfctl -f /etc/pf.conf
      

      Update: this is now resolved for the OpenSSL and Windows Secure Channel implementations. The handshakes now run in parallel because the timeout value is set to 0 during the handshake. Additionally, I've increased throughput with SChannel by increasing the initial receive buffer size, and perhaps fixed a latent bug in the previous code that could cause a deadlock if the buffer grew large enough to receive multiple SSL blocks.
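
      To illustrate why a zero timeout helps, here is a minimal POSIX sketch. It is not the actual patch: read_with_timeout is a hypothetical name, and I am assuming the stream's read path waits with poll() before reading. With timeout_ms set to 0, the call returns immediately with EAGAIN when no data has arrived, so the caller can move on to the next server and come back later.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper, not part of libmongoc. Returns the number of bytes
 * read, 0 on end of file, or -1 with errno set. When nothing is readable
 * within `timeout_ms`, errno is EAGAIN. A timeout of 0 never waits. */
ssize_t
read_with_timeout (int fd, void *buf, size_t len, int timeout_ms)
{
   struct pollfd pfd;
   int ready;

   pfd.fd = fd;
   pfd.events = POLLIN;
   pfd.revents = 0;

   ready = poll (&pfd, 1, timeout_ms);
   if (ready < 0) {
      return -1; /* errno was set by poll() */
   }
   if (ready == 0) {
      errno = EAGAIN; /* not ready yet: caller should retry later */
      return -1;
   }

   return read (fd, buf, len);
}
```

      A nonzero timeout in the same position would make each handshake step wait for its own server before any other server is serviced, which is the serial behavior reported above.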

      The Apple Secure Transport implementation still handshakes connections serially, but I'm not fixing this at the moment. Mac OS X is used for development of C Driver applications, not for production.

            Assignee: A. Jesse Jiryu Davis (jesse@mongodb.com)
            Reporter: A. Jesse Jiryu Davis (jesse@mongodb.com)