[CDRIVER-2624] Topology scanner renegotiates TLS on each heartbeat Created: 23/Apr/18  Updated: 28/Oct/23  Resolved: 24/Apr/18

Status: Closed
Project: C Driver
Component/s: tls
Affects Version/s: 1.8.0
Fix Version/s: 1.10.0

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by PHPC-1169 TLS renegotiation errors during heart... Closed
Related
related to CDRIVER-2934 Disable TLS renegotiation when possible Closed

 Description   

Every heartbeatFrequencyMS, the driver calls "isMaster" on all established monitoring connections. At least with Windows SChannel, and perhaps with OpenSSL and Apple's Secure Transport as well, the driver begins a new TLS handshake even though the connection is already handshaken. When the server is built with Secure Transport or SChannel (which landed circa 3.7.4), it rejects this renegotiation and closes the connection.

Servers using OpenSSL (which until recently was the server's only TLS implementation) will allow the renegotiation, so we didn't see errors from this bug in the past. Even now when the server might hang up on the client, the driver will usually recover by opening a new monitoring connection almost immediately, so we rarely see problems.

This was discovered testing CDRIVER-2404: since the driver can take more than heartbeatFrequencyMS to generate 10,001 client sessions, and since the driver is not allowed to reconnect in order send the endSessions command, it can fail to send the command and cause a test failure.



 Comments   
Comment by Githook User [ 24/Apr/18 ]

Author:

{'email': 'jesse@mongodb.com', 'username': 'ajdavis', 'name': 'A. Jesse Jiryu Davis'}

Message: CDRIVER-2624 don't re-handshake TLS stream
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/c9406ff178d6a043cffb0b5d318da09a67e9372a

Comment by A. Jesse Jiryu Davis [ 24/Apr/18 ]

The driver renegotiates TLS on every server check - that is, every heartbeatFrequencyMS or every time a server check is requested for any other reason (according to SDAM logic). It does renegotiate if the driver is built with SChannel and Secure Transport, but does not renegotiate with OpenSSL. We lucked out with OpenSSL: we keep calling BIO_do_handshake but that function does nothing if the connection is already handshaken. I bet LibreSSL inherits OpenSSL's behavior here.

There is no purpose to renegotiation, it's a bug.

I'm semi-confident this bug is only caused by client behavior, and will cause disconnects with any MongoDB server configuration, based on the Atlas test I just did.

TLS renegotiation never occurs on application sockets, only monitoring sockets. For single clients, where sockets are shared, this will cause a renegotiation and disconnect on the shared sockets when they are used for monitoring. The driver will recover in both single and pooled mode because of safeguards like CDRIVER-2075, often without visible errors.

Comment by Jeremy Mikola [ 24/Apr/18 ]

At least with Windows SChannel, and perhaps with OpenSSL and Apple's Secure Transport as well, the driver begins a new TLS handshake even though the connection is already handshaken. When the server is built with Secure Transport or SChannel (which landed circa 3.7.4), it rejects this renegotiation and closes the connection.

Given that 3.6 servers in Atlas are running with OpenSSL, your last comment demonstrates that this is reproducible when libmongoc is using Secure Transport. If I understand the OP correctly, this bug originated when testing libmongoc (with Secure Channel) against a 3.7 server using Secure Channel or Secure Transport.

Is the issue caused by either the client or server using something other than OpenSSL, or this entirely a bug with libmongoc's Secure Channel and Secure Transport implementations? Is LibreSSL unaffected?


In the interest of improving my understanding of the subject, I have a few more questions. I'm not very familiar with TLS renegotiation, but this Stack Overflow thread explains it as:

It occurs after either side has expired the session and continues sending data. It means either that the session has simply expired due to timeout, or that a peer wants to change the cipher suite, or wants to request a peer certificate and hasn't already done so.

With regard to the driver, are we doing this for the benefit of long-running applications so that active connections can be kept alive when client and server certificates are updated externally? If so, I assume monitoring intervals are the sensible time to do so.

Per SDAM, we know that threaded clients conduct all monitoring on a set of sockets separate from the application pool. I assume that TLS renegotiation occurs on those monitoring sockets alongside each isMaster. If sockets in the application pool are not actively monitored (I assume we rely on TCP to keep them alive), when (if ever) does TLS renegotiation occur on those sockets?

With respect to single-threaded clients, where the monitoring and application sockets are one and the same, it sounds like we expect TLS renogatiation to happen during each monitoring event (just as with threaded clients). Is this only ever heartbeatFrequencyMS? I know single-threaded clients increase their monitoring frequency after socket errors (following cooldownMS), but I assume that would only entail initial TLS negotation since the sockets would be newly established.

Comment by A. Jesse Jiryu Davis [ 24/Apr/18 ]

Seems related, yes. I tried connecting to an Atlas server using C Driver 1.8.2 on Mac with Secure Transport and a URI like:

"mongodb://admin:PASSWORD@replicaset-shard-00-00-oztdp.mongodb-dev.net:27017,replicaset-shard-00-01-oztdp.mongodb-dev.net:27017,replicaset-shard-00-02-oztdp.mongodb-dev.net:27017/?replicaSet=replicaSet-shard-0&ssl=true&heartbeatfrequencyms=500&connectTimeoutMS=500&serverselectiontimeoutms=100"

This is the "atlas_replset" key in the mongo-c-driver Evergreen config, along with a particular set of timing fields meant to tickle the bug. I use this URI with a single mongoc_client_t and a loop like:

while (true) {
   cursor = mongoc_collection_find_with_opts (
      collection,
      &query,
      NULL, 
      NULL);
 
   while (mongoc_cursor_next (cursor, &doc)) {
      str = bson_as_canonical_extended_json (doc, NULL);
      fprintf (stdout, "%s\n", str);
      bson_free (str);
   }
 
   if (mongoc_cursor_error (cursor, &error)) {
      fprintf (stderr, "Cursor Failure: %s\n", error.message);
      return EXIT_FAILURE;
   }
 
   sleep(1);
}

I saw that the Atlas servers hang up on the driver when it tries to renegotiate TLS, and within a few seconds the program logs and quits:

Cursor Failure: No suitable servers found (`serverSelectionTryOnce` set): [connection timeout calling ismaster on 'replicaset-shard-00-00-oztdp.mongodb-dev.net:27017'] [connection timeout calling ismaster on 'replicaset-shard-00-01-oztdp.mongodb-dev.net:27017'] [TLS handshake failed: -9806 calling ismaster on 'replicaset-shard-00-02-oztdp.mongodb-dev.net:27017']

In pooled mode the driver is more resilient, but suffers frequent pauses of several tenths of seconds while it's recovering from the server hangup and waiting for the opportunity to reconnect.

With the patch applied that I'm developing for this bug, the same program is resilient and doesn't suffer these pauses.

Comment by Jeremy Mikola [ 23/Apr/18 ]

How does this manifest itself with regard to error messages from libmongoc? An Atlas user recently reported mongodb/mongo-php-driver#791 (off-shoot of MMSSUPPORT-19732) where they were experiencing random dropped connections and we've not been able to identify the cause. Their error messages typically had the form:

No suitable servers found (`serverSelectionTryOnce` set): [connection timeout calling ismaster on 'cluster0-shard-00-00-b2gpc.mongodb.net:27017'] [connection timeout calling ismaster on 'cluster0-shard-00-01-b2gpc.mongodb.net:27017'] [TLS handshake failed: -9806 calling ismaster on 'cluster0-shard-00-02-b2gpc.mongodb.net:27017']

Their PHP driver is using Secure Transport and error code -9806 corresponds to errSSLClosedAbort; however, that's about as far as we got. The fact that all connections timed out at once is interesting, although it could just as easily be a network connectivity issue on their application server.

I'm a bit baffled by the fact that we see only ony "TLS handshake failed" error message, but perhaps that's explained by Secure Transport (and Secure Channel) not yet supporting async negotiations.

Generated at Wed Feb 07 21:15:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.