[DRIVERS-2480] Mitigate negative effects of OCSP endpoint timeouts Created: 25/Oct/22  Updated: 08/Nov/22

Status: Backlog
Project: Drivers
Component/s: OCSP
Fix Version/s: None

Type: Spec Change Priority: Major - P3
Reporter: Jeremy Mikola Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to CDRIVER-4522 Possible improvements to mitigate neg... Backlog
is related to DRIVERS-1204 Easier debugging with standardized lo... Implementing
is related to DRIVERS-2494 Add OCSP logging Backlog
Driver Changes: Needed

 Description   

Summary

When OCSP stapling is unavailable, drivers may attempt to contact one or more OCSP endpoints. Per Suggested OCSP Behavior, the default timeout is five seconds.

Drivers use connectTimeoutMS as the timeout for the connection handshake (Server Monitoring spec), and the handshake includes TLS (Handshake spec). Therefore, an inaccessible OCSP endpoint could add up to five seconds to the handshake.

If the application is using a smaller connectTimeoutMS value, an inaccessible OCSP endpoint could prevent the driver from establishing a connection to the server. This is irrespective of whether a driver has "soft fail" behavior (i.e. TLS continues if OCSP cannot complete). Drivers with "soft fail" behavior would allow the connection to continue after hitting an OCSP timeout, but only if connectTimeoutMS has not been exhausted.
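The interaction described above can be sketched as a toy time budget. This is a simplified model, not any driver's actual implementation: the function name, the `other_handshake_ms` figure, and the return shape are assumptions made for illustration. Only connectTimeoutMS and the five second OCSP default come from the text.

```python
def simulate_handshake(connect_timeout_ms, ocsp_timeout_ms=5000,
                       other_handshake_ms=200, soft_fail=True):
    """Toy model of a TLS handshake whose OCSP endpoint never answers.

    Returns (connected, elapsed_ms, reason). other_handshake_ms is a
    made-up figure for the rest of the handshake (TCP, TLS, hello).
    """
    # Time the handshake would need if it waited out the full OCSP timeout.
    needed_ms = other_handshake_ms + ocsp_timeout_ms
    if needed_ms > connect_timeout_ms:
        # The connection deadline fires first, so soft fail never applies.
        return (False, connect_timeout_ms, "connectTimeoutMS exhausted")
    if soft_fail:
        # OCSP gave up in time; soft fail lets TLS continue.
        return (True, needed_ms, "OCSP timed out, soft fail")
    return (False, needed_ms, "OCSP timed out, hard fail")


print(simulate_handshake(connect_timeout_ms=2000))
print(simulate_handshake(connect_timeout_ms=10000))
```

In this model a connectTimeoutMS of 2000 fails regardless of soft fail, while 10000 connects after waiting out the OCSP timeout.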

When this was observed in a customer report involving the PHP driver, there was originally no indication that TLS/OCSP was involved, as the problem manifested itself as a server selection failure due to a socket timeout while attempting to establish a connection. We ultimately confirmed the issue thanks to libmongoc trace logs.

There are several courses of action we might consider to address this:

  • Allow OCSP timeouts to be configurable (if supported by a driver's TLS library)
  • Provide documentation to educate users on the interaction between OCSP and connection timeouts. If OCSP timeouts cannot be configured, users should be aware that the five second default might exhaust connectTimeoutMS
  • Note that tlsDisableOCSPEndpointCheck and tlsDisableCertificateRevocationCheck may be used to work around this issue. In the related PHP issue, the customer used tlsAllowInvalidCertificates, which is inadvisable because it disables much more than OCSP.
  • Add logging for OCSP. There is presently no ticket to add log messages to the OCSP spec (see: Logging component and linked issues in DRIVERS-1204).
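As a rough illustration of the workaround in the third item, the sketch below builds a connection string that sets tlsDisableOCSPEndpointCheck. The helper name and the host `db.example.com` are hypothetical. The sketch only assembles a URI and does not connect to anything.

```python
from urllib.parse import urlencode


def build_uri(host, options):
    """Assemble a mongodb:// URI from a host and a dict of URI options."""
    return f"mongodb://{host}/?{urlencode(options)}"


# Narrow workaround: keep certificate validation, skip only the OCSP
# endpoint request that can stall the handshake.
uri = build_uri(
    "db.example.com:27017",  # hypothetical host
    {"tls": "true", "tlsDisableOCSPEndpointCheck": "true"},
)
print(uri)
```

Preferring this option over tlsAllowInvalidCertificates keeps the rest of certificate validation in place.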

Note: the Client Side Operations Timeout spec may influence OCSP timeouts; however, even if OCSP timeouts are configurable (and dynamically scale down based on the remaining timeoutMS), I think we'd still face an issue with exposing the source of the timeout. In that case, the action items for documentation and logging may still be worth addressing.
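One way to picture the "scale down" idea is to cap the OCSP timeout at whatever remains of the operation's budget. This is a speculative sketch, not behavior taken from the Client Side Operations Timeout spec. The function name and its parameters are assumptions.

```python
DEFAULT_OCSP_TIMEOUT_MS = 5000  # five second default from Suggested OCSP Behavior


def effective_ocsp_timeout_ms(remaining_ms, configured_ms=DEFAULT_OCSP_TIMEOUT_MS):
    """Cap the OCSP timeout at the remaining budget, never below zero."""
    return min(configured_ms, max(0, remaining_ms))


print(effective_ocsp_timeout_ms(1500))   # budget smaller than the default
print(effective_ocsp_timeout_ms(20000))  # budget larger than the default
```

Even with such a cap, a timeout still surfaces to the application as a generic connection failure unless the driver reports OCSP as the cause.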

Motivation

Who is the affected end user?

Applications using TLS with OCSP but without OCSP stapling.

How does this affect the end user?

OCSP timeouts could prevent the driver from making server connections by exhausting the connection timeout.

How likely is it that this problem or use case will occur?

This is rare, but could happen due to many factors: app server firewall preventing outgoing HTTP requests, OCSP server experiencing downtime, high latency contacting the OCSP server.

If the problem does occur, what are the consequences and how severe are they?

Ranges from merely delaying a connection to preventing it entirely.

Is this issue urgent?

No.

Is this ticket required by a downstream team?

No.

Is this ticket only for tests?

No.



 Comments   
Comment by Kaitlin Mahar [ 03/Nov/22 ]

jmikola@mongodb.com, I filed and linked DRIVERS-2494 to cover OCSP logging specifically.

Comment by Tom Selander [ 25/Oct/22 ]

Leads Triage: Backlogging this for now, we may decide to go with the third suggested option.

Comment by Jeremy Mikola [ 25/Oct/22 ]

kaitlin.mahar@mongodb.com: I'm not sure if OCSP logging would fall under SDAM (DRIVERS-1670) (as I expect the handshake spec might), but if not, you may want to create a separate ticket for OCSP and link it up here.
