CDRIVER-4522 (C Driver)

Possible improvements to mitigate negative effects of OCSP endpoint timeouts

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: OCSP
    • Labels: None

      While investigating a related HELP ticket, I realized that the design of libmongoc's OCSP checks contributed to a server selection failure. This may be related to DRIVERS-2480 (which covers mitigating OCSP endpoint failures more generally), but I created a separate issue because this one pertains specifically to libmongoc's internals.

      _contact_ocsp_responder() calls _mongoc_http_send(), which uses mongoc_client_connect_tcp(). That's a common function also used to establish MongoDB server connections. Within that function, there is a call to getaddrinfo(3) for DNS resolution, followed by a loop across its results until a successful socket connection is established. Each of those attempts uses the original timeout (i.e. five seconds for OCSP). The trace logs in the related HELP ticket don't reveal the DNS results that were attempted, but we do see that the first two attempts exhausted their five-second timeout and the last two failed quickly with "101 Network is unreachable" errors.
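
      To make the timing concrete, here is a minimal model of that sequential loop. It is not libmongoc code: attempt_t and connect_sequential() are hypothetical names, and each resolved address is reduced to "how long until it answers" and "does it connect", with no real sockets involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one resolved address (not a libmongoc type). */
typedef struct {
   int latency_ms; /* time until the attempt resolves; negative = never answers */
   bool succeeds;  /* whether the attempt connects once it resolves */
} attempt_t;

/* Walk the addresses in order, giving EACH attempt the full timeout.
 * Returns the index of the first address that connects, or -1 if none do.
 * *elapsed_ms receives the total time spent across all attempts. */
static int
connect_sequential (const attempt_t *addrs, size_t n, int timeout_ms, int *elapsed_ms)
{
   assert (addrs != NULL && elapsed_ms != NULL && timeout_ms >= 0);
   *elapsed_ms = 0;

   for (size_t i = 0; i < n; i++) {
      if (addrs[i].latency_ms < 0 || addrs[i].latency_ms > timeout_ms) {
         /* The attempt hangs until its own, freshly started timeout expires. */
         *elapsed_ms += timeout_ms;
         continue;
      }
      *elapsed_ms += addrs[i].latency_ms;
      if (addrs[i].succeeds) {
         return (int) i;
      }
   }
   return -1;
}
```

      Fed the pattern seen in the trace logs (two addresses that never answer, then two that fail immediately, with a 5000 ms timeout), this model returns -1 after 10000 ms.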

      Several factors combined to prevent the OP from connecting to their cluster when the OCSP endpoint was inaccessible:

      • DNS resolution on the OCSP hostname prompted libmongoc to make several connection attempts, which are not parallelized.
      • A five-second delay on two or more OCSP hosts is sufficient to exhaust the default 10-second connection timeout.
      • Multiple MongoDB hosts from the initial SRV lookup aren't directly responsible for connection timeouts, but they do exacerbate the situation by making it more likely to exhaust the 30-second server selection timeout (vs. a single-host seed list that might fail after 10 seconds).

      With respect to libmongoc, I have the following questions:

      • Should the timeout option for mongoc_client_connect_tcp apply across all attempts instead of to each one? Would this even be worth considering in light of upcoming client-side operation timeout work (CDRIVER-3786)?
      • More generally, would it be feasible to parallelize connection attempts in mongoc_client_connect_tcp? That could be wasteful for the "good" path since the function only needs to establish a single stream.
      • Is there any way to parallelize OCSP checks (assuming it's worth doing)? I'll note that OpenSSL TLS handshakes were made non-blocking in libmongoc 1.10 by CDRIVER-1956 (other implementations such as Secure Transport are still blocking per CDRIVER-2885), but that pre-dated any OCSP work.
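
      As a rough way to compare the first two options, the sketch below models them without real sockets. Everything in it is an assumption for illustration: attempt_t, connect_shared_deadline() and connect_parallel() are hypothetical names, not libmongoc APIs, and an address is reduced to a latency and a success flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one resolved address (not a libmongoc type). */
typedef struct {
   int latency_ms; /* time until the attempt resolves; negative = never answers */
   bool succeeds;  /* whether the attempt connects once it resolves */
} attempt_t;

/* Option 1: sequential attempts that share ONE timeout budget. */
static int
connect_shared_deadline (const attempt_t *addrs, size_t n, int timeout_ms, int *elapsed_ms)
{
   assert (addrs != NULL && elapsed_ms != NULL && timeout_ms >= 0);
   *elapsed_ms = 0;

   for (size_t i = 0; i < n; i++) {
      int remaining = timeout_ms - *elapsed_ms;
      if (remaining <= 0) {
         break; /* budget exhausted; later addresses are never tried */
      }
      if (addrs[i].latency_ms < 0 || addrs[i].latency_ms > remaining) {
         *elapsed_ms += remaining; /* attempt consumes what is left */
         continue;
      }
      *elapsed_ms += addrs[i].latency_ms;
      if (addrs[i].succeeds) {
         return (int) i;
      }
   }
   return -1;
}

/* Option 2: start every attempt at once; the earliest success wins.
 * If nothing connects, the caller waits for the slowest failure or timeout. */
static int
connect_parallel (const attempt_t *addrs, size_t n, int timeout_ms, int *elapsed_ms)
{
   assert (addrs != NULL && elapsed_ms != NULL && timeout_ms >= 0);
   int winner = -1;
   int best_ms = timeout_ms;
   int slowest_ms = 0;

   for (size_t i = 0; i < n; i++) {
      bool answers = addrs[i].latency_ms >= 0 && addrs[i].latency_ms <= timeout_ms;
      int done_ms = answers ? addrs[i].latency_ms : timeout_ms;

      if (answers && addrs[i].succeeds && (winner < 0 || done_ms < best_ms)) {
         winner = (int) i;
         best_ms = done_ms;
      }
      if (done_ms > slowest_ms) {
         slowest_ms = done_ms;
      }
   }
   *elapsed_ms = winner >= 0 ? best_ms : slowest_ms;
   return winner;
}
```

      In this model, a shared budget caps the total wait at the timeout, but a single hanging address at the front of the list can use up the whole budget before a reachable address further down is tried. The parallel variant avoids that, at the cost of opening connections that are discarded on the "good" path.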

            Assignee: Unassigned
            Reporter: Jeremy Mikola (jmikola@mongodb.com)