[CDRIVER-3566] Add five second timeout to OCSP requests for SChannel Created: 13/Mar/20  Updated: 25/Sep/23

Status: Backlog
Project: C Driver
Component/s: tests
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Kevin Albertson Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-47934 Create a setParameter to control the ... Closed
Related
Server Compat: 4.4, 4.7

 Description   

Add a five second timeout to OCSP responder requests in schannel, as is recommended in the OCSP spec. This was done for OpenSSL in CDRIVER-3739.

The default timeout was the cause frequent test failures in Client Side Encryption tests in the /client_side_encryption/basic test.

Original ticket summary: /client_side_encryption/basic failures on Windows
Original ticket description:
With log messages like this:

[2020/03/13 11:42:33.388] test error in: C:\data\mci\9cdd471b128edd4f870699963532dbc4\mongoc\src\libmongoc\tests\json-test-operations.c 574:check_result()
[2020/03/13 11:42:33.388] Error in "Insert with deterministic encryption, then find it" test : expected 1 keys, not 0
[2020/03/13 11:42:33.388] Expected:
[2020/03/13 11:42:33.388] { "0" : { "_id" : 1, "encrypted_string" : "string0" } }
[2020/03/13 11:42:33.388] Actual:
[2020/03/13 11:42:33.388] { }
[2020/03/13 11:42:33.397] Begin /client_side_encryption/basic, seed 1584113505
[2020/03/13 11:42:33.397]     { "status": "fail", "test_file": "/client_side_encryption/basic", "seed": "1584113505", "start": 534.869704, "end": 564.229704, "elapsed": 29.360000  },

Though other Client Side Encryption spec tests pass...

Example 1
Example 2



 Comments   
Comment by Kevin Albertson [ 25/Sep/23 ]

Proposed change to apply OCSP timeout still applies.

Workaround is committed to retry failed KMS requests with Secure Channel.

No observed failures on 2023-08-01 to 2023-09-25 for cse-matrix-winssl variant matching "/client_side_encryption/basic" test.

Removing flaky-tests label and Stabilize Evergreen Epic.

Comment by Githook User [ 25/May/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-3566 retry KMS request with schannel
Branch: r1.17
https://github.com/mongodb/mongo-c-driver/commit/e3f803414977b1243dcbd446dd1e6f63825de8e3

Comment by Alexander Golin (Inactive) [ 27/Apr/20 ]

Moving to blocked - waiting for resolution from server on AWS timeout issues. 

Comment by Githook User [ 18/Apr/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-3566 retry KMS request with schannel
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/0fe158df222be450f410e53599188a3a7abfb4ad

Comment by Kevin Albertson [ 16/Apr/20 ]

Some more observations:

  • Retrying once appears to fix the test: https://evergreen.mongodb.com/version/5e97daba57e85a701a2ead1f
    This is consistent with the finding that this is not specific to the "/client_side_encryption/basic" test, but rather (when a failure does occur) the first test to reach out to KMS fails and the rest succeed.
  • A patch build with even more verbose logging indicates the culprit is a socket read taking 30 seconds (but it is not timing out, since the socket timeout is 5 minutes). Here are relevant logs:

    [2020/04/16 04:11:56.471] 2020/04/16 05:11:56.0464: [ 4048]:    DEBUG: stream-tls-secure-channel: Getting ready for state: 3, timeout is 300000
     [2020/04/16 04:11:56.471] 2020/04/16 05:11:56.0464: [ 4048]:    DEBUG: stream-secure-channel: SSL/TLS connection with endpoint (step 2/3)
     [2020/04/16 04:11:56.471] 2020/04/16 05:11:56.0464: [ 4048]:    DEBUG: stream-secure-channel: read - begin
     [2020/04/16 04:11:56.471] 2020/04/16 05:11:56.0464: [ 4048]:    DEBUG: stream-secure-channel: Wanting to read: 17408, timeout is 300000
     [2020/04/16 04:12:17.818] 2020/04/16 05:12:17.0817: [ 4048]:    ERROR: stream-secure-channel: 2. Failed to initialize security context, error code: 0x00090317: The operation completed successfully.
     [2020/04/16 04:12:17.818] 2020/04/16 05:11:56.0464: [ 4048]:    DEBUG: stream-secure-channel: Got 347
     [2020/04/16 04:12:17.818] 2020/04/16 05:11:56.0464: [ 4048]:    DEBUG: stream-secure-channel: read - end
     [2020/04/16 04:12:17.818] 2020/04/16 05:11:56.0464: [ 4048]:    DEBUG: stream-secure-channel: encrypted data buffer: offset 347 length 17408 

     And the subsequent call to InitializeSecurityContext returns SEC_I_CONTEXT_EXPIRED.

  • Much of the secure channel code is a direct copy of libcurl. It may be worth scrutinizing the differences between our implementation. The work to make TLS non-blocking may be relevant. The comment on that commit sounds similar enough:

    This change also avoids what appears to be a potential timeout when
    reading a small chunk of data from the server after reading a large one:
    if the receive buffer had grown large enough to receive an entire reply
    but the driver had only read part of it, the driver would subsequently
    try to read from the socket instead of reading from the buffer, even
    though the server had already sent its whole message.

     - SERVER-47373 may be related. Empty TLS packets were not being handled correctly for the server's schannel handshake. It appears these build failures started occurring in March as well, consistent with when our FLE tests started failing.

Next steps:

  • Try to reproduce this in a spawn host (unable to repro on my Windows machine)
  • Try to reproduce this outside of test code, By just creating a TLS connection directly to a KMS server, to narrow down possibilities.
  • Try to reproduce a situation (perhaps with the mock ssl tests) of an empty TLS packet described in SERVER-47373.
  • If I cannot repro, use netsh to get a packet dump in a failing patch builds
Comment by Kevin Albertson [ 09/Apr/20 ]

Appears to be a failure to perform a TLS handshake with Secure Channel. With lots of logging I was able to reproduce with some more information in this task:

https://evergreen.mongodb.com/task/mongo_c_driver_windows_2015_test_4.2_server_auth_sasl_winssl_cse_patch_c899fc3d233741a6b473fb7d31fd2f878851fb74_5e8f705f850e6157144cff06_20_04_09_18_58_40

Relevant logs copied here:

 
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0649: [ 1972]:    DEBUG: client-side-encryption: KMS connecting to kms.us-east-1.amazonaws.com
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: client-side-encryption: TCP connection successful, used timeout of 300000
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: stream-tls-secure-channel: SSL/TLS connection with endpoint AcquireCredentialsHandle
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: stream-tls-secure-channel: enabled server certificate checks
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: client-side-encryption: TCP performing handshake
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG:   stream-tls: handshake step...
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: stream-tls-secure-channel: Getting ready for state: 1, timeout is 300000
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: stream-secure-channel: SSL/TLS connection with 'kms.us-east-1.amazonaws.com' (step 1/3)
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: stream-secure-channel: sending initial handshake data: sending 196 bytes...
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: stream-secure-channel: Wanting to write: 196
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: stream-secure-channel: Wrote: 196
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: stream-secure-channel: sent initial handshake data: sent 196 bytes
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG: stream-tls-secure-channel: Step#1 Worked!
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0662: [ 1972]:    DEBUG:   stream-tls: poll begin
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0663: [ 1972]:    DEBUG:   stream-tls: poll end
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0663: [ 1972]:    DEBUG:   stream-tls: handshake step...
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0663: [ 1972]:    DEBUG: stream-tls-secure-channel: Getting ready for state: 2, timeout is 300000
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0663: [ 1972]:    DEBUG: stream-secure-channel: SSL/TLS connection with endpoint (step 2/3)
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0663: [ 1972]:    DEBUG: stream-secure-channel: Wanting to read: 17408, timeout is 300000
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0663: [ 1972]:    DEBUG: stream-secure-channel: Got 5209
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0663: [ 1972]:    DEBUG: stream-secure-channel: encrypted data buffer: offset 5209 length 17408
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0663: [ 1972]:    DEBUG:   stream-tls: poll begin
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0665: [ 1972]:    DEBUG:   stream-tls: poll end
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0665: [ 1972]:    DEBUG:   stream-tls: handshake step...
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0665: [ 1972]:    DEBUG: stream-tls-secure-channel: Getting ready for state: 3, timeout is 300000
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0665: [ 1972]:    DEBUG: stream-secure-channel: SSL/TLS connection with endpoint (step 2/3)
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0665: [ 1972]:    DEBUG: stream-secure-channel: Wanting to read: 17408, timeout is 300000
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0665: [ 1972]:    DEBUG: stream-secure-channel: Got 347
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:18.0665: [ 1972]:    DEBUG: stream-secure-channel: encrypted data buffer: offset 347 length 17408
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-secure-channel: sending next handshake data: sending 182 bytes...
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-secure-channel: Wanting to write: 182
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-secure-channel: Wrote: 182
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG:   stream-tls: poll begin
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG:   stream-tls: poll end
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG:   stream-tls: handshake step...
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-tls-secure-channel: Getting ready for state: 3, timeout is 275384
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-secure-channel: SSL/TLS connection with endpoint (step 2/3)
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-secure-channel: Wanting to read: 17408, timeout is 275384
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-secure-channel: Got 7
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-secure-channel: encrypted data buffer: offset 7 length 17408
 [2020/04/09 20:27:43.277] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-tls-secure-channel: Step#2 FAILED!
 [2020/04/09 20:27:43.278] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: client-side-encryption: Handshake failed: TLS handshake failed
 [2020/04/09 20:27:43.278] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-tls-secure-channel: shutting down SSL/TLS connection
 [2020/04/09 20:27:43.278] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-secure-channel: Wanting to write: 85
 [2020/04/09 20:27:43.278] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-secure-channel: Wrote: 85
 [2020/04/09 20:27:43.278] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-tls-secure-channel: clear security context handle
 [2020/04/09 20:27:43.278] 2020/04/09 21:27:43.0277: [ 1972]:    DEBUG: stream-tls-secure-channel: clear credential handle 

And later, a log to stderr:

[2020/04/09 20:27:43.289] 2020/04/09 21:27:43.0277: [ 1972]:    ERROR: stream-secure-channel: 2. Failed to initialize security context, error code: 0x00090317: The operation completed successfully.
[2020/04/09 20:27:43.289] 2020/04/09 21:27:43.0277: [ 1972]:    ERROR:       mongoc: operation { "name" : "insertOne", "arguments" : { "document" : { "_id" : 1, "encrypted_string" : "string0" } } } failed with error: TLS handshake failed 

Notice the 30 second jump in step two of the handshake. Even when the test passes, there appears to be a large jump in time in between the two lines. (example of passing).

Looking at where that jump is, my hypothesis is that this InitializeSecurityContext is what is taking so long. But I don't yet know why.

The Windows error lookup tool shows 0x00090317 corresponds to SEC_I_CONTEXT_EXPIRED.

Generated at Wed Feb 07 21:18:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.