[SERVER-30643] Performance regression with SSL Created: 14/Aug/17  Updated: 30/Oct/23  Resolved: 22/Aug/17

Status: Closed
Project: Core Server
Component/s: Networking, Security
Affects Version/s: 3.4.4
Fix Version/s: 3.4.9, 3.5.13

Type: Bug Priority: Critical - P2
Reporter: Bruce Lucas (Inactive) Assignee: Spencer Jackson
Resolution: Fixed Votes: 0
Labels: 3.4-SSL
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
Related
related to SERVER-38945 SSL performance regression Closed
related to SERVER-30195 Increase connection pool timeout for ... Closed
is related to SERVER-30628 Timeouts between 'mongos' and replica... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4
Sprint: Platforms 2017-08-21, Platforms 2017-09-11
Participants:
Case:
Linked BF Score: 0

 Description   

Create a collection with a single 1 kB document; start 1000 threads doing a findOne() on that collection. Observed query rate ~55 k/s in 3.4.3, drops to ~35 k/s in 3.4.4.



 Comments   
Comment by Githook User [ 22/Aug/17 ]

Author:

{'username': 'spencerjackson', 'email': 'spencer.jackson@mongodb.com', 'name': 'Spencer Jackson'}

Message: SERVER-30643: Ensure thread IDs observed by OpenSSL are uniformly distributed

(cherry picked from commit d40503d57d9fd27d1ad59e4d74aa827191d77564)
Branch: v3.4
https://github.com/mongodb/mongo/commit/9747381e76070df44d824b037e67a0bfc3472502

Comment by Githook User [ 22/Aug/17 ]

Author:

{'username': 'spencerjackson', 'email': 'spencer.jackson@mongodb.com', 'name': 'Spencer Jackson'}

Message: SERVER-30643: Ensure thread IDs observed by OpenSSL are uniformly distributed
Branch: master
https://github.com/mongodb/mongo/commit/d40503d57d9fd27d1ad59e4d74aa827191d77564

Comment by Bruce Lucas (Inactive) [ 16/Aug/17 ]

The following change, partially reverting SERVER-28530, fixes the performance regression; tbd why.

diff --git a/src/mongo/util/net/ssl_manager.cpp b/src/mongo/util/net/ssl_manager.cpp
index da8af7e..b26d958 100644
--- a/src/mongo/util/net/ssl_manager.cpp
+++ b/src/mongo/util/net/ssl_manager.cpp
@@ -139,6 +139,10 @@ IMPLEMENT_ASN1_ENCODE_FUNCTIONS_const_fname(ASN1_SEQUENCE_ANY, ASN1_SET_ANY, ASN
  * terminated, it must call ERR_remove_state or ERR_remove_thread_state. These functions may
  * themselves invoke the application provided callback.
  */
+
+AtomicUInt32 _next;
+MONGO_TRIVIALLY_CONSTRUCTIBLE_THREAD_LOCAL int _my_id = _next.fetchAndAdd(1);
+
 class SSLThreadInfo {
 public:
     static unsigned long getID() {
@@ -147,10 +151,13 @@ public:
 #ifdef _WIN32
         return GetCurrentThreadId();
 #else
+        /*
         static_assert(sizeof(void*) == sizeof(unsigned long),
                       "OpenSSL needs the address of a thread-unique object to be castable to"
                       "unsigned long");
         return reinterpret_cast<unsigned long>(&errno);
+        */
+        return _my_id;
 #endif
     }

Comment by Spencer Jackson [ 16/Aug/17 ]

SSL_read is supposed to return SSL_ERROR_WANT_READ on basically every read. When we see it, we have to recv a socket, and dump the contents into our network BIO. When we re-call SSL_read, it pulls that data out of the other half of that BIO. It may additionally return SSL_ERROR_WANT_WRITE, to force us to shuttle back TLS overhead to the client.

Manually stepping across all calls to ERR_get_state during a single query for such a document as described in the description doesn't reveal any behavioral differences between 3.4.3 and 3.4.4.

Comment by Bruce Lucas (Inactive) [ 14/Aug/17 ]

In 3.4.3 most of the threads are waiting here for the next request from a client:

#0  0x00007f4b595a901f in __libc_recv (fd=21, buf=0x7f4b4fc89cd0, n=5, flags=16384) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
#1  0x00005563dc482874 in mongo::(anonymous namespace)::SSLManager::_flushNetworkBIO(mongo::SSLConnection*) [clone .isra.158] ()
#2  0x00005563dc482abf in mongo::(anonymous namespace)::SSLManager::_doneWithSSLOp(mongo::SSLConnection*, int) [clone .isra.159] ()
#3  0x00005563dc482e63 in mongo::(anonymous namespace)::SSLManager::SSL_read(mongo::SSLConnection*, void*, int) ()
#4  0x00005563dc47cec7 in mongo::Socket::recv(char*, int) ()

Whereas in 3.4.4 most are waiting here:

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f60bada3d82 in __GI___pthread_mutex_lock (mutex=0x556803dc9810) at ../nptl/pthread_mutex_lock.c:115
#2  0x0000556800ff5809 in mongo::(anonymous namespace)::SSLThreadInfo::lockingCallback(int, int, char const*, int) ()
#3  0x00007f60bb9458e7 in CRYPTO_add_lock () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#4  0x00007f60bb9ff462 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#5  0x00007f60bb9ffa0c in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6  0x00007f60bba003b4 in ERR_get_state () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7  0x00007f60bba006b6 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8  0x00007f60bbd66170 in SSL_get_error () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#9  0x0000556800ff6104 in mongo::(anonymous namespace)::SSLManager::_doneWithSSLOp(mongo::SSLConnection*, int) [clone .isra.155] ()
#10 0x0000556800ff64c3 in mongo::(anonymous namespace)::SSLManager::SSL_read(mongo::SSLConnection*, void*, int) ()
#11 0x0000556800ff0937 in mongo::Socket::recv(char*, int) ()

Since there are no actual SSL errors logged, does this mean that for some reason in 3.4.4 SSL_read is returning SSL_ERROR_WANT_READ frequently resulting in more calls to SSL_get_error, creating a bottleneck?

Generated at Thu Feb 08 04:24:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.