[SERVER-56217] PoolForHost::_maxInUse is init to <int>::max and cannot be changed, and when connections are maxed DBConnectionPool::get() always throws Created: 20/Apr/21  Updated: 29/Oct/23  Resolved: 23/Apr/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.24
Fix Version/s: 4.0.25

Type: Bug Priority: Critical - P2
Reporter: Andrew Shuvalov (Inactive) Assignee: Andrew Shuvalov (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-56229 Forward-port SERVER-56217 bug in conn... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2021-05-03
Participants:

 Description   

PoolForHost::_maxInUse is initialized to std::numeric_limits<int>::max() and cannot be changed, and once the connection count reaches the limit configured via DBConnectionPool::_maxInUse, DBConnectionPool::get() always throws immediately.

To understand this bug, first note that there are two distinct fields: PoolForHost::_maxInUse and DBConnectionPool::_maxInUse. They are not synchronized in any way.

DBConnectionPool::_maxInUse can be set with the public method setMaxInUse(). PoolForHost::_maxInUse is initialized in the constructor to std::numeric_limits<int>::max(), and no code path ever changes it; it remains INT_MAX for every host for the lifetime of the pool.
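A simplified sketch of the relevant declarations (assuming the 4.0 layout of src/mongo/client/connpool.h; only the members discussed above are shown, and the default value of DBConnectionPool::_maxInUse is illustrative):

    // Sketch only; the real classes live in src/mongo/client/connpool.h.
    #include <limits>

    class PoolForHost {
    public:
        PoolForHost() : _maxInUse(std::numeric_limits<int>::max()) {}  // never changed afterwards

    private:
        int _maxInUse;  // consulted only by waitForFreeConnection()
    };

    class DBConnectionPool {
    public:
        void setMaxInUse(int n) {
            _maxInUse = n;  // updates only the pool-wide field, never PoolForHost::_maxInUse
        }

    private:
        int _maxInUse = std::numeric_limits<int>::max();  // checked in get(); default illustrative
    };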

PoolForHost::_maxInUse is used in exactly one place, waitForFreeConnection(), which DBConnectionPool::Detail::get() invokes when DBConnectionPool::_maxInUse is exhausted.

However, this code fragment in waitForFreeConnection() always throws:

    auto condition = [&] { return (numInUse() < _maxInUse || _inShutdown.load()); };
 
    if (timeout > 0) {
        stdx::chrono::seconds timeoutSeconds{timeout};
 
        // If we timed out waiting without getting a new connection, throw.
        uassert(ErrorCodes::ExceededTimeLimit,
                str::stream() << "too many connections to " << _hostName << ":" << timeout,
                !_cv.wait_for(lk, timeoutSeconds, condition));

because the condition "numInUse() < _maxInUse" is always true: _maxInUse is INT_MAX. Consequently _cv.wait_for() never blocks, since the predicate is already satisfied when it is invoked, and it returns true. The assertion is on "!_cv.wait_for(...)", so the negated return value is false and the uassert always throws.
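A minimal standalone illustration of the std::condition_variable semantics involved (hypothetical example, not MongoDB code): with a predicate, wait_for() returns the predicate's value, so when the predicate is always true it returns true immediately, and asserting its negation throws precisely when a free connection is in fact available.

    #include <chrono>
    #include <condition_variable>
    #include <iostream>
    #include <mutex>

    int main() {
        std::mutex m;
        std::condition_variable cv;
        std::unique_lock<std::mutex> lk(m);

        // Plays the role of "numInUse() < _maxInUse"; with _maxInUse == INT_MAX
        // it is always true.
        auto condition = [] { return true; };

        // Does not block: the predicate already holds, so wait_for() returns true.
        bool result = cv.wait_for(lk, std::chrono::seconds(5), condition);

        // The buggy code asserts the negation, i.e. uassert(..., !result),
        // which therefore always throws here.
        std::cout << std::boolalpha << "wait_for returned " << result
                  << ", negation asserted by the buggy code: " << !result << std::endl;
        return 0;
    }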

As a result of this bug, once the connection count reaches the limit, DBConnectionPool::get() always throws immediately. The code clearly intends to block for the specified timeout; throwing without blocking can produce very tight loops somewhere up the stack if the caller retries whatever operation it needed the connection for.

The proposed fix actually contains four bugfixes in one change (CR pending):

1. PoolForHost::_maxInUse is removed, as DBConnectionPool has a similar field and the PoolForHost one was never set properly.

2. PoolKey::timeout is changed from double to int64. A double is a dangerous container key: lookups can become non-deterministic because of floating-point precision errors.

3. In PoolForHost::waitForFreeConnection(), "!_cv.wait_for(...)" is changed to "_cv.wait_for(...)". Inverting the wait return value in the assertion is a bug: on timeout, wait_for() returns false, and that value should be asserted without inversion (see the sketch after this list).

4. DBConnectionPool::decrementEgress() must notify waiters. This is SERVER-56147: the CV is never notified if the connection is released via this method (also covered in the sketch below).
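The sketch below illustrates what fixes 3 and 4 amount to. The identifiers are taken from the fragment quoted above; everything else is an assumption, and the authoritative change is in the commits linked in the comments.

    // Fix 3 (sketch): assert the wait result directly. With a predicate,
    // _cv.wait_for() returns false only when the timeout expires while the
    // pool is still full -- exactly the case in which get() should throw.
    if (timeout > 0) {
        stdx::chrono::seconds timeoutSeconds{timeout};

        uassert(ErrorCodes::ExceededTimeLimit,
                str::stream() << "too many connections to " << _hostName << ":" << timeout,
                _cv.wait_for(lk, timeoutSeconds, condition));  // negation removed
    }

    // Fix 4 (sketch): when a connection is released through
    // DBConnectionPool::decrementEgress(), notify the condition variable so
    // that threads blocked in waitForFreeConnection() wake up and use the
    // freed capacity instead of sleeping until the timeout expires.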



 Comments   
Comment by Githook User [ 26/Apr/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-56217 SERVER-56147: maxed out connection pool should respect get() timeout and CV should be notified on connection release
Branch: v4.0
https://github.com/mongodb/mongo/commit/3d8730be2c747dfe869614e215808ceddafa8516

Comment by Githook User [ 26/Apr/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-56217 SERVER-56147: maxed out connection pool should respect get() timeout and CV should be notified on connection release
Branch: v4.0
https://github.com/mongodb/mongo/commit/1d5cbc113d4318b79831113790c3abd09ba25098

Comment by Githook User [ 26/Apr/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-56217 SERVER-56147: maxed out connection pool should respect get() timeout and CV should be notified on connection release
Branch: v4.0
https://github.com/mongodb/mongo/commit/5c8ec91022cd185490fb960a5a7fdaf1c58f94d7

Comment by Andrew Shuvalov (Inactive) [ 23/Apr/21 ]

bruce.lucas, since we use two different connection pool implementations, src/mongo/client/connpool.h (the older one) and src/mongo/executor/connection_pool.h, I was surprised to find that use of the old one is limited to:

1. src/mongo/client/global_conn_pool.cpp "globalConnPool"
2. src/mongo/s/client/shard_connection.cpp "shardConnectionPool"

Thus the errors I am fixing are not expected to affect production at all, which is why nobody fixed them in 5 years. However, I think DBConnectionPool should be either deprecated or fixed; keeping definitely broken code is dangerous, as one unrelated change can expose the bugs. There is no TODO for anyone other than me.

Comment by Githook User [ 23/Apr/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-56217: PoolForHost::_maxInUse is init to <int>::max and cannot be changed, and when connections are maxed DBConnectionPool::get() always throws
Branch: v4.0
https://github.com/mongodb/mongo/commit/7d39fbc97a721752fce1c94f37ad4dea3dd5307c

Comment by Andrew Shuvalov (Inactive) [ 21/Apr/21 ]

bruce.lucas, please allow me several days to answer this. In short, client requests should time out immediately when the connection pool at mongos is full; however, I do not know exactly how it will show up in the logs before targeting this in a repro. I'm working on all the code fixes accumulated for the HELP ticket, and only after that will I get back to stress tests.

Comment by Bruce Lucas (Inactive) [ 21/Apr/21 ]

andrew.shuvalov thanks for finding this. Can you add to the description the signature that would be seen in the logs - exception message, invariant, stack trace, or whatever? This will help us find this issue if we encounter it in the future.

Comment by Andrew Shuvalov (Inactive) [ 21/Apr/21 ]

Filed SERVER-56229 to port this to head; will file backports from 5.0 when it's done.
