[SERVER-56217] PoolForHost::_maxInUse is init to <int>::max and cannot be changed, and when connections are maxed DBConnectionPool::get() always throws Created: 20/Apr/21 Updated: 29/Oct/23 Resolved: 23/Apr/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.0.24 |
| Fix Version/s: | 4.0.25 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Andrew Shuvalov (Inactive) | Assignee: | Andrew Shuvalov (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Sharding 2021-05-03 |
| Participants: | |
| Description |
|
PoolForHost::_maxInUse is initialized to std::numeric_limits<int>::max() and cannot be changed, so once the connection count reaches DBConnectionPool::_maxInUse, DBConnectionPool::get() always throws immediately. To understand this bug, note first that there are two distinct fields: PoolForHost::_maxInUse and DBConnectionPool::_maxInUse. They are not synchronized in any way. DBConnectionPool::_maxInUse can be set with the public method setMaxInUse(). PoolForHost::_maxInUse is initialized in the constructor to std::numeric_limits<int>::max() and no code path ever changes it; it remains INT_MAX for every host for the lifetime of the pool. PoolForHost::_maxInUse is used only in waitForFreeConnection(), which is invoked from DBConnectionPool::Detail::get() when DBConnectionPool::_maxInUse is exhausted. However, this code fragment in waitForFreeConnection() always throws:
because the condition "numInUse() < _maxInUse" is always true, since _maxInUse is INT_MAX. Thus _cv.wait_for() never blocks, because the predicate is already satisfied when it is invoked. The assertion "!_cv.wait_for(...)" then always throws, because wait_for() returns true and its negation is asserted. As a result of this bug, once the connection count reaches the limit, DBConnectionPool::get() always throws immediately. The code actually expects get() to block for the specified timeout; throwing without blocking can create very tight loops further up the stack if the caller retries whatever it needs the connection for. The proposed fix actually contains 4 bugfixes in one change (CR pending):
1. PoolForHost::_maxInUse is removed, as there is a similar field in DBConnectionPool and it was never set properly.
2. PoolKey::timeout is changed from double to int64. A double is a dangerous container key: lookups may become non-deterministic because of floating-point precision errors.
3. In PoolForHost::waitForFreeConnection(), "!_cv.wait_for(...)" is changed to "_cv.wait_for(...)". Inverting the wait return value in the assertion is a bug: on timeout, wait_for() returns false, and that is the value that should be asserted, without inversion.
4. DBConnectionPool::decrementEgress() must notify waiters. This is |
| Comments |
| Comment by Githook User [ 26/Apr/21 ] |
|
Author: {'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'} Message: |
| Comment by Githook User [ 26/Apr/21 ] |
|
Author: {'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'} Message: |
| Comment by Githook User [ 26/Apr/21 ] |
|
Author: {'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'} Message: |
| Comment by Andrew Shuvalov (Inactive) [ 23/Apr/21 ] |
|
bruce.lucas, we use two different connection pool implementations: src/mongo/client/connpool.h (the older one) and src/mongo/executor/connection_pool.h. I was surprised to find that usage of the old one is limited to: 1. src/mongo/client/global_conn_pool.cpp "globalConnPool". Thus the errors I am fixing are not expected to affect production at all, which is why nobody fixed them in 5 years. However, I think DBConnectionPool should be either deprecated or fixed; keeping definitely broken code is dangerous, as one unrelated change can expose the bugs. There is no TODO for anyone else except me. |
| Comment by Githook User [ 23/Apr/21 ] |
|
Author: {'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'} Message: |
| Comment by Andrew Shuvalov (Inactive) [ 21/Apr/21 ] |
|
bruce.lucas please allow me several days to answer this. In short, client requests should time out immediately when the connection pool at mongos is full; however, I do not know exactly how this will show up in the logs before targeting it in a repro. I'm working on all the code fixes accumulated for the HELP ticket, and only after that I'll get back to stress tests. |
| Comment by Bruce Lucas (Inactive) [ 21/Apr/21 ] |
|
andrew.shuvalov thanks for finding this. Can you add to the description the signature that would be seen in the logs - exception message, invariant, stack trace, or whatever? This will help us find this issue if we encounter it in the future. |
| Comment by Andrew Shuvalov (Inactive) [ 21/Apr/21 ] |
|
Filed SERVER-56229 to port this to head, will file backports from 5.0 when it's done. |