[SERVER-34120] scoped connection not being returned to the pool Created: 24/Mar/18  Updated: 08/Jan/24  Resolved: 05/Sep/18

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 3.6.3
Fix Version/s: 3.6.8, 4.0.3, 4.1.3

Type: Bug Priority: Minor - P4
Reporter: Mouhcine Assignee: Mira Carey
Resolution: Fixed Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File primary.txt     Text File secondary-1.txt     Text File secondary-2.txt    
Issue Links:
Backports
Depends
depends on SERVER-36332 CursorNotFound error in GetMore on a ... Closed
Duplicate
is duplicated by SERVER-34124 scoped connection not being returned ... Closed
is duplicated by SERVER-36923 Scoped connections not being returned... Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: Platforms 2018-07-02, Platforms 2018-07-16, Platforms 2018-07-30, Platforms 2018-08-13, Platforms 2018-09-10
Participants:
Case:

 Description   

I have a replica set with two servers, and since upgrading to 3.6.3 I have been getting this "scoped connection not being returned to the pool" message every 5 minutes...

My servers are set up with NTP synchronization and I am sure their clocks are correctly in sync.

Any idea what could be causing this? Thank you.

2018-03-24T18:14:17.470+0000 I NETWORK  [thread7] Starting new replica set monitor for REP-1/server-1:27017,server-2:27017
2018-03-24T18:14:17.476+0000 I NETWORK  [thread7] Successfully connected to server-1:27017 (9 connections now open to server-1:27017 with a 0 second timeout)
2018-03-24T18:14:17.478+0000 I NETWORK  [thread7] scoped connection to server-1:27017 not being returned to the pool
2018-03-24T18:14:17.478+0000 I NETWORK  [thread7] Starting new replica set monitor for REP-1/server-1:27017,server-2:27017
2018-03-24T18:14:17.483+0000 I NETWORK  [thread7] Successfully connected to server-1:27017 (10 connections now open to server-1:27017 with a 0 second timeout)
2018-03-24T18:14:17.485+0000 I NETWORK  [thread7] scoped connection to server-1:27017 not being returned to the pool
2018-03-24T18:14:17.486+0000 I NETWORK  [thread7] Starting new replica set monitor for REP-1/server-1:27017,server-2:27017
2018-03-24T18:14:17.490+0000 I NETWORK  [thread7] Successfully connected to server-1:27017 (11 connections now open to server-1:27017 with a 0 second timeout)
2018-03-24T18:14:17.493+0000 I NETWORK  [thread7] scoped connection to server-1:27017 not being returned to the pool
2018-03-24T18:14:17.493+0000 I NETWORK  [thread7] Starting new replica set monitor for REP-1/server-1:27017,server-2:27017
2018-03-24T18:14:17.497+0000 I NETWORK  [thread7] Successfully connected to server-1:27017 (12 connections now open to server-1:27017 with a 0 second timeout)
2018-03-24T18:14:17.500+0000 I NETWORK  [thread7] scoped connection to server-1:27017 not being returned to the pool
2018-03-24T18:19:17.470+0000 I NETWORK  [thread7] Starting new replica set monitor for REP-1/server-1:27017,server-2:27017
2018-03-24T18:19:17.476+0000 I NETWORK  [thread7] Successfully connected to server-1:27017 (13 connections now open to server-1:27017 with a 0 second timeout)
2018-03-24T18:19:17.478+0000 I NETWORK  [thread7] scoped connection to server-1:27017 not being returned to the pool
2018-03-24T18:19:17.478+0000 I NETWORK  [thread7] Starting new replica set monitor for REP-1/server-1:27017,server-2:27017
2018-03-24T18:19:17.483+0000 I NETWORK  [thread7] Successfully connected to server-1:27017 (14 connections now open to server-1:27017 with a 0 second timeout)
2018-03-24T18:19:17.485+0000 I NETWORK  [thread7] scoped connection to server-1:27017 not being returned to the pool
2018-03-24T18:19:17.486+0000 I NETWORK  [thread7] Starting new replica set monitor for REP-1/server-1:27017,server-2:27017
2018-03-24T18:19:17.490+0000 I NETWORK  [thread7] Successfully connected to server-1:27017 (15 connections now open to server-1:27017 with a 0 second timeout)
2018-03-24T18:19:17.493+0000 I NETWORK  [thread7] scoped connection to server-1:27017 not being returned to the pool
2018-03-24T18:19:17.493+0000 I NETWORK  [thread7] Starting new replica set monitor for REP-1/server-1:27017,server-2:27017
2018-03-24T18:19:17.498+0000 I NETWORK  [thread7] Successfully connected to server-1:27017 (16 connections now open to server-1:27017 with a 0 second timeout)
2018-03-24T18:19:17.500+0000 I NETWORK  [thread7] scoped connection to server-1:27017 not being returned to the pool



 Comments   
Comment by Githook User [ 05/Sep/18 ]

Author:

{'name': 'Jason Carey', 'email': 'jcarey@argv.me', 'username': 'hanumantmk'}

Message: SERVER-34120 call done() in SessionsCollectionRS

We make connections from the secondaries to the primary that we fail to
call done() on. This causes those connections to fall out of the pool,
reducing connection pooling and producing worrisome log messages.

(cherry picked from commit fac4037c57431e3318e04cc850b9ea391ff5f19c)
Branch: v4.0
https://github.com/mongodb/mongo/commit/46f73741d70dea240c5837c5d61edce729d38805

Comment by Githook User [ 05/Sep/18 ]

Author:

{'name': 'Jason Carey', 'email': 'jcarey@argv.me', 'username': 'hanumantmk'}

Message: SERVER-34120 call done() in SessionsCollectionRS

We make connections from the secondaries to the primary that we fail to
call done() on. This causes those connections to fall out of the pool,
reducing connection pooling and producing worrisome log messages.

(cherry picked from commit fac4037c57431e3318e04cc850b9ea391ff5f19c)
Branch: v3.6
https://github.com/mongodb/mongo/commit/baadc3aea5a6b178b7ef213c161c6688c78777ba

Comment by Githook User [ 05/Sep/18 ]

Author:

{'name': 'Jason Carey', 'email': 'jcarey@argv.me', 'username': 'hanumantmk'}

Message: SERVER-34120 call done() in SessionsCollectionRS

We make connections from the secondaries to the primary that we fail to
call done() on. This causes those connections to fall out of the pool,
reducing connection pooling and producing worrisome log messages.
Branch: master
https://github.com/mongodb/mongo/commit/fac4037c57431e3318e04cc850b9ea391ff5f19c

Comment by Mira Carey [ 28/Aug/18 ]

After looking at the solution we pursued in SERVER-36332, it looks like that won't be the fix for this issue. Bringing this back in to address it directly.

Comment by Mira Carey [ 17/Aug/18 ]

As part of the fix for SERVER-36332, we're intending to remove the code that's failing in this ticket (as well as making that codepath more robust in general). Rather than pursuing this smaller fix, I'll close this out as a duplicate when that ticket lands (after verifying that it does indeed solve the problem here).

Comment by RANJEET [ 02/Aug/18 ]

I am using MongoDB version 3.6.5 and getting similar messages:

2018-08-02T09:55:39.808-0400 I NETWORK [thread64] Starting new replica set monitor for ReplicaSet/host1:27017,host2:27017,host3:27017
2018-08-02T09:55:39.820-0400 I NETWORK [thread64] Successfully connected to host3:27017 (10333 connections now open to host3:27017 with a 0 second timeout)
2018-08-02T09:55:39.823-0400 I NETWORK [thread64] scoped connection to host3:27017 not being returned to the pool
2018-08-02T09:55:39.823-0400 I NETWORK [thread64] Starting new replica set monitor for ReplicaSet/host1:27017,host2:27017,host3:27017
2018-08-02T09:55:39.844-0400 I NETWORK [thread64] Successfully connected to host3:27017 (10334 connections now open to host3:27017 with a 0 second timeout)
2018-08-02T09:55:39.846-0400 I NETWORK [thread64] scoped connection to host3:27017 not being returned to the pool
2018-08-02T09:55:39.846-0400 I NETWORK [thread64] Starting new replica set monitor for ReplicaSet/host1:27017,host2:27017,host3:27017
2018-08-02T09:55:39.865-0400 I NETWORK [thread64] Successfully connected to host3:27017 (10335 connections now open to host3:27017 with a 0 second timeout)
2018-08-02T09:55:39.867-0400 I NETWORK [thread64] scoped connection to host3:27017 not being returned to the pool
2018-08-02T09:55:39.869-0400 I NETWORK [thread64] Starting new replica set monitor for ReplicaSet/host1:27017,host2:27017,host3:27017
2018-08-02T09:55:39.878-0400 I NETWORK [thread64] Successfully connected to host3:27017 (10336 connections now open to host3:27017 with a 0 second timeout)
2018-08-02T09:55:39.880-0400 I NETWORK [thread64] scoped connection to host3:27017 not being returned to the pool

Comment by Madhu N [ 02/Aug/18 ]

I have the same issue in my environment.

I stopped the services, added an additional disk, copied the files to the new disk, and restarted the services with the new path.

Some time after starting, the server shuts down.

Logs here:

Thu Aug 2 11:38:57.257 I NETWORK [thread10] Primary-one:27017 (88 connections now open to primary-one:27017 with a 0 second timeout)
Thu Aug 2 11:38:57.344 I NETWORK [thread10] scoped connection to primary-one:27017 not being returned to the pool
Thu Aug 2 11:40:32.577 I CONTROL [signalProcessingThread] got signal 1 (Hangup), will terminate after current cmd ends
Thu Aug 2 11:40:32.578 I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
Thu Aug 2 11:40:32.578 I NETWORK [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
Thu Aug 2 11:40:32.579 I REPL [signalProcessingThread] shutdown: removing all drop-pending collections...
Thu Aug 2 11:40:32.579 I REPL [signalProcessingThread] shutdown: removing checkpointTimestamp collection...
Thu Aug 2 11:40:32.579 I REPL [signalProcessingThread] shutting down replication subsystems
Thu Aug 2 11:40:32.614 I FTDC [signalProcessingThread] Shutting down full-time diagnostic data capture
Thu Aug 2 11:40:32.617 I STORAGE [signalProcessingThread] WiredTigerKVEngine shutting down
Thu Aug 2 11:40:33.269 I STORAGE [signalProcessingThread] WiredTiger message [1533202833:269028][16758:0x7ff0f7523700], txn-recover: Main recovery loop: starting at 14/21376
Thu Aug 2 11:40:33.521 I STORAGE [signalProcessingThread] WiredTiger message [1533202833:521737][16758:0x7ff0f7523700], txn-recover: Recovering log 14 through 15
Thu Aug 2 11:40:33.656 I STORAGE [signalProcessingThread] WiredTiger message [1533202833:656274][16758:0x7ff0f7523700], txn-recover: Recovering log 15 through 15
Thu Aug 2 11:40:33.733 I STORAGE [signalProcessingThread] WiredTiger message [1533202833:733967][16758:0x7ff0f7523700], txn-recover: Set global recovery timestamp: 0
Thu Aug 2 11:40:34.308 I NETWORK [conn6] end connection server:65304 (6 connections now open)
Thu Aug 2 11:40:34.677 I STORAGE [signalProcessingThread] shutdown: removing fs lock...
Thu Aug 2 11:40:34.677 I CONTROL [signalProcessingThread] now exiting
Thu Aug 2 11:40:34.677 I CONTROL [signalProcessingThread] shutting down with code:0

Has any fix been found for this?

Best Regards

Comment by Mira Carey [ 21/Jun/18 ]

After looking a bit, I believe this is actually related to some of the Logical Sessions code introduced in 3.6.

Following the repro (spinning up a minimal 3 node replset) and watching the logs, threads with the name "threadN" pop up with similar warnings. Attaching gdb showed these to be PeriodicRunnerASIO threads running background tasks for the logical sessions refresh.

The relevant code lives at: sessions_collection_rs.cpp

template <typename Callback>
auto sendToPrimary(OperationContext* opCtx, Callback callback)
    -> decltype(std::declval<Callback>()(static_cast<DBClientBase*>(nullptr))) {
    boost::optional<ScopedDbConnection> conn;
    auto res = makePrimaryConnection(opCtx, &conn);
    if (!res.isOK()) {
        return res;
    }
 
    return callback(conn->get());
}

The mistake we're making is not calling done() on the connection before letting the ScopedDbConnection go out of scope. When connections are returned that way, we don't know their status, so we have to toss them. Calling done() fixes the problem and lets these operations use pooled connections.
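To illustrate, here is a minimal sketch of the kind of change described above, based on the snippet quoted in this comment rather than on the actual committed patch (error handling if the callback throws is omitted):

template <typename Callback>
auto sendToPrimary(OperationContext* opCtx, Callback callback)
    -> decltype(std::declval<Callback>()(static_cast<DBClientBase*>(nullptr))) {
    boost::optional<ScopedDbConnection> conn;
    auto res = makePrimaryConnection(opCtx, &conn);
    if (!res.isOK()) {
        return res;
    }

    auto result = callback(conn->get());

    // Mark the scoped connection as successfully used so it is returned
    // to the pool instead of being discarded when it goes out of scope.
    conn->done();

    return result;
}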


I don't believe this is a critical operational issue. That codepath does clean up those connections (we don't leak them); we just lose track of some bookkeeping. Definitely worth fixing, but probably not a cause of production instability.

Comment by Erik Wramner [ 16/May/18 ]

The servers haven't crashed since I wrote that comment. I will create a new ticket on the next crash. I upgraded all packages (yum upgrade) and rebooted the servers last time, so perhaps that helped. However, the noise in the logs about connections not being returned to the pool, with ever-increasing numbers of connections "now open", is still there.

Comment by Kelsey Schubert [ 11/May/18 ]

erik.wramner, would you please open a new SERVER ticket for the issue that you describe? It would be helpful for us to see the mongod logs and an archive of diagnostic.data directory of the mongod that is crashing.

Thank you,
Kelsey

Comment by Erik Wramner [ 11/May/18 ]

We are getting this on 3.6.4-1 on CentOS 7 kernel 3.10.0-693.21.1.el7.x86_64 in Azure. As far as I can determine we had it with 3.6.2 as well. Recently we have also started seeing regular server crashes where the mongod process dies, without any clear reason, at least daily. I don't know if it is related in any way, but apart from the sudden death this is all I can find in the logs. We have three servers in the replica set and the logs look just like the ones above. I would be very interested in getting this fixed.

Comment by Mouhcine [ 26/Mar/18 ]

Hello Ramon,
I just uploaded the logs for the primary and the two secondaries. I upgraded from version 3.6.2 to 3.6.3, and this never happened before in my system tests.
primary.txt secondary-1.txt secondary-2.txt Thank you for your time and help.
Note: MongoDB is running on Ubuntu 16.04 servers on AWS.
Mouhcine

Comment by Ramon Fernandez Marina [ 26/Mar/18 ]

mmakroume, these messages are informational; they indicate that a connection cannot be reused and is being terminated. What version did you upgrade from?

Also it would be good to see more details from both nodes, to see if they provide a clue as of to why this is happening. Can you please upload logs for both nodes in your replica set spanning enough time to show multiple incidents of these messages?

Thanks,
Ramón.
