[SERVER-36923] Scoped connections not being returned from Logical Sessions Cache machinery Created: 29/Aug/18  Updated: 08/Jan/24  Resolved: 05/Sep/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Blake Oler Assignee: Blake Oler
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-34120 scoped connection not being returned ... Closed
Operating System: ALL
Sprint: Sharding 2018-09-10
Participants:

 Description   

Investigation

Scoped connection errors exist in a custom logical session refresh jsCore passthrough suite. This suite sets the logical session refresh time to 100ms.
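
For reference, a sketch of how a suite can shrink this interval at mongod startup, assuming the logicalSessionRefreshMillis server parameter is the knob in play:

    mongod --setParameter logicalSessionRefreshMillis=100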

At the end of each job, secondaries cannot create a connection to the primary because "only one usage of each socket address (protocol/network address/port) is normally permitted."

Throughout the job's lifetime, we see instances of "scoped connection" ... "not being returned to the pool" from all nodes.

This is the ordering of events I was able to put together from the highlighted lines on the Lobster log for one of the jobs.

1. The LogicalSessionCache Refresh thread on the primary starts a replica set monitor on itself (is this normal?)
2. The primary can't find the primary (itself) in the context of the replica set
3. The primary eventually finds the primary in the context of the replica set
4. The primary and both secondaries create scoped connections to the primary that are terminated without being in a fail state (…"scoped connection"… "not being returned to the pool")
5. A client requests a restart of the primary (I'm not sure this actually happens, but why else would the CONTROL startup lines be repeated so late in the test?)
6. The LogicalSessionCache Refresh and Reap threads on the secondaries now get connpool errors attempting to connect to the primary

Problem

We are letting ScopedDBConnections go out of scope before calling done(); the call to done() is what returns the connection to the pool.

Proposed fix

Call done() on the connection before the connection goes out of scope.
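
A minimal sketch of the leak and the fix, assuming the ScopedDBConnection interface from mongo/client/connpool.h (the host string and command are placeholders):

    #include "mongo/client/connpool.h"  // ScopedDBConnection

    void refreshAgainstPrimary(const std::string& primaryHost,
                               const mongo::BSONObj& cmd) {
        mongo::ScopedDBConnection conn(primaryHost);
        mongo::BSONObj res;
        conn->runCommand("admin", cmd, res);
        // Without the next line, conn's destructor discards the underlying
        // connection rather than returning it to the pool, which is what
        // "scoped connection ... not being returned to the pool" reports.
        conn.done();
        // Note: if runCommand throws before done(), the destructor
        // intentionally discards the (possibly broken) connection.
    }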



 Comments   
Comment by Misha Tyulenev [ 31/Aug/18 ]

Good point, blake.oler! Although it does not remove the race: it's still possible to get this error when the reaper and refreshSessions threads each open a connection to the primary.

Comment by Blake Oler [ 31/Aug/18 ]

It turns out the issue was unrelated to races between the reap/refresh threads. Whenever we send a command to an RS primary from SessionsCollectionRS, we drop the reference to the ScopedDBConnection without calling done() on it. Adding the call to done() removes all of the connection pool errors and the end-of-job socket errors.

Whether we should still pursue serializing the reap and refresh threads, misha.tyulenev, is an open question. Let's discuss this tomorrow.

Comment by Mira Carey [ 30/Aug/18 ]

I'm happy with adding a mutex to the sessions collection (i.e. synchronizing in that type, rather than adding logic to the cache)
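
A minimal, self-contained sketch of that approach (names are illustrative; the real SessionsCollection interface differs):

    #include <mutex>

    // Synchronize inside the sessions-collection type itself, rather than
    // adding locking logic to the LogicalSessionCache.
    class SessionsCollection {
    public:
        void refreshSessions() {
            std::lock_guard<std::mutex> lk(_mutex);
            // ... send the refresh command to the primary under the lock ...
        }

        void removeExpiredSessions() {
            std::lock_guard<std::mutex> lk(_mutex);
            // ... send the reap command to the primary under the lock ...
        }

    private:
        // Serializes the Reap and Refresh threads so they never open
        // connections to the primary simultaneously.
        std::mutex _mutex;
    };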

Comment by Misha Tyulenev [ 29/Aug/18 ]

It looks to me like we have a race between the Reap and Refresh threads. We need to synchronize the access, and the error should be gone. I have updated the description with the proposed fix section.

Comment by Blake Oler [ 29/Aug/18 ]

misha.tyulenev

I discovered some more odd behavior in another symptomatic suite. Check out this Lobster log:

For all three hosts, two connections to the primary are made almost simultaneously. Each connection thinks it's the first connection and reports only one connection active. This seems to be why one connection isn't being killed properly. This happens once from two places in the Refresh thread (on secondary1), and twice (primary, secondary0) with the Refresh and Reap threads opening connections simultaneously.

I think it has to do with the call to setupSessionsCollection, because the double connect happens soon after the processes stop complaining that

Sessions collection is not set up; waiting until next sessions refresh interval: Replication has not yet been configured

I have an Evergreen build running here to verify this is the case.

I haven't made any progress in understanding why the primary was restarted in the job analyzed in the ticket description. However, not all of the jobs I analyzed had a node restart, so I'm going to focus on the double-connect behavior for now.

Comment by Misha Tyulenev [ 29/Aug/18 ]

blake.oler: sessions_collection_rs.cpp needs to find out the topology of the replica set, so it uses RemoteCommandTargeterRS, which takes the connection string and either finds the ReplicaSetMonitor (RSM) in the existing global pool or creates a new one. It's normal if that happens once per connection string per process (not per thread).
However, the log line "Starting new replica set monitor for rs/localhost:20750"
looks suboptimal to me, because that is just one host for a 3-node replica set. This is also why it can't find the primary. Later, the RSM discovers the correct connection string in the isMaster response.
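
For illustration, a sketch of that find-or-create pattern (assuming the ReplicaSetMonitor::createIfNeeded interface of the time; the seed list is taken from the log line above):

    // One ReplicaSetMonitor per set name, shared process-wide:
    // createIfNeeded returns the existing monitor if one is registered.
    auto rsm = mongo::ReplicaSetMonitor::createIfNeeded(
        "rs", {mongo::HostAndPort("localhost", 20750)});
    // With only one of the three members in the seed list, the monitor
    // cannot find the primary until an isMaster response reveals the
    // full member list.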
