[SERVER-36923] Scoped connections not being returned from Logical Sessions Cache machinery Created: 29/Aug/18 Updated: 08/Jan/24 Resolved: 05/Sep/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Blake Oler | Assignee: | Blake Oler |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Sharding 2018-09-10 | ||||||||
| Participants: | |||||||||
| Description |
InvestigationScoped connection errors exist in a custom logical session refresh jsCore passthrough suite. This suite sets the logical session refresh time to 100ms. At the end of each job, secondaries cannot create a connection to the primary because "only one usage of each socket address (protocol/network address/port) is normally permitted." Throughout the job's lifetime, we see instances of "scoped connection" ... "not being returned to the pool" from all nodes. This is the ordering of events I was able to put together from the highlighted lines on the Lobster log for one of the jobs. 1. The LogicalSessionCache Refresh thread on the primary starts a replica set monitor on itself (is this normal?) ProblemWe are letting ScopedDBConnections go out of scope before calling done() – the call to done() places the connection back on the pool. Proposed fixCall done() on the connection before the connection goes out of scope. |
| Comments |
| Comment by Misha Tyulenev [ 31/Aug/18 ] | |
|
Good point blake.oler! while it does not remove the race - i.e. its still possible to get this error by opening the connection to the primary from the reaper and the refreshSessions threads | |
| Comment by Blake Oler [ 31/Aug/18 ] | |
|
It turns out the issue was unrelated to races between the reap/refresh threads. Whenever we send a command to an RS primary from SessionsCollection RS, we drop the reference to the ScopedDBConnection without calling done() on it. Adding the call to done() removes all connection pool errors and the end-of-job socket errors. Whether we should still approach serialization of the reap and refresh threads misha.tyulenev is an open question. Let's discuss this tomorrow. | |
| Comment by Mira Carey [ 30/Aug/18 ] | |
|
I'm happy with adding a mutex to the sessions collection (i.e. synchronizing in that type, rather than adding logic to the cache) | |
| Comment by Misha Tyulenev [ 29/Aug/18 ] | |
|
Looks to me that we have a race between Reap and Refresh threads. Need to synchronize the access and it should be gone. I have updated the description with the proposed fix section. | |
| Comment by Blake Oler [ 29/Aug/18 ] | |
|
Discovered some more odd behavior in another symptomatic suite. Check out this lobster: For all three hosts, two connections to the primary are made almost simultaneously. Each connection thinks it's the first connection and reports only one connection active. This seems to be why one connection isn't being killed properly. This happens once in two places in the Refresh thread (on secondary1), and twice (primary, secondary0) with the Refresh/Reap threads opening a connection simultaneously. I think it has to do with the call to setupSessionsCollection, because the double connect happens soon after the processes stop complaining that
I have an Evergreen build running here to verify this is the case. I haven't achieved any progress with understanding why the primary was restarted in the job analyzed in the ticket description. However, not all jobs analyzed had a node restart, so I'm going to focus on the double connect behavior for now | |
| Comment by Misha Tyulenev [ 29/Aug/18 ] | |
|
blake.oler sessions_collection_rs.cpp need to find out the topoly of RS so it uses RemoteCommandTargeterRS which takes the connection string and either finds the RSM in the existing global pool or create new. Its normal if it happens once per connection string per process (not thread) |