[GODRIVER-2024] Connection pool, long semaphore wait causes connection close Created: 25/May/21  Updated: 02/Sep/21  Resolved: 26/Jul/21

Status: Closed
Project: Go Driver
Component/s: Connections
Affects Version/s: 1.5.2
Fix Version/s: None

Type: Improvement Priority: Critical - P2
Reporter: Aaron Kahn Assignee: Matt Dale
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
Related
is related to GODRIVER-2037 Don't clear the connection pool on Co... Closed
is related to GODRIVER-2038 Use "ConnectionTimeout" for creating ... Closed
Epic Link: Connection pool improvements
Case:
Documentation Changes: Needed
Documentation Changes Summary:

The proposed solution is to make two changes, both of which need additional documentation:
1. Metrics for connection pool checkout duration (semaphore wait time)
2. A new configuration option for either a minimum connection I/O duration or a maximum connection checkout (semaphore wait) duration.


 Description   

Problem

We found the driver unnecessarily closes connections and clears the connection pool under high load.

This occurs when the semaphore wait time to acquire a connection approaches the context timeout. If a connection is acquired with little to no context deadline remaining, the connection is closed because any use of it results in a timeout. After the connection is closed, other goroutines attempt to open a new connection with a similarly short remaining deadline; when that connection fails to be created, the entire pool is cleared (the generation is incremented). This vicious cycle repeats, increasing both error rates and cluster CPU (spent creating the new connections).

Proposed Solution

  1. Publish metrics for connection pool checkout duration (semaphore wait time)
  2. Prevent closing connections when the remaining deadline is below a threshold. This can be accomplished in one of a few ways:
    1. Add a client option for a minimum connection I/O duration. After acquiring a connection, if the context has a deadline and the remaining duration is below the minimum connection I/O duration, fail fast before attempting to use the connection (see the sketch after this list).
    2. Add a client option for a maximum connection pool checkout duration (semaphore wait duration). If the context has a deadline and the remaining duration is greater than the maximum checkout duration, call acquire with a new context whose deadline equals the maximum semaphore wait time.
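
A minimal sketch of the fail-fast check from option 2.1 above. The package name, helper name, and the minConnIODuration value are hypothetical illustrations, not existing driver API; the idea is that the pool (or the operation layer) would run this check right after checkout and return the connection rather than using and then closing it:

package pool

import (
	"context"
	"fmt"
	"time"
)

// minConnIODuration is a hypothetical client option, not part of the driver today.
const minConnIODuration = 50 * time.Millisecond

// checkRemainingDeadline fails fast when the context's remaining deadline is too
// short to realistically complete any I/O on a checked-out connection.
func checkRemainingDeadline(ctx context.Context) error {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < minConnIODuration {
		// Bailing out here avoids using, and then closing, the connection on a guaranteed timeout.
		return fmt.Errorf("remaining deadline %v below minimum connection I/O duration %v: %w",
			time.Until(deadline), minConnIODuration, context.DeadlineExceeded)
	}
	return nil
}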

Example error pattern:

time="2021-05-24T14:29:24-07:00" level=info msg=mongo_pool_event activity=true connection_id=0 reason=timeout type=ConnectionCheckOutFailedSemaphore
time="2021-05-24T14:29:24-07:00" level=info msg=mongo_pool_event activity=true connection_id=0 reason="ProcessHandshakeError: connection() error occured during connection handshake: context deadline exceeded" type=ConnectionPoolCleared

A way to replicate this problem locally is to run a script with high concurrency, a low context timeout, and a low maximum connection pool size, as in the sketch below.
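
A rough reproduction sketch along those lines (the URI, pool size, timeout, and goroutine count are arbitrary example values):

package main

import (
	"context"
	"log"
	"sync"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Small pool + short per-operation timeout + high concurrency drives up semaphore wait time.
	opts := options.Client().ApplyURI("mongodb://localhost:27017").SetMaxPoolSize(5)
	client, err := mongo.Connect(context.Background(), opts)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(context.Background())

	coll := client.Database("test").Collection("repro")
	var wg sync.WaitGroup
	for i := 0; i < 500; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
			defer cancel()
			// Most of the 100ms is spent waiting for a connection, so whatever
			// is left after checkout is frequently too short for the operation.
			if err := coll.FindOne(ctx, bson.D{}).Err(); err != nil {
				log.Println(err)
			}
		}()
	}
	wg.Wait()
}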



 Comments   
Comment by Matt Dale [ 26/Jul/21 ]

Update:

GODRIVER-2038 and GODRIVER-2078 are in-progress and should resolve the problematic driver behavior described in our communications with the Tesla team. The Tesla team is currently unblocked by running on their own fork with changes that create all connections in the background. I'm closing this ticket because we have a resolution path tracked in other tickets.

Comment by Matt Dale [ 04/Jun/21 ]

akahn@tesla.com Thanks for getting back to me on those questions!

Comment by Aaron Kahn [ 03/Jun/21 ]

Matt,
I didn't address your questions earlier; answers in order:
1. The production application that experienced the outage uses a 5s context timeout.
2. Yes, we experience timeouts, but they are rare, on the order of 1-5 per day.
3. Atlas-hosted.
4. Yes, we are using TLS connections.

Comment by Aaron Kahn [ 03/Jun/21 ]

@matt.dale Thanks for the follow-up and for creating the additional issues. Good suggestion on MinPoolSize; we have used it in the past, but the app that caused the incident reported above did not have a minimum set.
We've experienced this issue with both a 20 and a 100 max pool size. I'm on an unrelated problem until late tomorrow and will be back to working on the driver changes on Friday. I'll include you on the PR and let you know if I run into any blockers with tests, etc.

Comment by Matt Dale [ 03/Jun/21 ]

akahn@tesla.com I have a few questions to help me better understand your use case:

  • What are the Context timeout durations on the operations that cause the most timeout errors? E.g. 500ms, 1s, 10s?
  • Do you typically encounter some MongoDB operation timeouts when your application/service is operating normally, or do you typically encounter no MongoDB timeouts unless something is abnormal?
  • Is your MongoDB server self-hosted or Atlas-hosted?
  • Are you using TLS connections?

As far as mitigations, I recommend setting the client minPoolSize to a value greater than 0. Setting minPoolSize starts a background goroutine that runs once per minute and attempts to maintain at least the configured number of connections in the pool, creating new connections until minPoolSize is reached. The maintenance goroutine creates connections using the timeout configured with SetConnectTimeout or the default connection timeout (30 seconds).

E.g. client configuration:

clientOpts := options.
	Client().
	ApplyURI(uri).
	SetMaxPoolSize(20).
	SetMinPoolSize(10)  // Set the min connection pool size.
client, err := mongo.Connect(context.Background(), clientOpts)
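
If the 30-second default isn't appropriate, the connect timeout used for new connections can also be set explicitly. A sketch extending the configuration above (the 10-second value is just an example):

clientOpts := options.
	Client().
	ApplyURI(uri).
	SetMaxPoolSize(20).
	SetMinPoolSize(10).
	SetConnectTimeout(10 * time.Second) // new connections use this timeout rather than the operation Context
client, err := mongo.Connect(context.Background(), clientOpts)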

Additionally, I've created two new tickets that describe driver improvements to reduce the impact of driver-side operation timeouts on the connection pool:

  • GODRIVER-2037 - Don't clear the connection pool on Context timeout during handshake
  • GODRIVER-2038 - Use "ConnectionTimeout" for creating all new connections and background connection creation

I think the changes to accomplish GODRIVER-2038 may be similar to the refactor you proposed using channels and async connection creation. If you're still working on that improvement, please continue and include me as a reviewer on any PRs.

Comment by Matt Dale [ 28/May/21 ]

akahn@tesla.com I've been able to reproduce a similar problem with many connections being closed due to operation timeouts and then new connections being created with the timeout of the operation context instead of a separate connection timeout. I'm still investigating the reason behind the connections being closed in the first place and looking for any short-term mitigations, but I agree that the design of the connection pool could be significantly improved. Please tag me on any PRs you open, I'd be happy to review them.

Thanks!

Comment by Aaron Kahn [ 28/May/21 ]

Matt,
Unfortunately, we had another incident https://support.mongodb.com/case/00781659 caused by the driver that my proposed solution above would not have mitigated. I believe there is a fundamental design flaw in the connection pool in that it will attempt to spin up connections using the query context. I'm taking a stab at rewriting the connection pool management using channels and purely async connection creation. My plan is to PR the change from the teslamotors fork back to mainline, but the change is somewhat large. If you're up for a preliminary review of a draft, let's sync up mid next week.

-Aaron

Comment by Matt Dale [ 28/May/21 ]

akahn@tesla.com thanks for reporting this issue! We're looking into it and will let you know if we have any questions.
