[SERVER-26740] Total connections is not stable and spiky when user authentication is enabled Created: 24/Oct/16  Updated: 08/Feb/23  Resolved: 09/Jun/17

Status: Closed
Project: Core Server
Component/s: Networking, Security
Affects Version/s: 3.2.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: 아나 하리 Assignee: Spencer Jackson
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Connections.png     PNG File Query.png     File diagnostic.data.tar.gz    
Issue Links:
Duplicate
duplicates SERVER-26952 Cache SCRAM-SHA-1 ClientKey Closed
Related
is related to SERVER-25027 Configurable connection pools size fo... Closed
Operating System: ALL
Participants:
Case:

 Description   

The MongoDB cluster consists of 28 shards and 15 mongos instances.
There are about 4K queries/sec per shard plus small DML operations, and about 600~1000 connections (from mongos) per shard.

The MongoDB connections graph is spiky only when user authentication is enabled, not when authentication is disabled. The Java client driver also returns errors (like the one below) or slow responses when user authentication is enabled. (I have attached the connections and query/sec graphs.)

[ERROR] [c.k.s.m.g.r.RequestExecuteCallable] execute (71): Too many threads are already waiting for a connection. Max number of threads (maxWaitQueueSize) of 400 has been exceeded.
com.mongodb.MongoWaitQueueFullException: Too many threads are already waiting for a connection. Max number of threads (maxWaitQueueSize) of 400 has been exceeded.
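
(For reference: in the 3.x Java driver, the maxWaitQueueSize in this error is not set directly; it is connectionsPerHost multiplied by threadsAllowedToBlockForConnectionMultiplier. A minimal sketch of the relevant client options follows; the host name and values are made up for illustration and assume the 3.x MongoClientOptions API.)

    import com.mongodb.MongoClient;
    import com.mongodb.MongoClientOptions;
    import com.mongodb.ServerAddress;

    public class PoolConfigExample {
        public static void main(String[] args) {
            // Illustrative values only: wait queue size =
            // connectionsPerHost * threadsAllowedToBlockForConnectionMultiplier (100 * 4 = 400 here).
            MongoClientOptions options = MongoClientOptions.builder()
                    .connectionsPerHost(100)
                    .threadsAllowedToBlockForConnectionMultiplier(4)
                    .maxWaitTime(10000) // milliseconds a thread may wait for a pooled connection
                    .build();
            MongoClient client = new MongoClient(new ServerAddress("mongos-host", 27017), options);
            client.close();
        }
    }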

The connection spikes (and client queue errors) happen at roughly 5-minute intervals, though not exactly 5 minutes apart.
I am not yet sure whether "ConnectionPool::kDefaultHostTimeout" causes this issue, but I think 5 minutes is too short a timeout for shutting down a connection pool; it seems very aggressive. Is there a reason 5 minutes was chosen for shutting down the pool?

https://github.com/mongodb/mongo/blob/master/src/mongo/executor/connection_pool.cpp#L177



 Comments   
Comment by Kelsey Schubert [ 09/Jun/17 ]

Hi matt.lee,

We've seen significant improvement for users running with the fix from SERVER-26952. This fix is included in MongoDB 3.4.4 and later, and we're evaluating a backport to MongoDB 3.2. Please watch SERVER-26952 for updates.

Kind regards,
Thomas

Comment by Mira Carey [ 09/Nov/16 ]

matt.lee,

I had a conversation with an engineer on our security team, and he had an idea he has written up in SERVER-26952 (caching SCRAM credentials for connection pooling). That change is very likely to help mitigate the kind of situation you've seen in this ticket.
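
(For context, the expensive step that caching would avoid repeating is the SCRAM-SHA-1 key derivation performed for every new connection. Below is a rough Java sketch of the idea only, not the server's actual C++ implementation; the class and method names are hypothetical.)

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import javax.crypto.Mac;
    import javax.crypto.SecretKeyFactory;
    import javax.crypto.spec.PBEKeySpec;
    import javax.crypto.spec.SecretKeySpec;

    public final class ScramClientKeyCache {
        private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

        // Hypothetical helper: derive (or reuse) the SCRAM-SHA-1 ClientKey for a user.
        public byte[] clientKey(String user, char[] password, byte[] salt, int iterations) throws Exception {
            String cacheKey = user + ":" + Base64.getEncoder().encodeToString(salt) + ":" + iterations;
            byte[] cached = cache.get(cacheKey);
            if (cached != null) {
                return cached; // cache hit: skip PBKDF2 when the pool opens another connection
            }
            // SaltedPassword := PBKDF2-HMAC-SHA-1(password, salt, iterations) -- the expensive step
            PBEKeySpec spec = new PBEKeySpec(password, salt, iterations, 160);
            byte[] saltedPassword = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1")
                    .generateSecret(spec).getEncoded();
            // ClientKey := HMAC-SHA-1(SaltedPassword, "Client Key")
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(saltedPassword, "HmacSHA1"));
            byte[] key = mac.doFinal("Client Key".getBytes(StandardCharsets.UTF_8));
            cache.put(cacheKey, key);
            return key;
        }
    }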

While it won't make it for 3.4.0, it's an important change that I'll definitely be considering for a backport.

Comment by Mira Carey [ 08/Nov/16 ]

matt.lee,

My apologies, I hadn't updated the description of SERVER-25027 yet. That ticket has been re-purposed to expose the relevant options in mongos, and it will cover minConnections and hostTimeout.

Comment by 아나 하리 [ 08/Nov/16 ]

Thanks Jason.

So you think the connection spikes are expected when client authentication is enabled.

Also, I think we need a knob to control minConnections and hostTimeout of the connection pool (SpecificPool) in mongos, rather than SERVER-25027, like what I submitted in SERVER-26720.

Regards,
Matt.

Comment by Mira Carey [ 07/Nov/16 ]

matt.lee,

Regarding HostTimeoutMS: it is the number of milliseconds to keep a pool alive, with min connections (default 1) open, in the absence of activity. So if you're seeing spikes in this way, they would be for hosts you're not actively using.

In general, it's not surprising that auth changes the cost of these kinds of pool evictions (it adds a large fixed overhead to connection establishment, by design). One pattern you could be hitting might be:

All of this in mongos:

  1. A burst of connections to a host
  2. No activity
  3. Connections drop down to minConnections (1) after 60 seconds (the point at which we have to start heartbeating the node to keep the connections alive)
  4. Connections drop to 0 after another 4 minutes

SERVER-25027 adds these parameters as exported server parameters, and we intend to backport these knobs to 3.2. That offers one possible way forward (tuning up the various timeouts): trading off holding more connections open for longer in return for smoothing out the connection spikes.
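
(Once those parameters are available, one way to raise the idle-pool timeout and minimum pool size at runtime is setParameter against the mongos admin database. A rough sketch using the Java driver follows; it assumes a build that ships the SERVER-25027 parameters, and the host name and values are illustrative, not recommendations.)

    import com.mongodb.MongoClient;
    import com.mongodb.ServerAddress;
    import org.bson.Document;

    public class TunePoolTimeouts {
        public static void main(String[] args) {
            // Hypothetical host and values; requires a mongos that exposes the SERVER-25027 parameters.
            MongoClient mongos = new MongoClient(new ServerAddress("mongos-host", 27017));
            Document cmd = new Document("setParameter", 1)
                    .append("ShardingTaskExecutorPoolMinSize", 10)            // keep more warm connections per host
                    .append("ShardingTaskExecutorPoolHostTimeoutMS", 900000); // retire idle pools after 15 min instead of 5
            mongos.getDatabase("admin").runCommand(cmd);
            mongos.close();
        }
    }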

In the meantime, your best mitigation would be to keep some level of regular traffic going to those hosts.

Regards,
Jason

Comment by 아나 하리 [ 07/Nov/16 ]

Hi Thomas.

I uploaded mongod and mongos log files (not for every mongos and mongod, just one mongos and one mongod server).

18:30 ~ 19:00 (+09:00) :: auth enabled test
20:00 ~ 20:30 (+09:00) :: auth disabled test

Regards,
Matt.

Comment by Kelsey Schubert [ 04/Nov/16 ]

Hi matt.lee,

I've created a secure upload portal for you to use here. Files uploaded to this portal are visible only to MongoDB employees investigating this issue and are routinely deleted after some time.

Would you please let us know when you have completed the upload so we can continue the investigation? We'll keep looking into this here to determine whether there is an appropriate code change that would improve the behavior you are observing.

Thank you,
Thomas

Comment by 아나 하리 [ 04/Nov/16 ]

Andreas Nilsson,

Is there any way to send the log files to you by email or some other means?
They are too big to attach to this thread (and I also don't want them to be public).

Regards,
Matt.

Comment by Andreas Nilsson [ 03/Nov/16 ]

matt.lee, do you have any logs from mongos/mongod covering the period of spiky connections? It's hard to say anything using only the diagnostic data.

Thanks,
Andreas

Comment by Andreas Nilsson [ 02/Nov/16 ]

Hi matt.lee, sorry for not getting to this earlier. It is on our queue and we will look at the metrics file tomorrow.

Best,
Andreas

Comment by 아나 하리 [ 02/Nov/16 ]

Hi...

Is anyone looking into this case?
Sorry for pushing you, but it's rather urgent for me.

Regards,
Matt.

Comment by 아나 하리 [ 25/Oct/16 ]

We restarted a few times during the test.
I think the metrics data you want is in the "metrics.2016-10-24T10-09-08Z-00000" files.

Thanks.

Comment by Ramon Fernandez Marina [ 24/Oct/16 ]

matt.lee, can you please upload the logs of a mongos and a mongod during a period of spiky connections? I'd also like to take a look at the contents of the diagnostic.data from a mongod inside one of the shards.

I'm looking to see where those spiky connections come from, so please choose a representative mongod; for example, if you're doing secondary reads then include logs and diagnostic data from the node that's serving those reads as well as its primary.

Thanks,
Ramón.
