[SERVER-35010] LDAP failover/failback selection is suboptimal Created: 16/May/18  Updated: 27/Oct/23  Resolved: 10/Jun/19

Status: Closed
Project: Core Server
Component/s: Networking, Security
Affects Version/s: 3.6.3
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Luke Prochazka Assignee: Backlog - Security Team
Resolution: Gone away Votes: 9
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-34260 Ability to reuse a single TCP connect... Closed
Assigned Teams:
Server Security
Participants:
Case:

Description

Undesirable behaviour has been observed with respect to LDAP server failover and failback. The reproduction case shows that at least one of the failure modes leads to notably suboptimal server selection.

I suggest this stems from the root issue that mongod has no notion of LDAP server availability. There is no keepalive or heartbeat, nor any reasonable attempt to load balance requests across multiple LDAP servers, as the primary server is overwhelmingly preferred (even in the event of failure).



Comments
Comment by Jonathan Reams [ 11/Feb/19 ]

luke.prochazka, here's a summary of the changes to connection handling from SERVER-34260. The underlying connection pool implementation introduced by SERVER-34260 is the same one mongos uses to talk to shards in a sharded cluster.

There is no option to round-robin requests across all available LDAP servers; the first LDAP server whose connection succeeds (either because it connected fastest or because an idle pooled connection was available) is the one that gets used.
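
To make that selection concrete, here is a minimal Python sketch of racing connects to every host and using whichever succeeds first. The real implementation is C++ inside mongod; the host list, port, and timeout here are hypothetical.

import asyncio

# Hypothetical host list; mongod takes these from its LDAP server setting.
LDAP_HOSTS = [("ldap1.example.com", 389), ("ldap2.example.com", 389)]

async def first_successful_connection(hosts, timeout=5.0):
    # Race a TCP connect to every host at once; there is no round-robin,
    # the first connection to succeed simply wins.
    tasks = [asyncio.create_task(asyncio.open_connection(host, port))
             for host, port in hosts]
    pending = set(tasks)
    try:
        while pending:
            # For simplicity, the timeout applies per wait round in this sketch.
            done, pending = await asyncio.wait(
                pending, timeout=timeout,
                return_when=asyncio.FIRST_COMPLETED)
            if not done:
                break  # timed out with no winner
            for task in done:
                if task.exception() is None:
                    return task.result()  # (reader, writer) of the winner
        raise ConnectionError("no LDAP server could be reached")
    finally:
        for task in pending:
            task.cancel()  # abandon the losers

# Usage: reader, writer = asyncio.run(first_successful_connection(LDAP_HOSTS))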

Idle connections are periodically refreshed in the background by running an empty RootDSE query, so you can be reasonably sure that an LDAP connection is healthy before it is used for auth.
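
For reference, this is roughly what such a health probe looks like at the protocol level. The sketch uses the third-party Python ldap3 library (mongod's internal implementation is C++ and differs), and the server name is a placeholder.

from ldap3 import Server, Connection, BASE

def connection_is_healthy(conn):
    # Read the RootDSE: an anonymous BASE-scope search against the empty DN.
    # Any functioning LDAP server answers it, so a failure strongly suggests
    # the pooled connection has gone bad.
    try:
        return conn.search(search_base="",
                           search_filter="(objectClass=*)",
                           search_scope=BASE,
                           attributes=["supportedLDAPVersion"])
    except Exception:
        return False

# Usage with a placeholder server:
# conn = Connection(Server("ldap://ldap1.example.com"), auto_bind=True)
# healthy = connection_is_healthy(conn)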

If a connection encounters an error, either during a refresh or during normal use, it is not returned to the pool; all existing connections to that host are assumed to be bad and are dropped, and a new connection attempt is started in the background.

Once an LDAP server recovers and is available again, it will start being used automatically by the connection pool.
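
The behaviour in the last two paragraphs can be modelled with a toy Python sketch; the class and callback names below are hypothetical, not mongod internals.

import collections

class LdapPoolModel:
    # Toy model only: on any error, every pooled connection to the failed
    # host is dropped and a background reconnect is scheduled; once the
    # host answers again, it automatically re-enters the rotation.
    def __init__(self):
        self.idle = collections.defaultdict(list)  # host -> idle connections

    def on_connection_error(self, host, schedule_reconnect):
        # Assume every existing connection to this host is bad; drop them all.
        for conn in self.idle.pop(host, []):
            conn.close()
        # Kick off a background connection attempt so failback is automatic.
        schedule_reconnect(host)

    def on_reconnect_succeeded(self, host, conn):
        # The host recovered: pool the new connection and resume using it.
        self.idle[host].append(conn)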

If the customer is using round-robin DNS A records, there won't be any improvement in connection failover compared to the non-pooled implementation, except that connections will be reused as long as there are no connection issues.

Comment by Dayo Lasode [ 25/Jan/19 ]

I think the proposed behavior should fix this (subject to further tests).

My assumption is that if a connection in the pool is stale, e.g. because the initially selected LDAP server is down or a timeout threshold is exceeded, it is refreshed based on whichever LDAP servers are still online, and steps (1) to (3) in the auth process are repeated with the same connection?

I'm assuming those three steps are sequential.

Thanks

Comment by Jonathan Reams [ 24/Jan/19 ]

This may be improved by SERVER-34260, which implements a connection pool, attempts to connect to all the hosts in the URI at once, and uses the first one that actually succeeds.
