[SERVER-35010] LDAP failover/failback selection is suboptimal Created: 16/May/18 Updated: 27/Oct/23 Resolved: 10/Jun/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Security |
| Affects Version/s: | 3.6.3 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Luke Prochazka | Assignee: | Backlog - Security Team |
| Resolution: | Gone away | Votes: | 9 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Server Security |
| Description |
|
Undesirable behaviour has been observed with LDAP server failover and failback. The reproduction case shows that one of the failure modes is handled suboptimally. I suggest this stems from the root issue that mongod has no notion of LDAP server availability: there is no keepalive or heartbeat, and no reasonable attempt to load balance requests across multiple LDAP servers, because the primary server is overwhelmingly preferred (even in the event of failure). |
| Comments |
| Comment by Jonathan Reams [ 11/Feb/19 ] |
|
luke.prochazka, here's a summary of all the changes to connection handling in There is no option to round-robin requests across all available LDAP servers; the first LDAP server whose connection succeeds (either because it connected fastest or because a pooled idle connection was available) is the one that gets used. Idle connections are periodically refreshed in the background by running an empty RootDSE query, so you can be reasonably sure that an LDAP connection is healthy before it gets used for auth. If a connection encounters an error, either during a refresh or during normal use, it is not returned to the pool; all of the existing connections to that host are assumed to be bad and dropped, and a new connection attempt is started in the background. Once an LDAP server recovers and is available again, the connection pool will start using it automatically. If the customer is using round-robin A records, connection failover will not improve compared to the non-pooled implementation, except that connections will be reused as long as there are no connection issues. |
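The pooling behaviour described in the comment above can be pictured with a small Python model. This is only a sketch under stated assumptions, not the server's implementation: `LdapPoolSketch`, `HostPool`, and the `connect` and `probe` callables are hypothetical names. Hosts are tried sequentially in configured order rather than raced in parallel, and the "background" reconnect is a single inline attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class HostPool:
    """Idle connections held for one LDAP host (hypothetical structure)."""
    host: str
    idle: List[str] = field(default_factory=list)


class LdapPoolSketch:
    """Toy model of the pooling rules; every name here is an assumption."""

    def __init__(self, hosts: List[str],
                 connect: Callable[[str], str],
                 probe: Callable[[str], bool]) -> None:
        # dict preserves the configured host order
        self.pools: Dict[str, HostPool] = {h: HostPool(h) for h in hosts}
        self.connect = connect  # connect(host) -> connection id, may raise ConnectionError
        self.probe = probe      # stand-in for the empty RootDSE health query

    def acquire(self) -> Tuple[str, str]:
        # A pooled idle connection wins if one exists.
        for pool in self.pools.values():
            if pool.idle:
                return pool.host, pool.idle.pop()
        # Otherwise use the first host whose connection attempt succeeds.
        for pool in self.pools.values():
            try:
                return pool.host, self.connect(pool.host)
            except ConnectionError:
                continue
        raise ConnectionError("no LDAP server reachable")

    def release(self, host: str, conn: str, ok: bool = True) -> None:
        if ok:
            self.pools[host].idle.append(conn)
        else:
            # A failed connection is not returned to the pool.
            self._reset_host(host)

    def _reset_host(self, host: str) -> None:
        # Assume every connection to this host is bad, drop them all,
        # then make one reconnect attempt (inline here, background in reality).
        pool = self.pools[host]
        pool.idle.clear()
        try:
            pool.idle.append(self.connect(host))
        except ConnectionError:
            pass  # host still down; the next refresh retries

    def refresh_idle(self) -> None:
        # Periodic refresh: probe idle connections and retry hosts with none.
        for pool in self.pools.values():
            if not pool.idle or any(not self.probe(c) for c in pool.idle):
                self._reset_host(pool.host)
```

In this model, a host that fails a refresh loses all of its idle connections, so the next `acquire()` falls through to another host. Once the failed host answers a later refresh, it holds an idle connection again and is picked up without any explicit failback step.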
| Comment by Dayo Lasode [ 25/Jan/19 ] |
|
I think the proposed behavior should fix this (subject to further tests). My assumption is that if connections in the pool are stale, e.g. because an initially selected LDAP server is down or a timeout threshold is surpassed, the pool is refreshed based on whichever LDAP servers are still online, and steps (1) to (3) in the auth process are repeated with the same connection? I'm assuming those 3 steps are sequential. Thanks |
| Comment by Jonathan Reams [ 24/Jan/19 ] |
|
This may be improved by |