- Type: Epic
- Resolution: Unresolved
- Priority: Unknown
- None
- Component/s: Retryability
- To Do
Drivers should retry authentication errors when connection handshake fails
- Needed
Summary
We had a customer with one mongos that could not reach the LDAP server (due to a transient network issue) and so failed to authenticate new connections. The other mongos was fine. Can drivers treat the handshake as failed when external authentication is not possible?
In our repro, blocking the ports to the LDAP server gave an error like:
Caused by: com.mongodb.MongoCommandException: Command failed with error 18 (AuthenticationFailed): 'Authentication failed.' on server 192.168.1.122:27017. The full response is {"ok": 0.0, "errmsg": "Authentication failed.", "code": 18, "codeName": "AuthenticationFailed", "operationTime": {"$timestamp": {"t": 1610717564, "i": 2}}, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1610717564, "i": 2}}, "signature": {"hash": {"$binary": "VlqG0NMZ2vycHdc1jR1u6Zvika4=", "$type": "00"}, "keyId": {"$numberLong": "6917665203874168863"}}}}
If the driver could blacklist the failing mongos for X seconds and then retry the operation via a healthy mongos, we would avoid this specific failure.
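As a rough illustration of the requested behavior (not an existing driver API), the sketch below deprioritizes a mongos for a cooldown period after an authentication failure during the handshake, then retries the operation once on another mongos. `HandshakeAuthError`, `MongosSelector`, `run_with_retry`, and the `cooldown_seconds` parameter are hypothetical names chosen for this example; only error code 18 (AuthenticationFailed) comes from the report above.

```python
import time

AUTHENTICATION_FAILED = 18  # error code seen in the report above


class HandshakeAuthError(Exception):
    """Hypothetical error: authentication failed during the connection handshake."""

    def __init__(self, address, code):
        super().__init__(f"handshake authentication failed on {address} (code {code})")
        self.address = address
        self.code = code


class MongosSelector:
    """Hypothetical selector that skips a recently failed mongos for a cooldown period."""

    def __init__(self, addresses, cooldown_seconds, clock=time.monotonic):
        self.addresses = list(addresses)
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.deprioritized_until = {}  # address -> time at which the cooldown ends

    def mark_failed(self, address):
        self.deprioritized_until[address] = self.clock() + self.cooldown_seconds

    def select(self):
        now = self.clock()
        healthy = [
            a for a in self.addresses
            if self.deprioritized_until.get(a, float("-inf")) <= now
        ]
        # If every mongos is on cooldown, fall back to the full list.
        return (healthy or self.addresses)[0]


def run_with_retry(selector, operation):
    """Run operation(address); retry once elsewhere after a handshake auth failure."""
    address = selector.select()
    try:
        return operation(address)
    except HandshakeAuthError as exc:
        if exc.code != AUTHENTICATION_FAILED:
            raise
        selector.mark_failed(address)
        # Single retry; a second failure propagates to the caller.
        return operation(selector.select())
```

In this sketch, a deployment with only one mongos would retry against the same server, so the cooldown only helps when a second mongos is available.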
Motivation
Who is the affected end user?
Who are the stakeholders?
How does this affect the end user?
Are they blocked? Are they annoyed? Are they confused?
How likely is it that this problem or use case will occur?
Main path? Edge case?
If the problem does occur, what are the consequences and how severe are they?
Minor annoyance at a log message? Performance concern? Outage/unavailability? Failover can't complete?
Is this issue urgent?
Does this ticket have a required timeline? What is it?
Is this ticket required by a downstream team?
Needed by e.g. Atlas, Shell, Compass?
Is this ticket only for tests?
Does this ticket have any functional impact, or is it just test improvements?
Cast of Characters
Engineering Lead:
Document Author:
POCers:
Product Owner:
Program Manager:
Stakeholders:
Channels & Docs
Slack Channel
[Scope Document|some.url]
[Technical Design Document|some.url]
- is related to
  - DRIVERS-2247 Add tests for non-retryable handshake errors (Backlog)
- related to
  - DRIVERS-746 Drivers should retry operations if connection handshake fails (Implementing)
  - DRIVERS-1571 Direct read/write retries to another mongos if possible (Development Complete)