For background on this issue see the problems discussed in DRIVERS-1692 and DRIVERS-1707 and related tickets.
A server can become blackholed after an unclean OS shutdown or a network failure (also called unplanned maintenance). Any operation sent to that server hangs, and the driver will not time out the operation until socketTimeoutMS, TCP keepalive (3.5 minutes), or the TCP retransmission timeout (15 minutes) expires.
DRIVERS-1707 proposes working around this problem by relying on the more consistent timeout behavior of SDAM health checks. The downsides are implementation complexity and the fact that it will not work with serverless/load balancers, where SDAM does not run.
DRIVERS-1692 proposes a fix by enabling a shorter TCP retransmission timeout by configuring TCP_USER_TIMEOUT. The downside is that not all languages or OSes support this feature (Java does not support it and neither does Windows AFAICT).
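For illustration, here is a minimal sketch of what the DRIVERS-1692 approach could look like in Python. The helper name `set_tcp_user_timeout` and the 10-second value are assumptions made for this example, not taken from that ticket. `socket.TCP_USER_TIMEOUT` is only defined on platforms that expose the option (Linux), which is why the sketch has to degrade gracefully elsewhere.

```python
import socket


def set_tcp_user_timeout(sock: socket.socket, timeout_ms: int) -> bool:
    """Try to enable TCP_USER_TIMEOUT; return False where it is unsupported."""
    # The constant only exists in Python builds for platforms that have it.
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        return False
    try:
        # The option value is expressed in milliseconds.
        sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    except OSError:
        return False
    return True


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
supported = set_tcp_user_timeout(sock, 10_000)  # hypothetical 10 s limit
sock.close()
```

When `supported` is False the driver would have to fall back to another mechanism, which is the portability gap described above.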
This ticket proposes that we change the MongoDB wire protocol to fix this issue. When the client runs a long-running command, the server would stream OP_MSG keepalive replies back to the client every X seconds until the command finishes, and then send the final command response. A conversation would work like this:
    # Clients that support the protocol send 'keepaliveTimeMS' with the connection handshake.
    client: {hello: 1, keepaliveTimeMS: 5000}

    # Servers that support the protocol send 'keepalive' with the connection handshake response.
    server: {ok: 1, keepalive: 1}

    # Client initiates long-running command.
    client: {find: 'test', filter: ...}

    # Server waits for keepaliveTimeMS then responds with a keepalive message (OP_MSG keepaliveFlag=1):
    server: {ok: 1, keepalive: 1, keepaliveFlag: 1}

    # Server waits for another keepaliveTimeMS then responds with a keepalive message (OP_MSG keepaliveFlag=1):
    server: {ok: 1, keepalive: 1, keepaliveFlag: 1}

    ... # Repeat until command is done processing

    # Server completes command and sends the final result (OP_MSG keepaliveFlag=0):
    server: {ok: 1, cursor: {firstBatch: [...]}, keepaliveFlag: 0}

    # No extra traffic while the connection is idle.
    <...idle...>
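The server side of this conversation could be sketched roughly as follows. This is a toy model, not server code: `stream_replies` and `slow_find` are hypothetical names, replies are plain dicts rather than OP_MSG frames, and the command runs in a worker thread purely so that the sketch can emit keepalives while it waits.

```python
import threading
import time


def stream_replies(run_command, keepalive_time_ms):
    """Yield keepalive replies until run_command() finishes, then its result."""
    result = {}
    done = threading.Event()

    def worker():
        try:
            result["reply"] = run_command()
        except Exception as exc:  # report failures as a final reply
            result["reply"] = {"ok": 0, "errmsg": str(exc)}
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    # Event.wait() returns False on timeout, i.e. the command is still running.
    while not done.wait(keepalive_time_ms / 1000.0):
        yield {"ok": 1, "keepalive": 1, "keepaliveFlag": 1}
    # The final response carries keepaliveFlag=0.
    yield {**result["reply"], "keepaliveFlag": 0}


def slow_find():
    time.sleep(0.25)  # stand-in for a long-running command
    return {"ok": 1, "cursor": {"firstBatch": []}}


replies = list(stream_replies(slow_find, keepalive_time_ms=100))
```

Every element of `replies` except the last is a keepalive, and nothing is produced once the command has completed, matching the "no extra traffic while idle" property.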
The benefit of this approach is that we don't rely on TCP timeouts and can instead make the client time out according to 'keepaliveTimeMS'. To account for round-trip time, the timeout would likely become connectTimeoutMS + keepaliveTimeMS, i.e.:
    # Right after the handshake. The options are in milliseconds; settimeout() takes seconds.
    if handshake_response.get('keepalive'):
        # Keepalive protocol supported. Detect dead/blackholed connections sooner.
        sock.settimeout((connectTimeoutMS + keepaliveTimeMS) / 1000.0)
    else:
        sock.settimeout(socketTimeoutMS / 1000.0)

    # Later, when receiving a command response:
    def recv_command_response(sock):
        while True:
            res = recv_wire_response(sock)
            if res.keepaliveFlag:
                continue  # Connection still alive, wait for the next response.
            return res
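To see the client-side behavior end to end, the following self-contained sketch runs the receive loop against a fake server over a local socket pair. Everything here is an assumption made for the demo: the length-prefixed JSON framing stands in for OP_MSG, `fake_server` and `run_demo` are hypothetical helpers, and the 0.2-second timeout stands in for connectTimeoutMS + keepaliveTimeMS.

```python
import json
import socket
import struct
import threading
import time


def send_msg(sock, doc):
    payload = json.dumps(doc).encode()
    sock.sendall(struct.pack("<i", len(payload)) + payload)


def recv_exactly(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))  # raises socket.timeout when silent
        if not chunk:
            raise ConnectionError("connection closed")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = struct.unpack("<i", recv_exactly(sock, 4))
    return json.loads(recv_exactly(sock, length))


def recv_command_response(sock):
    while True:
        res = recv_msg(sock)
        if res.get("keepaliveFlag"):
            continue  # Connection still alive, wait for the next response.
        return res


def fake_server(sock, keepalives, blackhole):
    for _ in range(keepalives):
        time.sleep(0.05)  # toy keepaliveTimeMS of 50 ms
        send_msg(sock, {"ok": 1, "keepalive": 1, "keepaliveFlag": 1})
    if not blackhole:
        send_msg(sock, {"ok": 1, "keepaliveFlag": 0})
    # When blackholed, the server goes silent without closing the socket.


def run_demo(blackhole):
    client, server = socket.socketpair()
    client.settimeout(0.2)  # stands in for connectTimeoutMS + keepaliveTimeMS
    t = threading.Thread(target=fake_server, args=(server, 3, blackhole))
    t.start()
    try:
        return recv_command_response(client)
    except socket.timeout:
        return None  # dead connection detected within the keepalive window
    finally:
        t.join()
        client.close()
        server.close()


healthy = run_demo(blackhole=False)
blackholed = run_demo(blackhole=True)
```

In the healthy run the keepalives keep resetting the read timeout until the final reply arrives. In the blackholed run the client gives up 0.2 seconds after the last keepalive instead of waiting for a TCP-level timeout.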
keepaliveTimeMS would be configurable to support different use cases, but a default of 5 or 10 seconds might be reasonable.