[SERVER-54900] Blocking networking calls can delay sync-source resolution indefinitely Created: 02/Mar/21  Updated: 29/Oct/23  Resolved: 21/Dec/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.3.0, 4.4.15, 5.0.15

Type: Bug Priority: Major - P3
Reporter: George Wangensteen Assignee: Amirsaman Memaripour
Resolution: Fixed Votes: 0
Labels: servicearch-q4-2021, servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-62035 Investigate large delays in `CertGetC... Backlog
related to SERVER-62665 Ensure `TransportLayerASIO::TimerServ... Closed
related to SERVER-65416 Complete TODO listed in SERVER-30212 Open
related to SERVER-62199 Use the `TimerService` to ensure `Tra... Backlog
is related to SERVER-66456 remove ThreadContext Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0, v4.4
Steps To Reproduce:

[On v4.4.1] Just starting 3 mongods on localhost with different ports, configuring them into a replica set with chaining disabled, and sending SIGSTOP to the primary reproduces this issue (specifically, the second case) frequently on my machine (the unstopped node that remains a secondary continues to list and use the "dead" node as its sync source until that node rejoins the set).

Sprint: Service Arch 2021-11-15, Service Arch 2021-11-22, Service Arch 2021-12-13, Service Arch 2022-1-10
Participants:
Case:
Linked BF Score: 19
Story Points: 5

 Description   

When the oplog fetcher cannot retrieve a batch of oplog documents from its sync source, it will first try to create a new cursor on that sync source before choosing a new one (see the code here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/db/repl/oplog_fetcher.cpp#L493).

In doing so, it first attempts to re-establish the connection to the sync source using its existing DBClientConnection. But in certain edge cases, this connection-reestablishment process can make synchronous/blocking networking calls without correctly configured timeouts. If the remote host that the networking call is trying to communicate with is completely unresponsive, this can block the oplog-fetching thread indefinitely, preventing the selection of a new sync source. In that situation, a node can continue to use an unavailable node as its "sync source" until the unavailable node re-contacts the set, allowing the blocking networking call to finish and the oplog fetcher to shut down.
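To illustrate the failure mode in isolation (a standalone sketch with a hypothetical helper, not MongoDB code): a blocking recv() against a peer that has been SIGSTOP'd will wait forever unless a receive timeout is configured on the socket first, e.g. via SO_RCVTIMEO.

    #include <sys/socket.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <cerrno>
    #include <cstdio>

    // Hypothetical helper: bound a blocking read on `fd` to `seconds` seconds.
    bool readWithTimeout(int fd, char* buf, size_t len, long seconds) {
        timeval tv{};
        tv.tv_sec = seconds;
        // Without this setsockopt(), the recv() below blocks until the peer
        // responds or the connection is torn down, i.e. the behavior described above.
        if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0)
            return false;
        ssize_t n = recv(fd, buf, len, 0);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            std::fprintf(stderr, "read timed out after %ld s\n", seconds);
            return false;
        }
        return n >= 0;
    }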

 

I found two cases so far where these calls seem to be made: 

In the first case, we perform a blocking SSL handshake without a timeout configured on the socket; if the remote node becomes unavailable between establishing the connection and the SSL handshake, we'll see this issue.

In DBClientConnection::_checkConnection(), between the "Trying to reconnect" log line and the "Reconnect attempt failed" log line, the thread attempting to re-establish the connection to the sync source (node 01) calls DBClientConnection::connect() (see the source code here).
In trying to re-establish the connection, this thread first calls DBClientConnection::connectSocketOnly(), which in turn calls TransportLayerAsio::connect(). This function delegates to the ASIO library to establish the connection, and then creates an ASIOSession to manage it. Because Atlas uses SSL to secure connections, TransportLayerAsio::connect() will then attempt an SSL handshake here, also done through ASIO. However, the SSL handshake is attempted synchronously, and no timeout is set on the socket before the attempt (see the code for ensureSync here), so the handshake attempt will block until it errors or completes. This means that if the remote server (in this case node 01) stops responding to network requests (as it does here), the handshake attempt will block until the remote node comes back up and is able to respond. This explains why node 00 is able to progress when the remote host (node 01) comes back online around 2020-12-11T07:43:05: the FTDC metrics show a slow SSL op logged on node 00 at that time, and, as expected, the connection has timed out and we see this error propagated from connectSocketOnly up through DBClientConnection::connect --> DBClientConnection::_checkConnection and eventually DBClientCursor::init into the logs.

A second similar case I've been able to reproduce locally:

When the oplog fetcher creates a new cursor here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/db/repl/oplog_fetcher.cpp#L647, the old one goes out of scope and is destroyed. The call stack below ensues (I've abbreviated parts):

in recvmsg()
...
in asio::read
in mongo::transport::TransportLayerAsio::AsioSession::opportunisticRead
in mongo::transport::TransportLayerAsio::AsioSession::read
in mongo::transport::TransportLayerAsio::AsioSession::sourceMessageImpl
in mongo::transport::TransportLayerAsio::AsioSession::sourceMessage
in mongo::DBClientConnection::call
in mongo::DBClientBase::runCommandWithTarget
in mongo::DBClientConnection::runCommandWithTarget
in mongo::DBClientBase::runCommand
in mongo::(DBClientConnection)::initWireVersion
in mongo::DBClientConnection::connect
in mongo::ConnectionString::connect
....
in mongo ScopedDbConnection constructor
in mongo::DBClientCursor::kill
in mongo::DBClientCursor destructor

In this code, the ScopedDbConnection constructor here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/client/connpool.h#L471 constructs a DBClientConnection with a socket timeout of 0, which translates here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/client/dbclient_connection.cpp#L610-L611 into the DBClientConnection storing the socket timeout as boost::none. When it sets that timeout on the ASIOSession here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/transport/session_asio.h#L204, the ASIOSession stores boost::none as the timeout. This means that when ASIOSession::sourceMessage is called, no timeout is set on the socket before the read is performed (see https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/transport/session_asio.h#L302), which allows the read to block indefinitely until the node comes back online.
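A compact paraphrase of that propagation (hypothetical names, not the actual DBClientConnection/ASIOSession code): a timeout of 0 seconds is stored as a disengaged boost::optional, and a disengaged optional means no deadline is applied before the blocking read.

    #include <boost/optional.hpp>
    #include <chrono>

    using Milliseconds = std::chrono::milliseconds;

    // Mirrors "a socket timeout of 0 translates into boost::none".
    inline boost::optional<Milliseconds> toSessionTimeout(double timeoutSecs) {
        if (timeoutSecs <= 0)
            return boost::none;  // interpreted as "no timeout at all"
        return Milliseconds(static_cast<long long>(timeoutSecs * 1000));
    }

    struct FakeSession {
        boost::optional<Milliseconds> _configuredTimeout;

        void sourceMessage() {
            if (_configuredTimeout) {
                // A deadline would be set on the socket before reading, so a dead
                // peer eventually produces a timeout error.
            } else {
                // No deadline is set: the blocking read can wait on the peer
                // indefinitely, which is the hang described in this second case.
            }
            // ... perform the (possibly unbounded) blocking read ...
        }
    };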

I've repro'd the issue on 4.4.1, but we should also audit the main branch for this issue – it doesn't look like the relevant code has changed.

 

Next steps:

Write a test to reproduce the issue described in this ticket 

Determine the level at which to solve this issue

 



 Comments   
Comment by Jack Wearden [ 25/Jan/23 ]

Thanks for adding the backport requests. Is it too late to also request backport to 4.2?

Comment by Githook User [ 05/Dec/22 ]

Author:

{'name': 'Amirsaman Memaripour', 'email': 'amirsaman.memaripour@mongodb.com', 'username': 'samanca'}

Message: SERVER-54900 Cancel ASIO session when SSL handshake times out

(cherry picked from commit 263c0631c8001a8cdb42aff720b8a49a621754dd)
Branch: v5.0
https://github.com/mongodb/mongo/commit/99f9a9588825842040c945a941d2b34554b9bea9

Comment by Adam Rayner [ 16/Sep/22 ]

Requesting backport to 5.0 - this was backported to 4.4 and is present in 5.3+, so should probably also be backported to 5.0.

Comment by Githook User [ 08/Jun/22 ]

Author:

{'name': 'Amirsaman Memaripour', 'email': 'amirsaman.memaripour@mongodb.com', 'username': 'samanca'}

Message: SERVER-54900 Cancel ASIO session when SSL handshake times out

(cherry picked from commit 263c0631c8001a8cdb42aff720b8a49a621754dd)
Branch: v4.4
https://github.com/mongodb/mongo/commit/65134ea4e05c6c6d2647186782775ae0cdeb712e

Comment by Githook User [ 21/Dec/21 ]

Author:

{'name': 'Amirsaman Memaripour', 'email': 'amirsaman.memaripour@mongodb.com', 'username': 'samanca'}

Message: SERVER-54900 Cancel ASIO session when SSL handshake times out
Branch: master
https://github.com/mongodb/mongo/commit/263c0631c8001a8cdb42aff720b8a49a621754dd
