Core Server / SERVER-54900

Blocking networking calls can delay sync-source resolution indefinitely

    • Fully Compatible
    • ALL
    • v5.0, v4.4

      [On v4.4.1] Just starting 3 mongods on localhost with different ports, configuring them into a replica set with chaining disabled, and sending SIGSTOP to the primary reproduces this issue (specifically, the second case) frequently on my machine: the unstopped node that remains a secondary continues to list and use the "dead" node as its sync source until it rejoins the set.
    • Service Arch 2021-11-15, Service Arch 2021-11-22, Service Arch 2021-12-13, Service Arch 2022-1-10

      When the oplog fetcher cannot retrieve a batch of oplog documents from its sync source, it will first try to create a new cursor on that sync source before choosing a new one (see the code here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/db/repl/oplog_fetcher.cpp#L493).

      In doing so, it will first attempt to re-establish the connection to the sync source using its existing DBClientConnection. But in certain edge cases, this connection re-establishment process can make synchronous/blocking networking calls without correctly configured timeouts. If the remote host those calls are trying to communicate with is completely unresponsive, they can block the oplog-fetching thread indefinitely, preventing the selection of a new sync source. In that situation a node can continue to use an unavailable node as its "sync source" until the unavailable node re-contacts the set, allowing the blocking networking call to finish and the oplog fetcher to shut down.
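
      To make the failure mode concrete, here is a minimal standalone sketch using plain POSIX sockets (not MongoDB code; the function name and parameters are made up for illustration) of why a synchronous read with no receive timeout can hang forever on a SIGSTOPped peer, and how setting SO_RCVTIMEO would bound the wait:

      #include <cerrno>
      #include <cstddef>
      #include <cstdio>
      #include <sys/socket.h>
      #include <sys/time.h>
      #include <sys/types.h>

      // Illustrative only. With timeoutSecs <= 0 this behaves like the buggy path:
      // no SO_RCVTIMEO is set, so recv() blocks until the peer responds.
      ssize_t readWithOptionalTimeout(int fd, char* buf, size_t len, int timeoutSecs) {
          if (timeoutSecs > 0) {
              timeval tv{timeoutSecs, 0};
              setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
          }
          // If the peer process is stopped (SIGSTOP), the TCP connection stays
          // established but never produces data or a FIN/RST, so an untimed recv()
          // parks this thread indefinitely. With SO_RCVTIMEO set, recv() instead
          // fails with EAGAIN/EWOULDBLOCK after timeoutSecs.
          ssize_t n = recv(fd, buf, len, 0);
          if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
              std::fprintf(stderr, "read timed out after %d seconds\n", timeoutSecs);
          return n;
      }

      Nothing at the TCP level distinguishes a stopped peer from a slow one, so only a timeout (on the socket or at a higher layer) gets the thread back to sync-source selection.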

       

      I found two cases so far where these calls seem to be made: 

      In the first case, we perform a blocking SSL handshake without a timeout configured on the socket; if the remote node becomes unavailable between establishing the connection and the SSL handshake, we'll see this issue.

      In DBClientConnection::_checkConnection(), between the "Trying to reconnect" log line and the "Reconnect attempt failed" log line, the thread attempting to re-establish the connection to the sync source (node 01) calls DBClientConnection::connect() (see the source code here).
      In trying to re-establish the connection, this thread first calls DBClientConnection::connectSocketOnly(), which in turn calls TransportLayerAsio::connect(). This function delegates to the ASIO library to establish the connection and then creates an ASIOSession to manage it. Because Atlas uses SSL to secure connections, TransportLayerAsio::connect() will then attempt an SSL handshake here, also through ASIO. However, the handshake is attempted synchronously, and no timeout is set on the socket before the attempt (see the code for ensureSync here), so the handshake will block until it either errors or completes. This means that if the remote server (in this case node 01) stops responding to network requests (as it does here), the handshake attempt will block until the remote node comes back up and is able to respond. This explains why node 00 is able to progress when the remote host (node 01) comes back online around 2020-12-11T07:43:05: the FTDC metrics show a slow SSL operation logged at this time on node 00, and, as expected, the connection has timed out and we see this error propagated from connectSocketOnly up through DBClientConnection::connect --> DBClientConnection::_checkConnection and eventually DBClientCursor::init into the logs.
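
      For comparison, a hedged sketch of one way to bound a blocking TLS handshake: set kernel send/receive timeouts on the native socket before calling into the handshake. This uses OpenSSL directly rather than the server's ASIO wrapper, and the function/parameter names are made up:

      #include <openssl/ssl.h>
      #include <sys/socket.h>
      #include <sys/time.h>

      // Sketch: bound a blocking client-side TLS handshake with kernel socket
      // timeouts. Without the two setsockopt() calls, SSL_connect() can wait
      // forever if the peer stops responding mid-handshake.
      bool handshakeWithTimeout(int fd, SSL* ssl, long timeoutSecs) {
          timeval tv{timeoutSecs, 0};
          setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
          setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));

          SSL_set_fd(ssl, fd);
          int rc = SSL_connect(ssl);  // blocking handshake, now bounded by the timeouts
          return rc == 1;             // <= 0: failed or timed out (SSL_ERROR_SYSCALL)
      }

      With the timeouts in place, an unresponsive peer turns into a handshake error after timeoutSecs instead of an indefinitely parked oplog-fetcher thread.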

      A second similar case I've been able to reproduce locally:

      When the oplog fetcher creates a new cursor here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/db/repl/oplog_fetcher.cpp#L647, the old one falls out of scope and is destroyed. The call stack below ensues (I've abbreviated parts):

      in recvmsg()
      ...
      in asio::read
      in mongo::transport::TransportLayerAsio::AsioSession::opportunisticRead
      in mongo::transport::TransportLayerAsio::AsioSession::read
      in mongo::transport::TransportLayerAsio::AsioSession::sourceMessageImpl
      in mongo::transport::TransportLayerAsio::AsioSession::sourceMessage
      in mongo::DBClientConnection::call
      in mongo::DBClientBase::runCommandWithTarget
      in mongo::DBClientConnection::runCommandWithTarget
      in mongo::DBClientBase::runCommand
      in mongo::(DBClientConnection)::initWireVersion
      in mongo::DBClientConnection::connect
      in mongo::ConnectionString::connect
      ....
      in mongo ScopedDbConnection constructor
      in mongo::DBClientCursor::kill
      in mongo::DBClientCursor destructor
      

      In this code, the ScopedDbConnection constructor here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/client/connpool.h#L471 constructs a DBClientConnection with a socket timeout of 0, which here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/client/dbclient_connection.cpp#L610-L611 translates into the DBClientConnection storing the socket timeout as boost::none. When it sets that timeout on the ASIOSession here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/transport/session_asio.h#L204, the ASIOSession stores boost::none as the timeout. This means that when ASIOSession::sourceMessage is called, no timeout is set on the socket before the read is performed (see https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/transport/session_asio.h#L302), allowing the read to block indefinitely until the node comes back online.
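
      As an illustration of the "0 means no timeout" convention described above, here is a sketch with made-up names, using std::optional in place of boost::optional; it mirrors the shape of the linked code, not the code itself:

      #include <chrono>
      #include <optional>
      #include <sys/socket.h>
      #include <sys/time.h>

      using Seconds = std::chrono::seconds;

      // Mirrors the dbclient_connection.cpp lines linked above: a socket timeout
      // of 0 is stored as "no timeout", not "time out immediately".
      std::optional<Seconds> parseSoTimeout(double soTimeoutSecs) {
          if (soTimeoutSecs <= 0)
              return std::nullopt;
          return Seconds{static_cast<long long>(soTimeoutSecs)};
      }

      // Mirrors the ASIOSession behaviour: when no timeout is stored, nothing is
      // applied to the socket, so the blocking read issued while the destroyed
      // cursor's killCursors command round-trips can wait on an unresponsive peer
      // indefinitely.
      void applyTimeoutBeforeRead(int fd, const std::optional<Seconds>& timeout) {
          if (!timeout)
              return;  // unbounded read
          timeval tv{static_cast<time_t>(timeout->count()), 0};
          setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
      }

      Because the ScopedDbConnection in the kill path is constructed with a socket timeout of 0, every read on that connection takes the unbounded branch.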

      I've repro'd the issue on 4.4.1, but we should also audit the main branch for this issue – it doesn't look like the relevant code has changed.

       

      Next steps:

      Write a test to reproduce the issue described in this ticket 

      Determine the level at which to solve this issue 

       

            Assignee:
            amirsaman.memaripour@mongodb.com Amirsaman Memaripour
            Reporter:
            george.wangensteen@mongodb.com George Wangensteen
            Votes:
            0
            Watchers:
            23
