Core Server / SERVER-54900

Blocking networking calls can delay sync-source resolution indefinitely

    • Fully Compatible
    • ALL
    • v5.0, v4.4

      [On v4.4.1] Just starting 3 mongods on localhost with different ports, configuring them into a replica set with chaining disabled, and sending SIGSTOP to the primary reproduces this issue (specifically, the second case) frequently on my machine: the unstopped node that remains a secondary continues to list and use the "dead" node as its sync source until it rejoins the set.
    • Service Arch 2021-11-15, Service Arch 2021-11-22, Service Arch 2021-12-13, Service Arch 2022-1-10

      When the oplog fetcher cannot retrieve a batch of oplog documents from its sync source, it will first try to create a new cursor on that sync source before choosing a new one (see the code here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/db/repl/oplog_fetcher.cpp#L493).

      In doing so, it will first attempt to re-establish the connection to the sync source using its existing DBClientConnection. But in certain edge cases, this connection re-establishment process can make synchronous/blocking networking calls without correctly configured timeouts. If the remote host those calls are trying to communicate with is completely unresponsive, they can block the oplog-fetching thread indefinitely, preventing the selection of a new sync source. In that situation a node can continue to use an unavailable node as its "sync source" until the unavailable node re-contacts the set, allowing the blocking networking call to finish and the oplog fetcher to shut down.
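
      To make the failure mode concrete, here is a minimal standalone sketch using plain POSIX sockets (not MongoDB code; the function name and parameters are made up for illustration) of why a synchronous read with no receive timeout can hang forever on a SIGSTOPped peer, and how setting SO_RCVTIMEO would bound the wait:

      #include <cerrno>
      #include <cstddef>
      #include <cstdio>
      #include <sys/socket.h>
      #include <sys/time.h>
      #include <sys/types.h>

      // Illustrative only. With timeoutSecs <= 0 this behaves like the buggy path:
      // no SO_RCVTIMEO is set, so recv() blocks until the peer responds.
      ssize_t readWithOptionalTimeout(int fd, char* buf, size_t len, int timeoutSecs) {
          if (timeoutSecs > 0) {
              timeval tv{timeoutSecs, 0};
              setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
          }
          // If the peer process is stopped (SIGSTOP), the TCP connection stays
          // established but never produces data or a FIN/RST, so an untimed recv()
          // parks this thread indefinitely. With SO_RCVTIMEO set, recv() instead
          // fails with EAGAIN/EWOULDBLOCK after timeoutSecs.
          ssize_t n = recv(fd, buf, len, 0);
          if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
              std::fprintf(stderr, "read timed out after %d seconds\n", timeoutSecs);
          return n;
      }

      Nothing at the TCP level distinguishes a stopped peer from a slow one, so only a timeout (on the socket or at a higher layer) gets the thread back to sync-source selection.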

       

      I found two cases so far where these calls seem to be made: 

      In the first case, we perform a blocking SSL handshake without a timeout configured on the socket; if the remote node becomes unavailable between establishing the connection and the SSL handshake, we'll see this issue.

      In DBClientConnection::_checkConnection(), between the "Trying to reconnect" log line and the "Reconnect attempt failed" log line, the thread attempting to re-establish the connection to the sync source (node 01) calls DBClientConnection::connect() (see the source code here).
      In trying to re-establish the connection, this thread first calls DBClientConnection::connectSocketOnly(), which in turn calls TransportLayerAsio::connect(). This function delegates to the ASIO library to establish the connection and then creates an ASIOSession to manage it. Because Atlas uses SSL to secure connections, TransportLayerAsio::connect() will then attempt an SSL handshake here, also through ASIO. However, the handshake is attempted synchronously, and no timeout is set on the socket before the attempt (see the code for ensureSync here), so the handshake will block until it either errors or completes. This means that if the remote server (in this case node 01) stops responding to network requests (as it does here), the handshake attempt will block until the remote node comes back up and is able to respond. This explains why node 00 is able to progress when the remote host (node 01) comes back online around 2020-12-11T07:43:05: the FTDC metrics show a slow SSL operation logged at this time on node 00, and, as expected, the connection has timed out and we see this error propagated from connectSocketOnly up through DBClientConnection::connect --> DBClientConnection::_checkConnection and eventually DBClientCursor::init into the logs.
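
      For comparison, a hedged sketch of one way to bound a blocking TLS handshake: set kernel send/receive timeouts on the native socket before calling into the handshake. This uses OpenSSL directly rather than the server's ASIO wrapper, and the function/parameter names are made up:

      #include <openssl/ssl.h>
      #include <sys/socket.h>
      #include <sys/time.h>

      // Sketch: bound a blocking client-side TLS handshake with kernel socket
      // timeouts. Without the two setsockopt() calls, SSL_connect() can wait
      // forever if the peer stops responding mid-handshake.
      bool handshakeWithTimeout(int fd, SSL* ssl, long timeoutSecs) {
          timeval tv{timeoutSecs, 0};
          setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
          setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));

          SSL_set_fd(ssl, fd);
          int rc = SSL_connect(ssl);  // blocking handshake, now bounded by the timeouts
          return rc == 1;             // <= 0: failed or timed out (SSL_ERROR_SYSCALL)
      }

      With the timeouts in place, an unresponsive peer turns into a handshake error after timeoutSecs instead of an indefinitely parked oplog-fetcher thread.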

      A second similar case I've been able to reproduce locally:

      When the oplog fetcher creates a new cursor here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/db/repl/oplog_fetcher.cpp#L647, the old one falls out of scope and is destroyed. The call stack below ensues (I've abbreviated parts):

      in recvmsg()
      ...
      in asio::read
      in mongo::transport::TransportLayerAsio::AsioSession::opportunisticRead
      in mongo::transport::TransportLayerAsio::AsioSession::read
      in mongo::transport::TransportLayerAsio::AsioSession::sourceMessageImpl
      in mongo::transport::TransportLayerAsio::AsioSession::sourceMessage
      in mongo::DBClientConnection::call
      in mongo::DBClientBase::runCommandWithTarget
      in mongo::DBClientConnection::runCommandWithTarget
      in mongo::DBClientBase::runCommand
      in mongo::(DBClientConnection)::initWireVersion
      in mongo::DBClientConnection::connect
      in mongo::ConnectionString::connect
      ....
      in mongo ScopedDbConnection constructor
      in mongo::DBClientCursor::kill
      in mongo::DBClientCursor destructor
      

      In this code, the ScopedDbConnection constructor here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/client/connpool.h#L471 constructs a DBClientConnection with a socket timeout of 0, which here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/client/dbclient_connection.cpp#L610-L611 translates into the DBClientConnection storing the socket timeout as boost::none. When it sets that timeout on the ASIOSession here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/transport/session_asio.h#L204, the ASIOSession stores boost::none as the timeout. This means that when ASIOSession::sourceMessage is called, no timeout is set on the socket before the read is performed (see https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/transport/session_asio.h#L302), allowing the read to block indefinitely until the node comes back online.
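
      As an illustration of the "0 means no timeout" convention described above, here is a sketch with made-up names, using std::optional in place of boost::optional; it mirrors the shape of the linked code, not the code itself:

      #include <chrono>
      #include <optional>
      #include <sys/socket.h>
      #include <sys/time.h>

      using Seconds = std::chrono::seconds;

      // Mirrors the dbclient_connection.cpp lines linked above: a socket timeout
      // of 0 is stored as "no timeout", not "time out immediately".
      std::optional<Seconds> parseSoTimeout(double soTimeoutSecs) {
          if (soTimeoutSecs <= 0)
              return std::nullopt;
          return Seconds{static_cast<long long>(soTimeoutSecs)};
      }

      // Mirrors the ASIOSession behaviour: when no timeout is stored, nothing is
      // applied to the socket, so the blocking read issued while the destroyed
      // cursor's killCursors command round-trips can wait on an unresponsive peer
      // indefinitely.
      void applyTimeoutBeforeRead(int fd, const std::optional<Seconds>& timeout) {
          if (!timeout)
              return;  // unbounded read
          timeval tv{static_cast<time_t>(timeout->count()), 0};
          setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
      }

      Because the ScopedDbConnection in the kill path is constructed with a socket timeout of 0, every read on that connection takes the unbounded branch.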

      I've repro'd the issue on 4.4.1, but we should also audit the main branch for this issue – it doesn't look like the relevant code has changed.

       

      Next steps:

      Write a test to reproduce the issue described in this ticket 

      Determine the level at which to solve this issue 

       

            Assignee:
            amirsaman.memaripour@mongodb.com Amirsaman Memaripour
            Reporter:
            george.wangensteen@mongodb.com George Wangensteen
            Votes:
            0
            Watchers:
            23
