Core Server / SERVER-45270

Increased vulnerability to slow DNS


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 4.2.1
    • Fix Version/s: 4.2.3, 4.3.3
    • Component/s: Replication
    • Labels:
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.2
    • Steps To Reproduce:

      Some customers have observed 4.2 to be less stable than 4.0 when DNS resolution is delayed by a few seconds. This can be reproduced by adding sleeps at the various places where getaddrinfo is called; in 4.2, but not in 4.0, this causes repeated elections, delays in connecting, gaps in ftdc and other data collection, and probably other problems.
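For illustration, one way to inject such a delay without patching the server is an LD_PRELOAD shim that wraps getaddrinfo. This is only a sketch; the file name and the delay length are arbitrary:

```cpp
// slow_dns.cpp -- hypothetical LD_PRELOAD shim that delays every
// getaddrinfo() call, simulating slow DNS. Build as a shared library:
//   g++ -shared -fPIC -o slow_dns.so slow_dns.cpp -ldl
// then run the server under it:
//   LD_PRELOAD=./slow_dns.so mongod ...
#define _GNU_SOURCE 1  // for RTLD_NEXT (g++ usually defines this already)
#include <dlfcn.h>
#include <netdb.h>
#include <unistd.h>

extern "C" int getaddrinfo(const char* node, const char* service,
                           const struct addrinfo* hints,
                           struct addrinfo** res) {
    using real_fn = int (*)(const char*, const char*,
                            const struct addrinfo*, struct addrinfo**);
    // Look up the real libc getaddrinfo once.
    static real_fn real =
        reinterpret_cast<real_fn>(dlsym(RTLD_NEXT, "getaddrinfo"));
    sleep(1);  // bump this to a few seconds to reproduce the instability
    return real(node, service, hints, res);
}
```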

      This appears to be because in 4.2 ReplicationCoordinatorImpl::processReplSetGetStatus calls TopologyCoordinator::prepareStatusResponse while holding ReplicationCoordinatorImpl::_mutex, and then TopologyCoordinator::prepareStatusResponse calls resolveAddrInfo while holding that lock here:

      #0  mongo::(anonymous namespace)::resolveAddrInfo(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, unsigned short) () at src/mongo/util/net/sockaddr.cpp:66
      #1  0x000055fc7c41236d in mongo::SockAddr::SockAddr(mongo::StringData, int, unsigned short) () at src/mongo/util/net/sockaddr.cpp:144
      #2  0x000055fc7c414791 in mongo::hostbyname[abi:cxx11](char const*) () at src/mongo/base/string_data.h:97
      #3  0x000055fc7ad059d2 in mongo::repl::(anonymous namespace)::appendIP(mongo::BSONObjBuilder*, char const*, mongo::HostAndPort const&) [clone .isra.528] ()
      #4  0x000055fc7ad0dd33 in mongo::repl::TopologyCoordinator::prepareStatusResponse(mongo::repl::TopologyCoordinator::ReplSetStatusArgs const&, mongo::BSONObjBuilder*, mongo::Status*) ()
      #5  0x000055fc7a72f14b in mongo::repl::ReplicationCoordinatorImpl::processReplSetGetStatus (this=0x55fc803a1080, response=0x7f325c9fb740, responseStyle=<optimized out>)
      #6  0x000055fc7acb8db0 in mongo::repl::CmdReplSetGetStatus::run (this=<optimized out>, opCtx=0x55fc850d6a00, cmdObj=..., result=...) at src/mongo/db/repl/repl_set_get_status_cmd.cpp:67
      #7  0x000055fc7be899b5 in mongo::BasicCommand::Invocation::run (result=0x7f325c9fb6b0, opCtx=0x55fc850d6a00, this=0x55fc854d1940) at src/mongo/db/commands.cpp:590
      #8  mongo::CommandHelpers::runCommandDirectly(mongo::OperationContext*, mongo::OpMsgRequest const&) () at src/mongo/db/commands.cpp:146
      #9  0x000055fc7ae812cd in mongo::FTDCSimpleInternalCommandCollector::collect(mongo::OperationContext*, mongo::BSONObjBuilder&) () at src/mongo/db/ftdc/ftdc_server.cpp:183
      #10 0x000055fc7aeacbf4 in mongo::FTDCCollectorCollection::collect(mongo::Client*) () at /hdd1/mongodbtoolchain/stow/gcc-v3.Ifr/include/c++/8.2.0/bits/unique_ptr.h:342
      #11 0x000055fc7aeb0d8b in mongo::FTDCController::doLoop() () at src/mongo/db/ftdc/controller.cpp:245
      #12 0x000055fc7c5a0cdf in execute_native_thread_routine () at ../../../../../src/combined/libstdc++-v3/src/c++11/thread.cc:80
      #13 0x00007f326ab116db in start_thread (arg=0x7f325c9fc700) at pthread_create.c:463
      #14 0x00007f326a83a88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
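To make the hazard in the stack above concrete, here is a minimal standalone sketch (not MongoDB source; all names are invented) contrasting a blocking lookup done under the lock with one done after snapshotting the needed state and releasing the lock:

```cpp
// Sketch: a slow name lookup under a widely-contended mutex stalls every
// other thread that needs that mutex (heartbeats, elections, ...).
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex stateMutex;  // stands in for ReplicationCoordinatorImpl::_mutex
std::vector<std::string> members = {"host1:27017", "host2:27017"};

// Stand-in for a blocking getaddrinfo()-based lookup that may stall.
std::string slowResolve(const std::string&) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // fake delay
    return "10.0.0.1";  // dummy address
}

// Hazardous pattern: lookups run under the lock, so anyone contending on
// stateMutex waits out the full DNS delay for every member.
std::vector<std::string> statusUnderLock() {
    std::lock_guard<std::mutex> lk(stateMutex);
    std::vector<std::string> ips;
    for (const auto& m : members) ips.push_back(slowResolve(m));
    return ips;
}

// Safer pattern: copy what the lookup needs, drop the lock, then resolve.
std::vector<std::string> statusOutsideLock() {
    std::vector<std::string> snapshot;
    {
        std::lock_guard<std::mutex> lk(stateMutex);
        snapshot = members;  // lock held only for the cheap copy
    }
    std::vector<std::string> ips;
    for (const auto& m : snapshot) ips.push_back(slowResolve(m));
    return ips;
}
```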
      

      If the getaddrinfo call in resolveAddrInfo is delayed, responses to heartbeats and other replication-related operations are delayed as well. Since replSetGetStatus is executed frequently by monitoring software (ftdc, Cloud monitoring), there is a high likelihood that a DNS delay will result in an election, among other issues.

      Based on some added debug logging, it appears that 4.0 doesn't call resolveAddrInfo on every replSetGetStatus and only calls getaddrinfo from getAddrsForHost infrequently, e.g. at startup, so it is much less vulnerable to DNS issues.

      I don't know if this is the only path where we can call getaddrinfo while holding a lock; possibly we should audit the code for that.
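If a fix follows the 4.0 route of resolving infrequently, one minimal shape for it is a per-host cache populated with no hot lock held. This is purely illustrative; AddrCache, slowResolve, and the counter are invented for the sketch:

```cpp
// Sketch: cache each host's resolved address so frequent
// replSetGetStatus-style calls don't repeatedly hit DNS, and perform the
// actual resolution without holding the cache lock.
#include <map>
#include <mutex>
#include <string>

int g_resolveCalls = 0;  // instrumentation for the sketch only

// Stand-in for a blocking getaddrinfo()-based lookup.
static std::string slowResolve(const std::string&) {
    ++g_resolveCalls;
    return "10.0.0.1";  // dummy address
}

class AddrCache {
public:
    std::string lookup(const std::string& host) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            auto it = _cache.find(host);
            if (it != _cache.end())
                return it->second;  // hit: no DNS traffic, lock held briefly
        }
        // Miss: resolve with no lock held, then publish the result.
        std::string ip = slowResolve(host);
        std::lock_guard<std::mutex> lk(_mutex);
        return _cache.emplace(host, ip).first->second;
    }

private:
    std::mutex _mutex;
    std::map<std::string, std::string> _cache;
};
```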

    • Sprint:
      Repl 2019-12-30, Repl 2020-01-13
    • Case:

    Attachments

    Issue Links

    Activity

    People

    Assignee:
    lingzhi.deng Lingzhi Deng
    Reporter:
    bruce.lucas Bruce Lucas
    Participants:
    Votes:
    0
    Watchers:
    24
