Core Server / SERVER-45270

Increased vulnerability to slow DNS


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 4.2.1
    • Fix Version/s: 4.2.3, 4.3.3
    • Component/s: Replication
    • Labels:
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.2
    • Steps To Reproduce:

      Some customers have observed 4.2 to be less stable than 4.0 when DNS resolution is delayed by a few seconds. This can be reproduced by adding sleeps at the various places where getaddrinfo is called; in 4.2, but not in 4.0, this causes repeated elections, delays in connecting, gaps in ftdc and other data collection, and probably other problems.
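For illustration, one way to inject such a delay without patching the server is an LD_PRELOAD shim that wraps getaddrinfo. This is only a sketch; the file name and the delay length are arbitrary:

```cpp
// slow_dns.cpp -- hypothetical LD_PRELOAD shim that delays every
// getaddrinfo() call, simulating slow DNS. Build as a shared library:
//   g++ -shared -fPIC -o slow_dns.so slow_dns.cpp -ldl
// then run the server under it:
//   LD_PRELOAD=./slow_dns.so mongod ...
#define _GNU_SOURCE 1  // for RTLD_NEXT (g++ usually defines this already)
#include <dlfcn.h>
#include <netdb.h>
#include <unistd.h>

extern "C" int getaddrinfo(const char* node, const char* service,
                           const struct addrinfo* hints,
                           struct addrinfo** res) {
    using real_fn = int (*)(const char*, const char*,
                            const struct addrinfo*, struct addrinfo**);
    // Look up the real libc getaddrinfo once.
    static real_fn real =
        reinterpret_cast<real_fn>(dlsym(RTLD_NEXT, "getaddrinfo"));
    sleep(1);  // bump this to a few seconds to reproduce the instability
    return real(node, service, hints, res);
}
```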

      This appears to be because in 4.2 ReplicationCoordinatorImpl::processReplSetGetStatus calls TopologyCoordinator::prepareStatusResponse while holding ReplicationCoordinatorImpl::_mutex, and then TopologyCoordinator::prepareStatusResponse calls resolveAddrInfo while holding that lock here:

      #0  mongo::(anonymous namespace)::resolveAddrInfo(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, unsigned short) () at src/mongo/util/net/sockaddr.cpp:66
      #1  0x000055fc7c41236d in mongo::SockAddr::SockAddr(mongo::StringData, int, unsigned short) () at src/mongo/util/net/sockaddr.cpp:144
      #2  0x000055fc7c414791 in mongo::hostbyname[abi:cxx11](char const*) () at src/mongo/base/string_data.h:97
      #3  0x000055fc7ad059d2 in mongo::repl::(anonymous namespace)::appendIP(mongo::BSONObjBuilder*, char const*, mongo::HostAndPort const&) [clone .isra.528] ()
      #4  0x000055fc7ad0dd33 in mongo::repl::TopologyCoordinator::prepareStatusResponse(mongo::repl::TopologyCoordinator::ReplSetStatusArgs const&, mongo::BSONObjBuilder*, mongo::Status*) ()
      #5  0x000055fc7a72f14b in mongo::repl::ReplicationCoordinatorImpl::processReplSetGetStatus (this=0x55fc803a1080, response=0x7f325c9fb740, responseStyle=<optimized out>)
      #6  0x000055fc7acb8db0 in mongo::repl::CmdReplSetGetStatus::run (this=<optimized out>, opCtx=0x55fc850d6a00, cmdObj=..., result=...) at src/mongo/db/repl/repl_set_get_status_cmd.cpp:67
      #7  0x000055fc7be899b5 in mongo::BasicCommand::Invocation::run (result=0x7f325c9fb6b0, opCtx=0x55fc850d6a00, this=0x55fc854d1940) at src/mongo/db/commands.cpp:590
      #8  mongo::CommandHelpers::runCommandDirectly(mongo::OperationContext*, mongo::OpMsgRequest const&) () at src/mongo/db/commands.cpp:146
      #9  0x000055fc7ae812cd in mongo::FTDCSimpleInternalCommandCollector::collect(mongo::OperationContext*, mongo::BSONObjBuilder&) () at src/mongo/db/ftdc/ftdc_server.cpp:183
      #10 0x000055fc7aeacbf4 in mongo::FTDCCollectorCollection::collect(mongo::Client*) () at /hdd1/mongodbtoolchain/stow/gcc-v3.Ifr/include/c++/8.2.0/bits/unique_ptr.h:342
      #11 0x000055fc7aeb0d8b in mongo::FTDCController::doLoop() () at src/mongo/db/ftdc/controller.cpp:245
      #12 0x000055fc7c5a0cdf in execute_native_thread_routine () at ../../../../../src/combined/libstdc++-v3/src/c++11/thread.cc:80
      #13 0x00007f326ab116db in start_thread (arg=0x7f325c9fc700) at pthread_create.c:463
      #14 0x00007f326a83a88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
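To make the hazard in the stack above concrete, here is a minimal standalone sketch (not MongoDB source; all names are invented) contrasting a blocking lookup done under the lock with one done after snapshotting the needed state and releasing the lock:

```cpp
// Sketch: a slow name lookup under a widely-contended mutex stalls every
// other thread that needs that mutex (heartbeats, elections, ...).
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex stateMutex;  // stands in for ReplicationCoordinatorImpl::_mutex
std::vector<std::string> members = {"host1:27017", "host2:27017"};

// Stand-in for a blocking getaddrinfo()-based lookup that may stall.
std::string slowResolve(const std::string&) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // fake delay
    return "10.0.0.1";  // dummy address
}

// Hazardous pattern: lookups run under the lock, so anyone contending on
// stateMutex waits out the full DNS delay for every member.
std::vector<std::string> statusUnderLock() {
    std::lock_guard<std::mutex> lk(stateMutex);
    std::vector<std::string> ips;
    for (const auto& m : members) ips.push_back(slowResolve(m));
    return ips;
}

// Safer pattern: copy what the lookup needs, drop the lock, then resolve.
std::vector<std::string> statusOutsideLock() {
    std::vector<std::string> snapshot;
    {
        std::lock_guard<std::mutex> lk(stateMutex);
        snapshot = members;  // lock held only for the cheap copy
    }
    std::vector<std::string> ips;
    for (const auto& m : snapshot) ips.push_back(slowResolve(m));
    return ips;
}
```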
      

      If the getaddrinfo call in resolveAddrInfo is delayed, responses to heartbeats and other replication-related operations are delayed as well. Since replSetGetStatus is executed frequently by monitoring software (ftdc, Cloud monitoring), there is a high likelihood that a DNS delay will result in an election, among other issues.

      Based on some added debug logging, it appears that 4.0 doesn't call resolveAddrInfo on every replSetGetStatus and only calls getaddrinfo from getAddrsForHost infrequently, e.g. at startup, so it is much less vulnerable to DNS issues.

      I don't know if this is the only path where we can call getaddrinfo while holding a lock; possibly we should audit the code for that.
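If a fix follows the 4.0 route of resolving infrequently, one minimal shape for it is a per-host cache populated with no hot lock held. This is purely illustrative; AddrCache, slowResolve, and the counter are invented for the sketch:

```cpp
// Sketch: cache each host's resolved address so frequent
// replSetGetStatus-style calls don't repeatedly hit DNS, and perform the
// actual resolution without holding the cache lock.
#include <map>
#include <mutex>
#include <string>

int g_resolveCalls = 0;  // instrumentation for the sketch only

// Stand-in for a blocking getaddrinfo()-based lookup.
static std::string slowResolve(const std::string&) {
    ++g_resolveCalls;
    return "10.0.0.1";  // dummy address
}

class AddrCache {
public:
    std::string lookup(const std::string& host) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            auto it = _cache.find(host);
            if (it != _cache.end())
                return it->second;  // hit: no DNS traffic, lock held briefly
        }
        // Miss: resolve with no lock held, then publish the result.
        std::string ip = slowResolve(host);
        std::lock_guard<std::mutex> lk(_mutex);
        return _cache.emplace(host, ip).first->second;
    }

private:
    std::mutex _mutex;
    std::map<std::string, std::string> _cache;
};
```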

    • Sprint:
      Repl 2019-12-30, Repl 2020-01-13
    • Case:

    Attachments

    Issue Links

    Activity

    People

    Assignee:
    lingzhi.deng Lingzhi Deng
    Reporter:
    bruce.lucas Bruce Lucas
    Participants:
    Votes:
    0
    Watchers:
    24
