[SERVER-45270] Increased vulnerability to slow DNS Created: 20/Dec/19 Updated: 08/Jan/24 Resolved: 30/Dec/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 4.2.1 |
| Fix Version/s: | 4.2.3, 4.3.3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | KP42 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||
| Backport Requested: |
v4.2
|
||||||||||||||||||||||||||||
| Steps To Reproduce: | Some customers have observed 4.2 to be less stable than 4.0 if there are delays in DNS of a few seconds. This can be reproduced by adding sleeps to the various places where getaddrinfo is called, which causes repeated elections, delays in connecting, and gaps in ftdc or other data collection, and probably other problems, in 4.2 but not in 4.0. This appears to be because in 4.2 ReplicationCoordinatorImpl::processReplSetGetStatus calls TopologyCoordinator::prepareStatusResponse while holding ReplicationCoordinatorImpl::_mutex, and then TopologyCoordinator::prepareStatusResponse calls resolveAddrInfo while holding that lock here:
If the getaddrinfo in resolveAddrInfo is delayed this causes delays in responding to heartbeats and other replication-related operations. Since replSetGetStatus is executed frequently by monitoring software (ftdc, Cloud monitoring) this means that there is a high likelihood that a delay in DNS will result in an election among other issues. Based on some added debug logging it appears that in 4.0 we don't call resolveAddrInfo on every replSetGetStatus, and apparently only call getaddrinfo from getAddrsForHost infrequently, e.g. at startup, so it is much less vulnerable to DNS issues. I don't know if this is the only path where we can call getaddrinfo while holding a lock; possibly we should audit the code for that. |
||||||||||||||||||||||||||||
| Sprint: | Repl 2019-12-30, Repl 2020-01-13 | ||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||||||||||||||
| Comments |
| Comment by Lingzhi Deng [ 26/Dec/19 ] |
|
I re-opened |
| Comment by Githook User [ 26/Dec/19 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}Message: This reverts commit e1e95afbd58a7449dd1765cb910b4d136f95fcc4. |
| Comment by Andy Schwerin [ 23/Dec/19 ] |
|
Maybe we should stop logging while holding mute Ed. Or is it the logger On Mon, Dec 23, 2019 at 12:17 PM Eric Milkie (Jira) <jira@mongodb.org> |
| Comment by Eric Milkie [ 23/Dec/19 ] |
|
Coverity does have such a checker, but I had to turn it off because it flags thousands of places in our code - mostly due to log() statements, since those do I/O. We need to write a special model for that function for Coverity, and then the checker could become useful. |
| Comment by Bruce Lucas (Inactive) [ 23/Dec/19 ] |
|
mira.carey@mongodb.com, I'm wondering if issues like this could be detected statically by one of our checkers - alert if a mutex is held while one of a list of potentially long-running calls involving i/o is done? I wonder if we have more undetected instances of this in our code. |
| Comment by Mira Carey [ 20/Dec/19 ] |
|
I'm fine reverting. I'd like to think harder about how we do dns, and see if I might incidentally be able to revert this revert, but I don't think I'd ever want to go out of my way for the functionality in this ticket |
| Comment by Tess Avitabile (Inactive) [ 20/Dec/19 ] |
|
mira.carey@mongodb.com, is it okay to revert the change? Alternatively, to fix it, we would need to add to the replSetGetStatus response outside of the mutex after the TopologyCoordinator returns. I think the best way to do that would be to IDL-ify the response. This could take a day or two. |
| Comment by Andy Schwerin [ 20/Dec/19 ] |
|
|
| Comment by Mira Carey [ 20/Dec/19 ] |
|
Looks like this was introduced in |