[SERVER-37412] Decrease connection pooling and replication heartbeat default log verbosity Created: 01/Oct/18  Updated: 08/Jan/24  Resolved: 20/Feb/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.9

Type: Bug Priority: Major - P3
Reporter: Eric Milkie Assignee: Benjamin Caimano (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2018-10-22, Repl 2018-11-05, Repl 2018-11-19, Service Arch 2018-12-03, Service Arch 2018-12-17, Service Arch 2018-12-31, Service Arch 2019-01-14, Service Arch 2019-01-28, Service Arch 2019-02-11, Service Arch 2019-02-25
Participants:

 Description   

If you happen to have a node in your config that is down, the system logs on all the other nodes each fill with 12 log messages every half second (24 messages per second). This seems excessive.

Example of all the messages logged within a 3 ms window:

2018-10-01T13:45:16.770-0400 I ASIO     [Replication] Connecting to localhost:27019
2018-10-01T13:45:16.770-0400 I ASIO     [Replication] Failed to connect to localhost:27019 - HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.770-0400 I ASIO     [Replication] Dropping all pooled connections to localhost:27019 due to HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.770-0400 I REPL_HB  [replexec-3] Error in heartbeat (requestId: 225) to localhost:27019, response status: HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO     [Replication] Connecting to localhost:27019
2018-10-01T13:45:16.771-0400 I ASIO     [Replication] Failed to connect to localhost:27019 - HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO     [Replication] Dropping all pooled connections to localhost:27019 due to HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I REPL_HB  [replexec-3] Error in heartbeat (requestId: 226) to localhost:27019, response status: HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO     [Replication] Connecting to localhost:27019
2018-10-01T13:45:16.771-0400 I ASIO     [Replication] Failed to connect to localhost:27019 - HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO     [Replication] Dropping all pooled connections to localhost:27019 due to HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.772-0400 I REPL_HB  [replexec-3] Error in heartbeat (requestId: 227) to localhost:27019, response status: HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused

I wonder if the volume of ASIO messages could be reduced in this situation.



 Comments   
Comment by Githook User [ 20/Feb/19 ]

Author: Ben Caimano <ben.caimano@10gen.com>

Message: SERVER-37412 Added LogSeverityLimiter for timed logging
Branch: master
https://github.com/mongodb/mongo/commit/6cdb28ab8df5dff06be82b4c46971fe5298c6f46
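For readers skimming the ticket, below is a minimal sketch of what a time-based severity limiter along the lines of this commit message could look like. The names here (TimedSeverityLimiter, Severity::Info/Debug) are hypothetical and do not mirror the real LogSeverityLimiter; the idea is simply to emit at the loud level at most once per period and downgrade everything in between.

#include <chrono>
#include <iostream>
#include <mutex>

// Illustrative sketch only: return Info at most once per period and
// Debug otherwise. Not the actual MongoDB implementation.
enum class Severity { Debug, Info };

class TimedSeverityLimiter {
public:
    explicit TimedSeverityLimiter(std::chrono::milliseconds period)
        : _period(period) {}

    // Returns Info if the period has elapsed since the last Info-level
    // emission, otherwise Debug.
    Severity next() {
        std::lock_guard<std::mutex> lk(_mutex);
        auto now = std::chrono::steady_clock::now();
        if (now - _lastInfo >= _period) {
            _lastInfo = now;
            return Severity::Info;
        }
        return Severity::Debug;
    }

private:
    std::mutex _mutex;
    std::chrono::milliseconds _period;
    std::chrono::steady_clock::time_point _lastInfo{};
};

int main() {
    TimedSeverityLimiter limiter(std::chrono::seconds(2));
    for (int i = 0; i < 5; ++i) {
        bool loud = limiter.next() == Severity::Info;
        std::cout << (loud ? "I " : "D ") << "Failed to connect to host\n";
    }
}

With a 2-second period, a connection failure that retries every few milliseconds would produce one INFO line per period instead of dozens, which is the effect the ticket asks for.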

Comment by Githook User [ 20/Feb/19 ]

Author: Ben Caimano <ben.caimano@10gen.com>

Message: SERVER-37412 Decrease replication heartbeat log verbosity
Branch: master
https://github.com/mongodb/mongo/commit/601ed1b88afe54f79e39c298cd2c578795bfc17b
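The second commit lowers the default verbosity of the per-attempt heartbeat messages. As a rough, generic illustration (not the real MongoDB logging macros), gating a message behind a verbosity level so it is hidden by default but visible when the operator raises the level might look like this:

#include <iostream>

static int gLogVerbosity = 0;  // assumed default: quiet

// Hypothetical macro for the sketch: print only if the message's level
// is at or below the configured verbosity.
#define LOG_AT(level, msg)                                            \
    do {                                                              \
        if ((level) <= gLogVerbosity)                                 \
            std::cout << "D" << (level) << " " << (msg) << "\n";      \
    } while (0)

int main() {
    // At default verbosity the repeated heartbeat error is suppressed...
    LOG_AT(1, "Error in heartbeat to localhost:27019: HostUnreachable");
    // ...but becomes visible once the operator raises the level.
    gLogVerbosity = 2;
    LOG_AT(1, "Error in heartbeat to localhost:27019: HostUnreachable");
}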

Comment by Benjamin Caimano (Inactive) [ 15/Jan/19 ]

Mildly stalled in code review. My intention is to generate the widget mira.carey@mongodb.com requested in the current sprint.

Comment by Gregory McKeon (Inactive) [ 15/Jan/19 ]

ben.caimano, is this active?

Comment by Benjamin Caimano (Inactive) [ 14/Nov/18 ]

> Can you help me understand what that special case is doing?

If we get an error code back from a network call, we dump all connections to that URI in that pool. Since we allow the user to set the time limit, this particular case isn't necessarily indicative of a failure from the remote host. Thus we skip dumping the pool and just replace our single connection object.

> Since the default heartbeat interval is 2 seconds, we were concerned that we may send heartbeats too frequently to nodes that are down if the network interface is automatically retrying them. Do you know if that's the case, or how I would find that out?

So the heartbeat stuff is a bit spaghetti. You can definitely find out when it is sending out heartbeats just by upping the verbosity log level to 2 (sends here, receives here). One way or another, your command response will start here. A lot of the nitty-gritty of the actual executor will flood the log at verbosity level 3, if that helps. That said, since the pool drop is immediately followed by the heartbeat error, I believe you're always ending up back here, where you schedule again.

I don't believe the network interface should be retrying them. Still, the networking layer is sometimes surprisingly byzantine, so perhaps I've missed an edge case. The millisecond-scale delay worries me a bit; could it be the natural delay of an attempt to immediately reschedule the heartbeat?
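For illustration, the drop-all-versus-replace-one behavior described in the comment above might look roughly like the sketch below. All names here (HypotheticalPool, ErrorCode, onError) are made up for the example and are not the real ConnectionPool API.

#include <map>
#include <string>
#include <vector>

enum class ErrorCode { HostUnreachable, ExceededTimeLimit, Ok };

struct Connection { std::string host; };

// Illustrative pool: on most network errors, flush every pooled connection
// to the host (which is what produces the "Dropping all pooled connections"
// line); for a caller-imposed time limit, only replace the failed connection,
// since the remote host may still be healthy.
class HypotheticalPool {
public:
    void onError(const std::string& host, ErrorCode code) {
        if (code == ErrorCode::ExceededTimeLimit) {
            // The caller's deadline expired; not evidence of a bad host.
            replaceOne(host);
            return;
        }
        // Treat everything else as a host-level failure and flush the pool.
        _pools[host].clear();
    }

private:
    void replaceOne(const std::string& host) {
        auto& conns = _pools[host];
        if (!conns.empty())
            conns.pop_back();
        conns.push_back(Connection{host});
    }

    std::map<std::string, std::vector<Connection>> _pools;
};

int main() {
    HypotheticalPool pool;
    pool.onError("localhost:27019", ErrorCode::HostUnreachable);    // flush all
    pool.onError("localhost:27019", ErrorCode::ExceededTimeLimit);  // replace one
}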

Comment by Tess Avitabile (Inactive) [ 14/Nov/18 ]

ben.caimano, can you help me understand what that special case is doing? Since the default heartbeat interval is 2 seconds, we were concerned that we might be sending heartbeats too frequently to nodes that are down if the network interface is automatically retrying them. Do you know if that's the case, or how I would find that out? Heartbeats are scheduled using the ThreadPoolTaskExecutor.
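As a rough illustration of the expected cadence (not the actual ThreadPoolTaskExecutor API), a heartbeat loop scheduled against a fixed interval should only retry a down node once per interval; back-to-back attempts milliseconds apart, as in the log excerpt, would instead suggest an immediate reschedule. The loop and its parameters below are assumptions for the sketch only.

#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;

// Fire the heartbeat on a fixed cadence regardless of success or failure;
// an immediate reschedule on failure is what would reproduce the log flood.
void runHeartbeatLoop(std::function<bool()> sendHeartbeat,
                      std::chrono::milliseconds interval,
                      int attempts) {
    auto next = Clock::now();
    for (int i = 0; i < attempts; ++i) {
        std::this_thread::sleep_until(next);
        bool ok = sendHeartbeat();
        std::cout << "heartbeat " << i << (ok ? " ok" : " failed") << "\n";
        next += interval;  // schedule relative to the interval, not to "now"
    }
}

int main() {
    runHeartbeatLoop([] { return false; },            // target always unreachable
                     std::chrono::milliseconds(200),  // shortened for the demo
                     3);
}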

Comment by Benjamin Caimano (Inactive) [ 19/Oct/18 ]

tess.avitabile, I think I may have related code from the planned maintenance work. HostUnreachable in the ConnPool is usually either very temporary or very permanent. It probably needs a separate case like here.

Comment by Craig Homa [ 01/Oct/18 ]

Please investigate whether the excessive logging is due to sending heartbeats too frequently. 
