[SERVER-37412] Decrease connection pooling and replication heartbeat default log verbosity Created: 01/Oct/18 Updated: 08/Jan/24 Resolved: 20/Feb/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.9 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Eric Milkie | Assignee: | Benjamin Caimano (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Repl 2018-10-22, Repl 2018-11-05, Repl 2018-11-19, Service Arch 2018-12-03, Service Arch 2018-12-17, Service Arch 2018-12-31, Service Arch 2019-01-14, Service Arch 2019-01-28, Service Arch 2019-02-11, Service Arch 2019-02-25 |
| Participants: |
| Description |
|
If you happen to have a node in your config that is down, the system log on each of the other nodes fills with 12 log messages every half second (24 messages per second). This seems excessive. Example of all the messages logged in 3 ms:
I wonder if the volume of ASIO messages could be reduced in this situation. |
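The direction the ticket title points at is to demote these messages from the default severity to a debug verbosity level. Below is a minimal, self-contained sketch of that idea; it uses a toy logger rather than the server's logging API, and the message text and host name are purely illustrative.

```cpp
#include <iostream>
#include <string>

// Toy model of verbosity-gated logging: a message at debug level N is only
// emitted when the configured verbosity is at least N. In the server this
// corresponds roughly to moving a message from the default severity to a
// debug level, so it no longer prints unless an operator raises the
// network component's log level to investigate.
struct ToyLogger {
    int configuredVerbosity = 0;  // default: only severity-0 messages print

    void logAt(int debugLevel, const std::string& message) {
        if (debugLevel <= configuredVerbosity)
            std::cout << message << '\n';
    }
};

int main() {
    ToyLogger logger;

    // Before the change: a down node produces this at level 0, so every
    // failed connection attempt lands in the log.
    logger.logAt(0, "Failed to connect to replset2.example.net:27017");

    // After the change: the same message at debug level 2 is suppressed by
    // default and only appears once verbosity is deliberately raised.
    logger.logAt(2, "Failed to connect to replset2.example.net:27017");

    logger.configuredVerbosity = 2;  // operator turns up network verbosity
    logger.logAt(2, "Failed to connect to replset2.example.net:27017");
}
```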
| Comments |
| Comment by Githook User [ 20/Feb/19 ] |
|
Author: Ben Caimano <ben.caimano@10gen.com>
Message: |
| Comment by Githook User [ 20/Feb/19 ] |
|
Author: Ben Caimano <ben.caimano@10gen.com>
Message: |
| Comment by Benjamin Caimano (Inactive) [ 15/Jan/19 ] |
|
Mildly stalled in code review. My intention is to generate the widget mira.carey@mongodb.com requested in the current sprint. |
| Comment by Gregory McKeon (Inactive) [ 15/Jan/19 ] |
|
ben.caimano, is this active? |
| Comment by Benjamin Caimano (Inactive) [ 14/Nov/18 ] |
If we get an error code back from a network call, we dump all connections to that URI in that pool. Since we allow the user to set the time limit, hitting that limit isn't necessarily indicative of a failure on the remote host, so in that particular case we skip dumping the pool and just replace our single connection object.
So the heartbeat stuff is a bit spaghetti. You can definitely find out when it is sending out heartbeats just by upping the verbosity log level to 2 (sends here, receives here). One way or another, your command response will start here. A lot of the nitty-gritty of the actual executor will flood the log at verbosity level 3, if that helps. That said, since the pool drop is immediately followed by the heartbeat error, I believe you're always ending up back here, where you schedule again. I don't believe the network interface should be retrying them. Still, the networking layer is sometimes surprisingly byzantine; perhaps I've missed an edge case. The ms delay worries me a bit; could it be the natural delay from an attempt to immediately reschedule the heartbeat? |
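A simplified sketch of the drop-versus-replace decision described in the comment above; the names (ErrorKind, HostPool, handleConnectionError) are hypothetical and only model the behavior, not the server's actual connection pool interface.

```cpp
#include <iostream>
#include <string>

// Illustrative error categories: a caller-supplied time limit expiring is
// treated differently from errors that implicate the remote host.
enum class ErrorKind { kExceededTimeLimit, kHostUnreachable, kOther };

struct HostPool {
    std::string host;

    void dropAllConnections() {
        // Close and discard every pooled connection to this host.
        std::cout << "dropping every connection to " << host << '\n';
    }
    void replaceConnection() {
        // Retire only the connection that failed and open a fresh one.
        std::cout << "replacing one connection to " << host << '\n';
    }
};

void handleConnectionError(HostPool& pool, ErrorKind kind) {
    if (kind == ErrorKind::kExceededTimeLimit) {
        // The caller's own time limit expired; that says nothing definite
        // about the health of the remote host, so keep the rest of the pool.
        pool.replaceConnection();
        return;
    }
    // Any other network error is treated as evidence the host itself is in
    // trouble, so every pooled connection to it is dumped.
    pool.dropAllConnections();
}

int main() {
    HostPool pool{"replset2.example.net:27017"};
    handleConnectionError(pool, ErrorKind::kExceededTimeLimit);  // replace one
    handleConnectionError(pool, ErrorKind::kHostUnreachable);    // drop them all
}
```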
| Comment by Tess Avitabile (Inactive) [ 14/Nov/18 ] |
|
ben.caimano, can you help me understand what that special case is doing? Since the default heartbeat interval is 2 seconds, we were concerned that we may send heartbeats too frequently to nodes that are down if the network interface is automatically retrying them. Do you know if that's the case, or how I would find that out? Heartbeats are scheduled using the ThreadPoolTaskExecutor. |
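A minimal sketch of that scheduling pattern, using a toy blocking executor in place of the real ThreadPoolTaskExecutor; the point it illustrates is that each heartbeat response handler schedules the next heartbeat one interval later, rather than the network interface retrying on its own.

```cpp
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Toy stand-in for a task executor that runs a callback at a given time
// point. The real executor runs callbacks on a thread pool instead of
// blocking the caller like this one does.
struct ToyExecutor {
    using Clock = std::chrono::steady_clock;
    void scheduleAt(Clock::time_point when, std::function<void()> work) {
        std::this_thread::sleep_until(when);
        work();
    }
};

// Hypothetical heartbeat loop: handling one heartbeat (success or failure)
// schedules the next one an interval later, so the cadence toward a down
// node is governed by this rescheduling, not by network-layer retries.
void runHeartbeats(ToyExecutor& executor,
                   std::chrono::milliseconds interval,
                   int remaining) {
    if (remaining == 0)
        return;
    executor.scheduleAt(ToyExecutor::Clock::now() + interval, [&] {
        std::cout << "sending heartbeat (" << remaining << " left)\n";
        runHeartbeats(executor, interval, remaining - 1);  // reschedule
    });
}

int main() {
    ToyExecutor executor;
    runHeartbeats(executor, std::chrono::milliseconds(200), 3);
}
```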
| Comment by Benjamin Caimano (Inactive) [ 19/Oct/18 ] |
|
tess.avitabile, I think I may have related code from the planned maintenance work. HostUnreachable in the ConnPool is usually either very temporary or very permanent. It probably needs a separate case like here. |
| Comment by Craig Homa [ 01/Oct/18 ] |
|
Please investigate whether the excessive logging is due to sending heartbeats too frequently. |