[SERVER-16148] Replica logs excessively when other member is down Created: 14/Nov/14  Updated: 17/Nov/15  Resolved: 12/Jan/15

Status: Closed
Project: Core Server
Component/s: Logging, Replication
Affects Version/s: 2.8.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Scott Hernandez (Inactive) Assignee: Andy Schwerin
Resolution: Won't Fix Votes: 0
Labels: healthcheck, heartbeat
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-21504 Lower log verbosity at level 0 when r... Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

During heartbeat health checking, when a member is down, logging is excessive in some cases:

2014-11-14T08:38:47.758-0500 W NETWORK  [ReplExecNetThread-4] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:48.217-0500 I REPLSETS [ReplicationExecutor] could not find member to sync from
2014-11-14T08:38:49.220-0500 I REPLSETS [ReplicationExecutor] could not find member to sync from
2014-11-14T08:38:49.763-0500 W NETWORK  [ReplExecNetThread-5] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:49.764-0500 W NETWORK  [ReplExecNetThread-6] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:49.764-0500 W NETWORK  [ReplExecNetThread-7] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:50.224-0500 I REPLSETS [ReplicationExecutor] could not find member to sync from
2014-11-14T08:38:51.229-0500 I REPLSETS [ReplicationExecutor] could not find member to sync from
2014-11-14T08:38:51.769-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:51.770-0500 W NETWORK  [ReplExecNetThread-1] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:51.770-0500 W NETWORK  [ReplExecNetThread-2] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:52.233-0500 I REPLSETS [ReplicationExecutor] could not find member to sync from
2014-11-14T08:38:53.236-0500 I REPLSETS [ReplicationExecutor] could not find member to sync from
2014-11-14T08:38:53.773-0500 W NETWORK  [ReplExecNetThread-3] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:53.774-0500 W NETWORK  [ReplExecNetThread-4] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:53.774-0500 W NETWORK  [ReplExecNetThread-5] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:54.241-0500 I REPLSETS [ReplicationExecutor] could not find member to sync from
2014-11-14T08:38:55.245-0500 I REPLSETS [ReplicationExecutor] could not find member to sync from
2014-11-14T08:38:55.779-0500 W NETWORK  [ReplExecNetThread-6] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:55.780-0500 W NETWORK  [ReplExecNetThread-7] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:55.780-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:56.249-0500 I REPLSETS [ReplicationExecutor] could not find member to sync from
2014-11-14T08:38:57.253-0500 I REPLSETS [ReplicationExecutor] could not find member to sync from
2014-11-14T08:38:57.785-0500 W NETWORK  [ReplExecNetThread-1] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:57.785-0500 W NETWORK  [ReplExecNetThread-2] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused
2014-11-14T08:38:57.786-0500 W NETWORK  [ReplExecNetThread-3] Failed to connect to 127.0.0.1:27018, reason: errno:61 Connection refused



 Comments   
Comment by Daniel Pasette (Inactive) [ 12/Jan/15 ]

This is a problem in the network layer. It will be addressed in the next dev cycle.

Comment by Scott Hernandez (Inactive) [ 02/Jan/15 ]

As of master (right before RC5):

m31000| 2015-01-02T10:28:54.535-0500 I NETWORK  [conn11] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.4.121.226:62073] 
m31000| 2015-01-02T10:28:57.152-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:28:57.152-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:29:02.158-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:29:02.158-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:29:09.165-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:29:09.165-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:29:14.169-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:29:14.169-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:29:21.176-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:29:21.176-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:29:26.177-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:29:26.177-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:29:33.183-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:29:33.183-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:29:38.189-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:29:38.189-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:29:45.197-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:29:45.197-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:29:50.198-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:29:50.198-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:29:57.208-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:29:57.209-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:30:02.210-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:30:02.211-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:30:09.218-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:30:09.218-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:30:14.224-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:30:14.224-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:30:21.229-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.
m31000| 2015-01-02T10:30:21.229-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to pans.local:31001; Location18915 Failed attempt to connect to pans.local:31001; couldn't connect to server pans.local:31001 (10.4.121.226), connection attempt failed
m31000| 2015-01-02T10:30:26.230-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 10.4.121.226:31001 after 5000 milliseconds, giving up.

This seems to print 2 lines every 5-7 seconds for each member that is down. I believe we should reduce this much further, by 6-100x; even logging every 2 minutes seems like more than anyone would want. With one host down we now log more than 15,000 lines per day (over 140 an hour), and there is no new or useful information there, just repeated errors. In this case it would be useful, for example, to keep track of the last failure cause and only log when it changes.
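The suggestion above (log immediately when the failure cause changes, otherwise only an occasional periodic reminder) could be sketched like this. This is a hypothetical illustration in shell-style JavaScript, not an existing server API; `makeSuppressedLogger` and its parameters are made up for the example:

```javascript
// Hypothetical sketch: suppress repeated identical failure messages.
// Log right away when the cause changes; otherwise emit at most one
// reminder per `intervalMs`, noting how many lines were suppressed.
function makeSuppressedLogger(intervalMs, emit) {
  var lastCause = null;
  var lastLoggedAt = -Infinity;
  var suppressed = 0;
  return function log(cause, nowMs) {
    if (cause !== lastCause || nowMs - lastLoggedAt >= intervalMs) {
      var suffix = suppressed > 0
        ? " (" + suppressed + " similar messages suppressed)"
        : "";
      emit(cause + suffix);
      lastCause = cause;
      lastLoggedAt = nowMs;
      suppressed = 0;
    } else {
      suppressed++;
    }
  };
}

// Example: the same "Connection refused" cause every 5 seconds for 2 minutes.
var lines = [];
var log = makeSuppressedLogger(120 * 1000, function (msg) { lines.push(msg); });
for (var t = 0; t <= 120 * 1000; t += 5000) {
  log("Connection refused", t);
}
// Emits the first occurrence plus one 2-minute reminder,
// instead of 25 identical lines.
```

With one member down this would cut the steady-state volume from a line every few seconds to one line per reminder interval, while still surfacing any change in the failure cause immediately.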

Just ran this simple test:

// Start a 2-node replica set, stop one member for 90 seconds, then restart it.
var rt = new ReplSetTest({nodes: 2});
rt.startSet();
rt.initiate();
rt.stop(1);        // take the second member down
sleep(1000 * 90);  // observe logging while it is down
rt.start(1);       // bring it back

Comment by Andy Schwerin [ 02/Dec/14 ]

scotthernandez, can you update with the current behavior?

Comment by Scott Hernandez (Inactive) [ 18/Nov/14 ]

In 2.6 it is much less verbose:

2014-11-18T13:08:43.890-0500 [conn1] end connection 10.85.21.19:55383 (0 connections now open)
2014-11-18T13:08:43.892-0500 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: pans.local:27017
2014-11-18T13:08:43.892-0500 [rsBackgroundSync] replSet syncing to: pans.local:27017
2014-11-18T13:08:44.578-0500 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-18T13:08:44.578-0500 [rsHealthPoll] replset info pans.local:27017 heartbeat failed, retrying
2014-11-18T13:08:48.896-0500 [rsBackgroundSync] repl: couldn't connect to server pans.local:27017 (10.85.21.19), connection attempt failed
2014-11-18T13:08:49.585-0500 [rsHealthPoll] replSet info pans.local:27017 is down (or slow to respond): 
2014-11-18T13:08:49.585-0500 [rsHealthPoll] replSet member pans.local:27017 is now in state DOWN
2014-11-18T13:08:49.585-0500 [rsMgr] replSet can't see a majority, will not try to elect self
2014-11-18T13:08:56.595-0500 [rsHealthPoll] replset info pans.local:27017 heartbeat failed, retrying
2014-11-18T13:09:08.612-0500 [rsHealthPoll] replset info pans.local:27017 heartbeat failed, retrying
2014-11-18T13:09:20.629-0500 [rsHealthPoll] replset info pans.local:27017 heartbeat failed, retrying

Also, since I posted the example from master, I think we have removed those "could not find member to sync from" repeats.

Comment by Andy Schwerin [ 18/Nov/14 ]

scotthernandez, how does this compare to 2.6? Trying to prioritize.

Generated at Thu Feb 08 03:40:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.