[SERVER-20144] lastHeartbeatMessage says "could not find member to sync from" when set is healthy Created: 26/Aug/15  Updated: 06/Dec/22  Resolved: 11/Sep/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Trivial - P5
Reporter: Daniel Pasette (Inactive) Assignee: Backlog - Replication Team
Resolution: Done Votes: 11
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File primary.log     Text File secondary.log    
Issue Links:
Related
Assigned Teams:
Replication
Operating System: ALL
Steps To Reproduce:

Start a two node replset (default params)
rs.initiate()
rs.add("node:30000")
rs.status()

Participants:
Case:

 Description   

Unfortunately this doesn't happen deterministically, but it is easy to trigger. It appears that the lastHeartbeatMessage is just not cleared.

The message usually clears once documents are inserted, so I'm marking this as a trivial issue.

dan:PRIMARY> rs.status()
{
	"set" : "dan",
	"date" : ISODate("2015-08-26T18:16:10.319Z"),
	"myState" : 1,
	"term" : NumberLong(0),
	"heartbeatIntervalMillis" : NumberLong(2000),
	"members" : [
		{
			"_id" : 0,
			"name" : "mortar:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 76,
			"optime" : Timestamp(1440612909, 1),
			"optimeDate" : ISODate("2015-08-26T18:15:09Z"),
			"infoMessage" : "could not find member to sync from",
			"electionTime" : Timestamp(1440612906, 2),
			"electionDate" : ISODate("2015-08-26T18:15:06Z"),
			"configVersion" : 2,
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "mortar:30000",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 61,
			"optime" : Timestamp(1440612909, 1),
			"optimeDate" : ISODate("2015-08-26T18:15:09Z"),
			"lastHeartbeat" : ISODate("2015-08-26T18:16:09.049Z"),
			"lastHeartbeatRecv" : ISODate("2015-08-26T18:16:09.056Z"),
			"pingMs" : NumberLong(1),
			"lastHeartbeatMessage" : "could not find member to sync from",
			"configVersion" : 2
		}
	],
	"ok" : 1
}

Attaching log files from primary and secondary
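
Per the note above that the message clears once documents are inserted, a minimal shell sketch of that behaviour (run against the primary; the test.heartbeat_poke namespace is just an arbitrary example, not part of the original report):

// any write advances the primary's oplog; on the next heartbeat round the
// secondary should be able to pick a sync source and the stale message clears
db.getSiblingDB("test").heartbeat_poke.insert({ ts: new Date() })

// print whatever heartbeat/info message each member currently reports
rs.status().members.forEach(function (m) {
    print(m.name + ": " + (m.lastHeartbeatMessage || m.infoMessage || "<none>"));
});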



 Comments   
Comment by hydrapolic [ 07/Jul/17 ]

Yes, it's a bit misleading when you have this in the logs and see it in rs.status().

Comment by Mike Zraly [ 14/Feb/17 ]

An interesting use case for this message: in our AWS scalability test environment, we restored a replica set using the same EBS snapshot for all three replica set members. Naturally the optime is the same on all three members, so all three come up as RECOVERING and complain they could not find a member to sync from.

I am attempting to work around this by restoring one member from a snapshot taken 20 minutes later, but it seems like restoring all replica set members from a common snapshot should be a reasonable use case.

If anyone can suggest a different workaround I'd love to hear it.
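
A quick way to confirm the condition described here (every member reporting the identical optime), using only fields already present in rs.status() output:

// list each member's name and optime; if they are all equal, no member has a
// strictly newer oplog to offer, which is exactly when this message appears
rs.status().members.forEach(function (m) {
    print(m.name + " " + tojson(m.optime));
});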

Comment by luigi fratini [ 27/Sep/16 ]

Hi,

I have the same question. My MongoDB version is 3.2.9-1, installed on CentOS 7.1 x86_64.
I've installed a replica set and the secondary node sends this intermittent message.

This is the output of the secondary node:

rs0:SECONDARY> rs.status();
{
        "set" : "rs0",
        "date" : ISODate("2016-09-27T09:46:57.377Z"),
        "myState" : 2,
        "term" : NumberLong(1),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "lcaifamndb1:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 1221,
                        "optime" : {
                                "ts" : Timestamp(1474968394, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2016-09-27T09:26:34Z"),
                        "lastHeartbeat" : ISODate("2016-09-27T09:46:53.613Z"),
                        "lastHeartbeatRecv" : ISODate("2016-09-27T09:46:56.951Z"),
                        "pingMs" : NumberLong(0),
                        "electionTime" : Timestamp(1474968357, 2),
                        "electionDate" : ISODate("2016-09-27T09:25:57Z"),
                        "configVersion" : 2
                },
                {
                        "_id" : 1,
                        "name" : "lcaifamndb2:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 1532,
                        "optime" : {
                                "ts" : Timestamp(1474968394, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2016-09-27T09:26:34Z"),
                        "infoMessage" : "could not find member to sync from",
                        "configVersion" : 2,
                        "self" : true
                }
        ],
        "ok" : 1
}

Comment by pravin dwiwedi [ 11/Jul/16 ]

Actually, this is a misleading message and it is creating problems in our monitoring tool.
The good thing is that there is no operational impact at all.
Eagerly waiting for a fix.

Comment by Zhang Youdong [ 11/Mar/16 ]

After reading the source code and doing some testing, I found the logic is all correct; it's a by-design problem.

When a secondary chooses a sync source, it will only pick a node whose oplog is newer than (not equal to) its own. Right after startup, when all nodes have the same data, the oplogs are identical, so the secondary cannot choose a sync source. Once a write operation happens, the primary has a newer oplog, the secondary can successfully choose a target to sync from, and the error message disappears.

MongoDB 3.0 has the same problem, but because 3.0 has a bug, the problem is hidden.

std::string TopologyCoordinatorImpl::_getHbmsg(Date_t now) const {
    // ignore messages over 2 minutes old
    if ((now - _hbmsgTime) > 120) {  // here 120 is milliseconds, not seconds
        return "";
    }
    return _hbmsg;
}
