[SERVER-20144] lastHeartbeatMessage says "could not find member to sync from" when set is healthy Created: 26/Aug/15 Updated: 06/Dec/22 Resolved: 11/Sep/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Trivial - P5 |
| Reporter: | Daniel Pasette (Inactive) | Assignee: | Backlog - Replication Team |
| Resolution: | Done | Votes: | 11 |
| Labels: | neweng |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Steps To Reproduce: | Start a two node replset (default params) |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Unfortunately this doesn't happen deterministically, but it is easy to trigger. It appears that the lastHeartbeatMessage is simply not cleared. The message usually clears once documents are inserted, so I'm marking this as a trivial issue.
Attaching log files from primary and secondary |
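A minimal reproduction sketch in the mongo shell, assuming a fresh two-member set with no application writes yet; the hostnames, ports, and database/collection names below are placeholders, not taken from the attached logs:

    // Initiate a two-member replica set (ports are illustrative).
    rs.initiate({_id: "rs0", members: [
        {_id: 0, host: "localhost:27017"},
        {_id: 1, host: "localhost:27018"}
    ]});

    // Once the set is otherwise healthy, check the heartbeat messages.
    rs.status().members.forEach(function (m) {
        print(m.name + ": " + (m.lastHeartbeatMessage || "<empty>"));
    });
    // The secondary typically shows "could not find member to sync from".

    // A single write on the primary advances its optime; the message then
    // clears on a later rs.status() once the secondary picks a sync source.
    db.getSiblingDB("test").repro.insert({x: 1});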
| Comments |
| Comment by hydrapolic [ 07/Jul/17 ] |
|
Yes, it's a bit misleading when you have this in the logs and see it in rs.status(). |
| Comment by Mike Zraly [ 14/Feb/17 ] |
|
An interesting use case for this message: in our AWS scalability test environment, we restored a replica set using the same EBS snapshot for all three replica set members. Naturally the optime is the same on all three members, so all three come up as RECOVERING and complain they could not find a member to sync from. I am attempting to work around this by restoring one member from a snapshot taken 20 minutes later, but it seems like restoring all replica set members from a common snapshot would be a reasonable use case. If anyone can suggest a different workaround I'd love to hear it. |
| Comment by luigi fratini [ 27/Sep/16 ] |
|
Hi, I have the same question. My MongoDB version is 3.2.9-1, installed on CentOS 7.1 x86_64. This is the output of the secondary node: |
| Comment by pravin dwiwedi [ 11/Jul/16 ] |
|
Actually this is a misleading message and it is creating problems in our monitoring tool. |
| Comment by Zhang Youdong [ 11/Mar/16 ] |
|
After reading the source code and doing some testing, the logic is all correct; this is a by-design problem. When a secondary chooses a sync source, it will only pick a node whose oplog is newer than (not equal to) its own. Right after startup, when all nodes hold the same data, the oplogs are identical, so the secondary cannot choose a sync source. Once a write operation happens, the primary has a newer oplog, the secondary can successfully choose a target to sync from, and the error message disappears. MongoDB 3.0 has the same problem, but because of a bug in 3.0 the problem is hidden.

std::string TopologyCoordinatorImpl::_getHbmsg(Date_t now) const {
    // The stored heartbeat message is returned unchanged.
    return _hbmsg;
}
|
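The behaviour described in this comment can be observed from the mongo shell; this is only a sketch of how to inspect it, since the strict "newer than self" comparison itself lives in the server's sync-source selection code:

    // Print each member's optime next to its lastHeartbeatMessage. When every
    // optime is identical, no member is strictly newer than the secondary, so
    // no sync source can be chosen and the message lingers until a write lands.
    rs.status().members.forEach(function (m) {
        print(m.name + "  optime=" + tojson(m.optime) +
              "  msg=" + (m.lastHeartbeatMessage || ""));
    });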