[SERVER-36597] primary still contacts removed members and no stepdown when a majority of members are down Created: 10/Aug/18  Updated: 04/Sep/18  Resolved: 13/Aug/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.6
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Bruce Zu Assignee: Nick Brewer
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

All details have been provided in https://jira.mongodb.org/browse/SERVER-36512, but this ticket focuses on 2 newly found issues.

I am confused by Nick's feedback. If Nick's answer is right, then where is the majority of members that is down? The replset has 1 primary, 2 secondaries, and 1 arbiter, and only one secondary was unreachable when the issue was found.

I investigated mongod.log further and found another 2 issues:

  issue: the primary still contacts the 3 removed members; fixed by restarting the mongod service

  issue: the primary did not step down even though it still thinks the replset has 6 data-bearing members and 1 arbiter, and 4 of the data-bearing members are down.
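
The second issue rests on a majority calculation. As a rough sketch (plain Python, not server code; the function names are hypothetical, and every member is assumed to hold exactly one vote), a primary is expected to step down once it can no longer see a strict majority of the voting members in its configuration. With 7 voting members (6 data-bearing plus 1 arbiter) and 4 of them down, only 3 remain visible, which is below the required 4:

```python
def voting_majority(voting_members: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return voting_members // 2 + 1


def primary_sees_majority(voting_members: int, unreachable: int) -> bool:
    """True if the reachable voting members (the primary included)
    still form a majority of the configured voting members."""
    reachable = voting_members - unreachable
    return reachable >= voting_majority(voting_members)


# Hypothesis from this ticket: 6 data-bearing members + 1 arbiter, 4 members down.
print(voting_majority(7))            # 4
print(primary_sees_majority(7, 4))   # False
# Configuration with 3 data-bearing members + 1 arbiter, 1 member down.
print(primary_sees_majority(4, 1))   # True
```

Under these assumptions, a primary that stays primary with one secondary down is consistent with a 4-member configuration rather than a 7-member one.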

 
secondary : 172.31.54.204 (primary when the issue happened on Aug 6)
arbiter : ip-172-31-5-208 (was 3.4.7 when the issue happened on Aug 6)
old member 3.4.7 : ip-172-31-12-59 (removed from replset July 23)
old member 3.4.7 : ip-172-31-20-52 (removed from replset July 23)
old member 3.4.7 : ip-172-31-46-24 (removed from replset July 23)
secondary : ip-172-31-66-130 (unreachable when the issue happened on Aug 6)
primary : ip-172-31-82-157 (secondary when the issue happened on Aug 6)
secondary : ip-172-31-67-188 (added after the issue happened)

All members are on 3.6.6 now.

 

 



 Comments   
Comment by Nick Brewer [ 13/Aug/18 ]

brucezu The arbiter does not count toward the read concern majority. With one secondary down in the setup you just described, you would not be able to fulfill a majority read concern. This is not a bug. 
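
The arithmetic behind this can be sketched as follows. This is an illustration only, not server code: the function name is hypothetical, and it assumes that every member holds one vote, that the required count is a majority of all voting members (arbiter included), and that only data-bearing members can acknowledge.

```python
def majority_commit_possible(data_bearing: int, arbiters: int,
                             data_bearing_down: int) -> bool:
    """Can a majority read/write concern be satisfied?

    Assumption: the required count is a majority of all voting members,
    but arbiters hold no data, so only reachable data-bearing members
    can contribute acknowledgements.
    """
    voting = data_bearing + arbiters
    required = voting // 2 + 1
    able_to_ack = data_bearing - data_bearing_down
    return able_to_ack >= required


# 1 primary + 2 secondaries + 1 arbiter, all data-bearing members up.
print(majority_commit_possible(3, 1, 0))  # True  (3 available, 3 required)
# Same set with one secondary unreachable.
print(majority_commit_possible(3, 1, 1))  # False (2 available, 3 required)
```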

But in my case, the primary member actively connects to the removed members. This is not expected; it should be a bug. After restarting the mongod service, this issue disappears.

As I mentioned previously, we have seen this behavior in the past - however it does not mean that the node that the primary is attempting to connect to is still considered to be a member of the replica set. This is unrelated to your issue of not being able to fulfill a read concern majority, and I've linked to the current SERVER tickets that we have tracking improvements in this area. 

-Nick

Comment by Bruce Zu [ 13/Aug/18 ]

Hi Nick

The output of rs.status() shows 4 members in the replset:

secondary : 172.31.54.204 (primary when the issue happened on Aug 6)

arbiter : ip-172-31-5-208 (was 3.4.7 when the issue happened on Aug 6)
secondary : ip-172-31-66-130 (unreachable when the issue happened on Aug 6)
primary : ip-172-31-82-157 (secondary when the issue happened on Aug 6)

This can also be tracked from mongod.log.

The point here is that only one of them, 1 of the 3 data-bearing members, was unreachable.

A majority of the data-bearing members was still available,

but the lookaside table started to grow.

 

In short, I think there is a bug: the lookaside table starts to grow even though a majority of the data-bearing members is still available.

 

By the way, testing shows that when a member is removed from the replset, the removed member still tries to connect to the primary, and the primary accepts the connection from the removed member. But the primary never connects to the removed member.

But in my case, the primary member actively connects to the removed members. This is not expected; it should be a bug. After restarting the mongod service, this issue disappears.

 

 

Comment by Nick Brewer [ 13/Aug/18 ]

 The primary may have continued to attempt to reach out to the removed nodes, but this does not mean that they were still a part of the replica set. The output of rs.status() should confirm this. The behavior you're describing has been seen in the past, and there are a number of tickets currently tracking work that is going into improving the replica set node removal process:

SERVER-36415

SERVER-36416

SERVER-36417

-Nick
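
One way to make the rs.status() check concrete is to compare the member list returned by `replSetGetStatus` (the command behind rs.status()) with the hosts believed to be removed. The following is only a sketch: the helper names are hypothetical, the sample document is hand-written and trimmed to the `name` and `stateStr` fields, and the `:27017` port is an assumption, since the ticket does not state ports.

```python
def configured_members(status: dict) -> dict:
    """Map each member's host:port to its state from a replSetGetStatus reply."""
    return {m["name"]: m.get("stateStr", "UNKNOWN")
            for m in status.get("members", [])}


def still_configured(status: dict, removed_hosts: list) -> list:
    """Return the supposedly removed hosts that still appear in the member list."""
    members = configured_members(status)
    return [host for host in removed_hosts if host in members]


def fetch_status(uri: str) -> dict:
    """Fetch live status; needs pymongo and a reachable replica set member."""
    from pymongo import MongoClient  # lazy import so the helpers above run without it
    return MongoClient(uri).admin.command("replSetGetStatus")


# Hand-written sample in the shape of a replSetGetStatus reply (not real output).
sample = {"members": [
    {"name": "172.31.54.204:27017", "stateStr": "SECONDARY"},
    {"name": "ip-172-31-5-208:27017", "stateStr": "ARBITER"},
    {"name": "ip-172-31-66-130:27017", "stateStr": "SECONDARY"},
    {"name": "ip-172-31-82-157:27017", "stateStr": "PRIMARY"},
]}
removed = [
    "ip-172-31-12-59:27017",
    "ip-172-31-20-52:27017",
    "ip-172-31-46-24:27017",
]
print(still_configured(sample, removed))  # []
```

An empty result means none of the removed hosts is part of the reported configuration, even if the log still shows connection attempts to them.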
