[SERVER-36597] Primary still contacts removed members and does not step down when a majority of members are down Created: 10/Aug/18 Updated: 04/Sep/18 Resolved: 13/Aug/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.6.6 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Bruce Zu | Assignee: | Nick Brewer |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Participants: |
| Description |
|
All details have been provided in https://jira.mongodb.org/browse/SERVER-36512, but this ticket focuses on two newly found issues. I am confused by Nick's feedback. If Nick's answer is right, where is the majority of members that are down? The replica set has 1 primary, 2 secondaries, and 1 arbiter, and only one secondary was unreachable when the issue was found. I investigated mongod.log further and found two more issues:
Issue 1: the primary still contacts 3 removed members. Restarting the mongod service fixes this.
Issue 2: if the primary still thinks the replica set has 6 data-bearing members and 1 arbiter, with 4 of the data-bearing members down, it should have stepped down, but it did not.
All members are now running 3.6.6.
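The step-down question in issue 2 comes down to quorum arithmetic. The sketch below is a simplified model, not MongoDB's implementation: it assumes every listed member has one vote, that arbiters count toward the election quorum, and that a primary steps down when it cannot see a strict majority of voting members. `ReplSetView` and its fields are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplSetView:
    """Hypothetical, simplified model of a replica set as seen by the primary.

    Not a MongoDB API. Assumes every member listed here has one vote.
    """
    data_bearing: int        # voting data-bearing members, including the primary
    arbiters: int            # voting arbiters
    data_bearing_down: int   # data-bearing members the primary cannot reach
    arbiters_down: int = 0   # arbiters the primary cannot reach

    @property
    def voters(self) -> int:
        return self.data_bearing + self.arbiters

    @property
    def majority(self) -> int:
        # Strict majority of all voting members.
        return self.voters // 2 + 1

    def reachable_voters(self) -> int:
        # Arbiters vote in elections, so they count toward this quorum.
        up_data = self.data_bearing - self.data_bearing_down
        up_arbiters = self.arbiters - self.arbiters_down
        return up_data + up_arbiters

    def primary_should_step_down(self) -> bool:
        # Assumed rule: a primary that cannot see a majority of voters steps down.
        return self.reachable_voters() < self.majority


# Configuration the primary seemed to still believe in (issue 2):
# 6 data-bearing members + 1 arbiter, with 4 data-bearing members down.
stale = ReplSetView(data_bearing=6, arbiters=1, data_bearing_down=4)

# Configuration reported by rs.status(): 3 data-bearing members + 1 arbiter,
# with 1 secondary unreachable.
actual = ReplSetView(data_bearing=3, arbiters=1, data_bearing_down=1)

for name, view in (("stale", stale), ("actual", actual)):
    print(name, view.majority, view.reachable_voters(), view.primary_should_step_down())
```

Under these assumptions, the 6 + 1 view would call for a step-down (3 of 7 voters reachable, majority 4), while the 4-member configuration shown by rs.status() would not (3 of 4 voters reachable, majority 3).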
|
| Comments |
| Comment by Nick Brewer [ 13/Aug/18 ] |
|
brucezu The arbiter does not count toward the read concern majority. With one secondary down in the setup you just described, you would not be able to fulfill a majority read concern. This is not a bug.
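The arithmetic behind this answer can be sketched as follows. This is a simplified model under stated assumptions, not MongoDB's implementation: it assumes every member votes, that "majority" means a strict majority of all voting members, and that only data-bearing members can acknowledge data because arbiters hold none. The function names are hypothetical.

```python
def majority_count(voters: int) -> int:
    # Strict majority of voting members.
    return voters // 2 + 1


def majority_ack_possible(data_bearing: int, arbiters: int, data_bearing_down: int) -> bool:
    """Hypothetical, simplified check (not a MongoDB API).

    The arbiter raises the number of voters (and so the majority needed),
    but it cannot itself acknowledge data.
    """
    needed = majority_count(data_bearing + arbiters)
    available = data_bearing - data_bearing_down
    return available >= needed


# The setup from this ticket: 1 primary + 2 secondaries + 1 arbiter.
print(majority_count(3 + 1))  # 3
print(majority_ack_possible(data_bearing=3, arbiters=1, data_bearing_down=0))  # True
print(majority_ack_possible(data_bearing=3, arbiters=1, data_bearing_down=1))  # False
```

With 4 voting members the majority is 3, but only 2 data-bearing members remain once one secondary is down, so under this model the majority commit point cannot advance even though 3 of 4 voters are still up.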
As I mentioned previously, we have seen this behavior before; however, it does not mean that the node the primary is attempting to connect to is still considered a member of the replica set. This is unrelated to your issue of not being able to fulfill majority read concern, and I've linked the current SERVER tickets tracking improvements in this area. -Nick |
| Comment by Bruce Zu [ 13/Aug/18 ] |
|
Hi Nick,
The output of rs.status() shows 4 members in the replica set:
secondary: 172.31.54.204 (primary when the issue happened on Aug 6)
arbiter: ip-172-31-5-208 (was running 3.4.7 when the issue happened on Aug 6)
This can also be traced in mongod.log. The point here is that only one of the three data-bearing members (1/3) was unreachable. A majority of the data-bearing members were still available, but the lookaside table started to grow.
In short, I think there is a bug: the lookaside table starts to grow even while a majority of the data-bearing members are still available.
By the way, testing shows that when a member is removed from the replica set, the removed member still tries to connect to the primary, and the primary accepts the connection from the removed member, but the primary never initiates a connection to the removed member. In my case, however, the primary actively connected to the removed members. This is not expected and should be a bug. After restarting the mongod service, the issue disappeared.
|
| Comment by Nick Brewer [ 13/Aug/18 ] |
|
The primary may have continued to attempt to reach out to the removed nodes, but this does not mean that they were still a part of the replica set. The output of rs.status() should confirm this. The behavior you're describing has been seen in the past, and there are a number of tickets currently tracking work that is going into improving the replica set node removal process: -Nick |