[SERVER-46897] REMOVED node may never send heartbeat to fetch newest config Created: 16/Mar/20 Updated: 29/Oct/23 Resolved: 24/Mar/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 4.2.4, 4.4.0-rc0 |
| Fix Version/s: | 4.4.0-rc0, 4.0.20, 4.2.8, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | William Schultz (Inactive) | Assignee: | William Schultz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | safe-reconfig-related |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4, v4.2, v4.0, v3.6 |
| Steps To Reproduce: |
|
| Sprint: | Repl 2020-04-06 |
| Participants: | |
| Linked BF Score: | 36 |
| Description |
|
When a replica set node installs a config that it is not a member of, it enters the REMOVED state. While in the REMOVED state, it records in a seed list every other node that sends it a heartbeat request. If a node n1 is currently REMOVED and receives a heartbeat from node n0, it adds n0 to its seed list. If n1 then learns of a newer config that it is still not a member of, it installs this config and cancels its outgoing heartbeats. It does not reschedule any heartbeats, though, since it is still REMOVED in its current config. It also does not clear its seed list, since that only happens when heartbeats are restarted. At this point n1 is REMOVED, its seed list contains n0, and it is not heartbeating any other node. If n0 then executes a reconfig that adds n1 back into the set, n1 will never learn of it, because n1 only schedules a heartbeat to fetch a config when its seed list changes, and n0 is already in the seed list. n1 will remain on a stale config indefinitely. To fix this issue, we may want to clear the seed list any time a node installs a new config that it is not a member of. |
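The sequence above can be sketched as a small toy model. This is not MongoDB code: the class `RemovedNodeModel`, its methods, and the `clear_seeds_on_install` flag are hypothetical names invented for illustration, and the model only captures the seed-list bookkeeping described in this ticket. The flag stands in for the fix proposed in the description, which may differ from what was actually committed.

```python
class RemovedNodeModel:
    """Toy model of seed-list bookkeeping on a node (hypothetical, not MongoDB code)."""

    def __init__(self, name, clear_seeds_on_install=False):
        self.name = name
        # Assumption: this flag models the fix proposed in the description.
        self.clear_seeds_on_install = clear_seeds_on_install
        self.config_version = 0
        self.removed = True
        self.seed_list = set()
        self.scheduled_heartbeats = []  # targets of pending outgoing heartbeats

    def receive_heartbeat_request(self, sender):
        """While REMOVED, schedule a config fetch only if the seed list changes."""
        if not self.removed:
            return
        if sender not in self.seed_list:
            self.seed_list.add(sender)
            self.scheduled_heartbeats.append(sender)

    def install_config(self, version, members):
        """Install a config; outgoing heartbeats are always cancelled first."""
        self.config_version = version
        self.removed = self.name not in members
        self.scheduled_heartbeats.clear()
        if self.removed:
            # Still REMOVED: heartbeats are not rescheduled. Without the
            # proposed fix, the seed list is left untouched here.
            if self.clear_seeds_on_install:
                self.seed_list.clear()
        else:
            # Restarting heartbeats is the only path that clears the seed list.
            self.seed_list.clear()
            self.scheduled_heartbeats = sorted(m for m in members if m != self.name)


def run_scenario(clear_seeds_on_install):
    """Return True if n1 schedules a config fetch after n0 re-adds it."""
    n1 = RemovedNodeModel("n1", clear_seeds_on_install)
    n1.receive_heartbeat_request("n0")            # n0 enters the seed list
    n1.install_config(version=2, members={"n0"})  # newer config, n1 still REMOVED
    n1.receive_heartbeat_request("n0")            # n0 has re-added n1 and heartbeats it
    return bool(n1.scheduled_heartbeats)
```

In this model, `run_scenario(False)` leaves n1 with no scheduled heartbeat because n0 is already in the seed list, while `run_scenario(True)` lets the second heartbeat from n0 change the seed list again and trigger a fetch.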
| Comments |
| Comment by Githook User [ 29/Jun/20 ] | ||
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'} Message: (cherry picked from commit c2b282a1ba5f4f87d59456912073398a504281dc) | ||
| Comment by Githook User [ 27/May/20 ] | ||
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'} Message: (cherry picked from commit c2b282a1ba5f4f87d59456912073398a504281dc) | ||
| Comment by Githook User [ 26/Mar/20 ] | ||
|
Author: {'name': 'William Schultz', 'username': 'will62794', 'email': 'william.schultz@mongodb.com'} Message: (cherry picked from commit c2b282a1ba5f4f87d59456912073398a504281dc) | ||
| Comment by William Schultz (Inactive) [ 25/Mar/20 ] | ||
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}Message: | ||
| Comment by William Schultz (Inactive) [ 16/Mar/20 ] | ||
|
Note that a REMOVED node normally shouldn't learn of a newer config in which it is still REMOVED, since an upstream node that installs a newer config won't be heartbeating a REMOVED node. However, when a node executes a reconfig from Ci to Cj, where some node is REMOVED in Ci but not in Cj, the reconfiguring node executes a quorum check that sends a heartbeat to the REMOVED node. The REMOVED node may then schedule a heartbeat to fetch Ci from the upstream node and install it. This is one case where a node that is already REMOVED fetches and installs a newer config in which it is still REMOVED, and it is what happens in the attached repro. More specifically, if we move through the following configs:
then n1 may still be on C3 when n0 executes the quorum check for the reconfig from C4 to C5. The heartbeat sent from n0 to n1 during this quorum check will prompt n1 to fetch C4 and install it, even though it's still REMOVED in C4. | ||
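The quorum-check case can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Config` and `Node` types, the `quorum_check` helper, and in particular the membership of C3, C4, and C5 (the ticket only implies that n1 is REMOVED in C3 and C4 and a member in C5). Fetching is collapsed into a direct assignment, which is a simplification of the heartbeat exchange.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    version: int
    members: frozenset


@dataclass
class Node:
    name: str
    config: Config

    @property
    def removed(self):
        # A node is REMOVED when it is not a member of its installed config.
        return self.name not in self.config.members


def quorum_check(initiator, proposed, nodes):
    """Heartbeat every other member of the proposed config (toy model).

    The heartbeat carries the initiator's currently installed config, not the
    proposed one, so a target on an older config fetches and installs that.
    """
    installed = []
    for name in sorted(proposed.members - {initiator.name}):
        target = nodes[name]
        if initiator.config.version > target.config.version:
            target.config = initiator.config  # simplification of fetch + install
            installed.append(name)
    return installed


# Hypothetical memberships: n1 is absent from C3 and C4 and is re-added in C5.
C3 = Config(3, frozenset({"n0"}))
C4 = Config(4, frozenset({"n0"}))
C5 = Config(5, frozenset({"n0", "n1"}))


def run_quorum_check_scenario():
    n0 = Node("n0", C4)  # n0 is about to reconfig from C4 to C5
    n1 = Node("n1", C3)  # n1 is still on C3, where it is REMOVED
    was_removed = n1.removed
    quorum_check(n0, C5, {"n0": n0, "n1": n1})
    return was_removed, n1.config.version, n1.removed
```

Under these assumptions, n1 starts out REMOVED on C3, installs C4 as a result of the quorum-check heartbeat, and is still REMOVED afterwards, which is the precondition for the stale seed list problem in the description.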
| Comment by William Schultz (Inactive) [ 16/Mar/20 ] | ||
|
I tested this on 4.2 and it appears to also be an issue there. I have not tested on any earlier version but I expect this issue goes back at least a few versions. |