[SERVER-33747] Arbiter tries to start data replication if cannot find itself in config after restart Created: 08/Mar/18 Updated: 29/Oct/23 Resolved: 02/Nov/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.6.3, 3.7.2 |
| Fix Version/s: | 4.2.11, 4.0.22, 3.6.22, 4.4.3, 5.0.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | William Schultz (Inactive) | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | former-quick-wins |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | help_6042.js |
| Issue Links: | depends on "Safe Reconfig" |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4, v4.2, v4.0, v3.6, v3.4 |
| Steps To Reproduce: | See the attached repro script help_6042.js. |
| Sprint: | Repl 2020-08-10, Repl 2020-09-07, Repl 2020-11-02, Repl 2020-11-16 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 50 |
| Description |

Consider the following scenario: an arbiter restarts and cannot find itself in the local replica set config, then tries to start data replication. See the attached repro script help_6042.js.

The fundamental issue is that we should not be starting data replication if we are an arbiter.

To fix this, one approach may be to never start data replication if we can't find ourselves in the local replica set config. If we can't find ourselves in the config, we should enter the REMOVED state, and we shouldn't need to start replicating until we become a proper member of the replica set. We could perhaps rely on ReplicationCoordinatorImpl::_heartbeatReconfigStore to make sure that we start data replication whenever a heartbeat brings us back into the config as a valid node; it already has a check for whether or not we should start data replication.
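The attached help_6042.js is not reproduced in this export. As a rough sketch only (an editor's assumption, not the attached script), the setup might look like the following jstest, where restarting the arbiter on a fresh port is one hypothetical way to make it unable to find itself in its on-disk config; ReplSetTest, getArbiter, and allocatePort are the standard shell test helpers:

```javascript
// Hypothetical repro sketch (NOT the attached help_6042.js): two data-bearing
// nodes plus an arbiter; restart the arbiter on a new port so that it cannot
// find itself in its locally stored replica set config.
const rst = new ReplSetTest({nodes: [{}, {}, {rsConfig: {arbiterOnly: true}}]});
rst.startSet();
rst.initiate();

const arbiter = rst.getArbiter();

// The arbiter's on-disk config still lists its old host:port, so on restart it
// cannot find itself in the config and enters REMOVED. Before the fix, it then
// attempted to start data replication anyway and tripped an invariant.
rst.restart(arbiter, {port: allocatePort()});
```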
| Comments |
| Comment by Githook User [ 20/Nov/20 ] |

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>
Message: (cherry picked from commit 72aacd4ffaf6500777a8a51f87b0797f8ea8ad0b)
| Comment by Githook User [ 20/Nov/20 ] |

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>
Message: (cherry picked from commit 72aacd4ffaf6500777a8a51f87b0797f8ea8ad0b)
| Comment by Githook User [ 09/Nov/20 ] |

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>
Message: (cherry picked from commit 72aacd4ffaf6500777a8a51f87b0797f8ea8ad0b)
| Comment by Githook User [ 09/Nov/20 ] |

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>
Message: (cherry picked from commit 72aacd4ffaf6500777a8a51f87b0797f8ea8ad0b)
| Comment by Githook User [ 02/Nov/20 ] |

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>
Message:
| Comment by Siyuan Zhou [ 09/Sep/20 ] |

judah.schvimer, you are right that this ticket proposes that we shouldn't start data replication in the REMOVED state. However, that diverges from the current convention that data replication is always running in the REMOVED state. As a result, huayu.ouyang found that her patch doesn't start data replication when a node starts up after restoring from a backup whose config does not include the node, meaning the node is REMOVED. The two solutions mentioned above are for this new problem.

I feel this ticket needs more investigation, so I'd suggest holding off on it. CC tess.avitabile. Huayu has other project tickets in this iteration.
| Comment by Ali Mir [ 10/Aug/20 ] |

In my investigation, I've verified that this bug exists on both master and 4.4. The arbiter will not be able to find itself in the config (after calling validateConfigForStartup and then findSelfInConfig()) and will put itself into the REMOVED state. Afterwards, it will attempt to start data replication, and an invariant will fail. Will's solution sounds reasonable to me. If the node is put into the REMOVED state and never starts data replication, it will remain REMOVED until the primary is issued a reconfig to re-add the node to the config, after which it can identify itself as an arbiter.
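For illustration, here is a hedged sketch of the flow Ali describes, continuing the hypothetical repro above (the assertions are the editor's assumptions about the post-fix behavior, not code from the patch):

```javascript
// With the fix, the restarted arbiter sits in REMOVED instead of crashing;
// replSetGetStatus on a REMOVED node fails with InvalidReplicaSetConfig.
assert.commandFailedWithCode(arbiter.adminCommand({replSetGetStatus: 1}),
                             ErrorCodes.InvalidReplicaSetConfig);

// A reconfig on the primary re-adds the arbiter at its new host:port (member
// index 2 matches the hypothetical config above), after which the node can
// identify itself as an arbiter again.
const primary = rst.getPrimary();
const cfg = rst.getReplSetConfigFromNode(0);
cfg.members[2].host = arbiter.host;
cfg.version++;
assert.commandWorked(primary.adminCommand({replSetReconfig: cfg}));
```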
| Comment by Siyuan Zhou [ 21/Jul/20 ] |

ali.mir, could you please investigate this issue and see if it can happen on master and 4.4? The goal is to understand the problem and the proposed solution, and to try to reproduce the bug on master and 4.4.
| Comment by Judah Schvimer [ 06/Jan/20 ] |

This is marked as "depends on" Safe Reconfig. We will reconsider this after that project completes.
| Comment by Louisa Berger [ 19/Mar/18 ] |

Not sure if this makes a difference in your backporting, etc., but I also just saw this on 3.4.13 on Debian.