[SERVER-48480] Abort initial sync upon transition to REMOVED state Created: 28/May/20 Updated: 29/Oct/23 Resolved: 10/Jun/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vesselina Ratcheva (Inactive) | Assignee: | Matthew Russotto |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Repl 2020-06-15 |
| Participants: | |
| Linked BF Score: | 18 |
| Description |
|
That would probably cause the initial sync attempt to fail eventually, but it might be worth writing something that checks for REMOVED state proactively and helps us fail quickly and gracefully. |
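The proactive check suggested above could look roughly like the following. This is a minimal, self-contained sketch and not the server's implementation: `MemberState`, `InitialSyncAborted`, `run_initial_sync`, and the `get_member_state` callback are all hypothetical names invented for illustration. The idea is only that each initial sync step consults the current member state first and aborts immediately on REMOVED instead of waiting for a later failure.

```python
from enum import Enum, auto


class MemberState(Enum):
    # Hypothetical subset of replica set member states, for illustration only.
    STARTUP2 = auto()
    SECONDARY = auto()
    REMOVED = auto()


class InitialSyncAborted(Exception):
    """Raised when the (hypothetical) sync loop sees the node in REMOVED."""


def run_initial_sync(steps, get_member_state):
    """Run named steps in order, checking the member state before each one.

    steps: list of (name, callable) pairs.
    get_member_state: callable returning the node's current MemberState.
    Returns the names of the steps that completed.
    """
    completed = []
    for name, step in steps:
        # Proactive check: fail fast and cleanly rather than letting a
        # later operation fail for a less obvious reason.
        if get_member_state() is MemberState.REMOVED:
            raise InitialSyncAborted(
                f"node entered REMOVED before step {name!r}"
            )
        step()
        completed.append(name)
    return completed
```

In this sketch a node that is removed between two steps stops before the second step starts, and the caller receives a single, explicit error that it can log or use to decide whether to retry.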
| Comments |
| Comment by Githook User [ 09/Jun/20 ] |
|
Author: {'name': 'Matthew Russotto', 'email': 'matthew.russotto@10gen.com', 'username': 'mtrussotto'} Message: |
| Comment by Judah Schvimer [ 01/Jun/20 ] |
|
requiresGhostCommitTimestampForCatalogWrite is used for deciding to ghost timestamp index build writes. One idea is to include REMOVED in this check. louis.williams confirms this is not entirely safe. He does note that on 4.6+, "It’s safe on master because single-phase builds only happen synchronously with replication and don’t use ghost timestamps any more. Single-phase = empty collection". So I think on 4.4 we need to abort initial sync on entering REMOVED. On 4.6+ we just want to add REMOVED to this check to relax the invariant. |
| Comment by Daniel Gottlieb (Inactive) [ 01/Jun/20 ] |
|
Given REMOVED can be ephemeral and not terminal, the choice of what to do is more interesting. The way I see it, the invariant is a tool. Writes that happen in a REMOVED state have so far fallen entirely under replication's responsibility, so it's your call on whether you want to opt in to that tool or not. |
| Comment by Judah Schvimer [ 01/Jun/20 ] |
|
daniel.gottlieb, are you alright with relaxing the invariant? If we think that's safe, that is preferable to aborting initial sync. REMOVED can be a transient state if a subsequent reconfig adds a node back in, or if the transition into REMOVED is due to a DNS error and nodes retry to find themselves in their config with We would only want to relax the invariant if timestamping in REMOVED will occur correctly. |
| Comment by Daniel Gottlieb (Inactive) [ 01/Jun/20 ] |
|
|