[SERVER-48480] Abort initial sync upon transition to REMOVED state
Created: 28/May/20  Updated: 29/Oct/23  Resolved: 10/Jun/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.4.0

Type: Bug
Priority: Major - P3
Reporter: Vesselina Ratcheva (Inactive)
Assignee: Matthew Russotto
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Issue split
split to SERVER-48530 Relax invariant around timestamping f... Closed
Related
is related to SERVER-35649 Nodes removed due to isSelf failure s... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2020-06-15
Participants:
Linked BF Score: 18

 Description   

A transition to REMOVED state during initial sync would probably cause the initial sync attempt to fail eventually, but it might be worth writing something that checks for REMOVED state proactively so that we fail quickly and gracefully.
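
A minimal sketch of what such a proactive check could look like, assuming a state-change hook and a step-wise sync loop. `MemberState`, `InitialSyncAttempt`, `onMemberStateChange`, and `runStep` are hypothetical names for illustration only, not the server's actual classes or the shape of the committed fix.

```cpp
#include <atomic>

// Hypothetical, simplified stand-ins; these are not the server's real types.
enum class MemberState { Startup2, Secondary, Removed };

class InitialSyncAttempt {
public:
    // Assumed hook: invoked whenever the node's replica set member state changes.
    void onMemberStateChange(MemberState newState) {
        if (newState == MemberState::Removed) {
            _abortRequested.store(true);
        }
    }

    // Performs one unit of sync work. Returns false, without doing the work,
    // once an abort has been requested, so the caller can stop promptly.
    bool runStep() {
        if (_abortRequested.load()) {
            return false;
        }
        ++_stepsCompleted;
        return true;
    }

    int stepsCompleted() const { return _stepsCompleted; }

private:
    std::atomic<bool> _abortRequested{false};
    int _stepsCompleted = 0;
};
```

The point of the sketch is that the attempt stops at the next step boundary after the node enters REMOVED, rather than continuing until some later operation happens to fail.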



 Comments   
Comment by Githook User [ 09/Jun/20 ]

Author: Matthew Russotto <matthew.russotto@10gen.com> (mtrussotto)

Message: SERVER-48480 Abort initial sync upon transition to REMOVED state
Branch: v4.4
https://github.com/mongodb/mongo/commit/0ba63f264cc0be3bbc77e35ed94306c394ca95d9

Comment by Judah Schvimer [ 01/Jun/20 ]

requiresGhostCommitTimestampForCatalogWrite is used for deciding to ghost timestamp index build writes. One idea is to include REMOVED in this check.

louis.williams confirms this is not entirely safe. He does note that on 4.6+, "It’s safe on master because single-phase builds only happen synchronously with replication and don’t use ghost timestamps any more. Single-phase = empty collection".

So I think on 4.4 we need to abort initial sync on entering REMOVED. On 4.6+ we just want to add REMOVED to this check to relax the invariant.
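
The per-branch decision proposed in this comment can be summarized as a small sketch. All names here (`MemberState`, `Branch`, `Action`, `actionDuringInitialSync`) are hypothetical and exist only to illustrate the proposal; they do not correspond to server code.

```cpp
// Hypothetical, simplified stand-ins; these are not the server's real types.
enum class MemberState { Secondary, Removed };
enum class Branch { V44, Later };  // "Later" stands for 4.6+ / master
enum class Action {
    ContinueUnchanged,             // state is not REMOVED, nothing to do
    AbortInitialSync,              // proposed 4.4 behavior
    ContinueWithRelaxedInvariant   // proposed 4.6+ behavior
};

// Encodes the proposal: abort on 4.4, relax the timestamping invariant on 4.6+.
Action actionDuringInitialSync(MemberState state, Branch branch) {
    if (state != MemberState::Removed) {
        return Action::ContinueUnchanged;
    }
    return branch == Branch::V44 ? Action::AbortInitialSync
                                 : Action::ContinueWithRelaxedInvariant;
}
```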

Comment by Daniel Gottlieb (Inactive) [ 01/Jun/20 ]

Given REMOVED can be ephemeral and not terminal, the choice of what to do is more interesting. The way I see it, the invariant is a tool. Writes that happen in a REMOVED state have so far fallen entirely under replication's responsibility, so it's your call on whether you want to opt in to that tool or not.

Comment by Judah Schvimer [ 01/Jun/20 ]

daniel.gottlieb, are you alright with relaxing the invariant? If we think that's safe, that is preferable to aborting initial sync. REMOVED can be a transient state if a subsequent reconfig adds a node back in, or if the transition into REMOVED is due to a DNS error and nodes retry to find themselves in their config with SERVER-35649. If it's transient, not having failed initial sync could save a lot of time.

We would only want to relax the invariant if timestamping in REMOVED will occur correctly.

Comment by Daniel Gottlieb (Inactive) [ 01/Jun/20 ]

SERVER-42497 is only in 4.4

Generated at Thu Feb 08 05:17:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.