[SERVER-32185] Freshly synced secondaries respond to queries before their "sync time" Created: 06/Dec/17  Updated: 27/Oct/23  Resolved: 08/Dec/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Backlog - Replication Team
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-30809 Investigating remaining writes to the... Closed
Related
related to SERVER-32237 Nodes that cannot become primary must... Closed
is related to SERVER-32226 oldest_timestamp should track the las... Closed
is related to SERVER-30577 Clear list of stable timestamp candid... Closed
Assigned Teams:
Replication
Operating System: ALL
Backport Requested:
v3.6
Participants:

 Description   

The last phase of a secondary performing initial sync is to apply oplog operations up through some time `T` representing when the collection cloning phase completed. It's incorrect for a secondary to respond to majority-read or read-at-a-timestamp queries before time `T`.

When a secondary comes out of initial sync, it will still have a notion of the replica set's majority commit time. Because the majority commit time is translated to a "read at a timestamp", the secondary will respond to such a query, but incorrectly, with an inconsistent view of the data.
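
To make the failure mode concrete, here is a toy sketch of how a read below `T` can return a state that never existed on the sync source. Everything in it is an assumption for illustration: `ToyStore`, the sample oplog, and the choice to model cloned documents as versions at timestamp 0 are hypothetical and are not the server's actual storage or replication code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Toy multi-version store: key -> list of (timestamp, value)."""
    versions: dict = field(default_factory=dict)

    def write(self, key, ts, value):
        self.versions.setdefault(key, []).append((ts, value))

    def read_at(self, key, ts):
        """Return the newest value written at or before ts, or None."""
        visible = [(t, v) for t, v in self.versions.get(key, []) if t <= ts]
        return max(visible, key=lambda tv: tv[0])[1] if visible else None


def snapshot(store, ts):
    """Read both keys at the same timestamp."""
    return (store.read_at("a", ts), store.read_at("b", ts))


# Hypothetical oplog on the sync source: (timestamp, key, value).
OPLOG = [(1, "a", 10), (2, "b", 10), (3, "a", 20), (4, "b", 20)]

primary = ToyStore()
for ts, key, value in OPLOG:
    primary.write(key, ts, value)

secondary = ToyStore()
# Cloning phase: "a" is copied while the source is at ts=1, "b" while it is at ts=4.
# Assumption: cloned documents carry no history, modeled here as timestamp 0.
secondary.write("a", 0, primary.read_at("a", 1))
secondary.write("b", 0, primary.read_at("b", 4))

# Oplog application phase: replay operations up through T.
T = 4
for ts, key, value in OPLOG:
    if ts <= T:
        secondary.write(key, ts, value)

print(snapshot(secondary, T), snapshot(primary, T))  # both (20, 20)
print(snapshot(secondary, 1), snapshot(primary, 1))  # (10, 20) vs (10, None)
```

In this model, a read at `T` agrees with the sync source, while a read at a majority commit time of 1 returns `a=10, b=20`, a combination the source never held.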

A couple of starting points for solutions:

  1. An API was introduced for "recover to a stable timestamp", known as the "initial data timestamp", which replication sets when initial sync completes. This represents the timestamp at which the data is in a consistent state. It could be used to reject or block incoming majority reads and read-at-a-timestamp requests.
  2. Alternatively, a secondary can refuse to come out of initial sync until the majority commit point passes `T`. Currently there is no mechanism to tell drivers which timestamps a node can service reads for. This solution would be a way to signal to drivers to not send majority reads the node cannot service, at the cost of not participating in reads `>= T`.
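
The two starting points above can be sketched as simple timestamp comparisons. This is a minimal illustration, not the server's implementation: the names `check_read_timestamp`, `can_leave_initial_sync`, and `ReadBeforeConsistentPoint` are hypothetical, plain integers stand in for timestamps, and treating "passes `T`" as `>= T` is an assumption.

```python
class ReadBeforeConsistentPoint(Exception):
    """Hypothetical error for a read the node cannot service consistently."""


def check_read_timestamp(read_ts, initial_data_ts):
    """Option 1: reject reads at timestamps before the initial data timestamp."""
    if read_ts < initial_data_ts:
        raise ReadBeforeConsistentPoint(
            f"read at {read_ts} is before initial data timestamp {initial_data_ts}"
        )


def can_leave_initial_sync(majority_commit_point, sync_time):
    """Option 2: stay in initial sync until the majority commit point reaches T.

    Assumption: "passes T" is modeled as >= T.
    """
    return majority_commit_point >= sync_time


# Usage with T = 4:
T = 4
check_read_timestamp(5, T)            # accepted, returns None
print(can_leave_initial_sync(2, T))   # False: majority commit point is still behind T
print(can_leave_initial_sync(4, T))   # True
```

Option 1 keeps the node available and fails individual reads; option 2 avoids the failing reads entirely by delaying the state transition.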


 Comments   
Comment by Daniel Gottlieb (Inactive) [ 08/Dec/17 ]

I think I flubbed making this ticket. After talking with judah.schvimer and taking a fresh look at trying to reproduce with logs on master, I think what I was observing was really SERVER-32187.

Comment by Judah Schvimer [ 08/Dec/17 ]

Per conversation with daniel.gottlieb, it appears that doing majority reads at the stable timestamp should be sufficient. We seem to be doing exactly this, so it's unclear what's going on. This will still leave us open to a rollback right after initial sync requiring a resync. From SERVER-32237, that may be impossible to avoid without requiring users to initiate their nodes as non-voting and then reconfig them to be voting members.

Comment by Judah Schvimer [ 06/Dec/17 ]

I like the user visible behavior of staying in initial sync rather than restarting initial sync if we roll back shortly after leaving initial sync. My only concern is relying on secondaries in initial sync to commit writes. Per conversation with milkie, this is no different than behavior we have today, and it should work, but it definitely feels weird.

Comment by Daniel Gottlieb (Inactive) [ 06/Dec/17 ]

"I suggest we just never set the committed snapshot to an inconsistent snapshot. We should be able to do something similar to SERVER-30577."

That may be a reasonable option as well. I didn't think replication kept its "moved out of initial sync" time and that's why we introduced a setInitialDataTimestamp time. But, if I'm wrong, I don't see any fundamental reason why your suggestion wouldn't work.

Comment by Eric Milkie [ 06/Dec/17 ]

In 3.6 we no longer create any named snapshots, so there is no longer any "blessing" mechanism – the logic is completely different now.

Comment by Daniel Gottlieb (Inactive) [ 06/Dec/17 ]

3.6 yes, 3.4 I don't think so. This should be backported, yes.

Comment by Judah Schvimer [ 06/Dec/17 ]

Why are we blessing snapshots as "committed" if they're inconsistent? I think we already have a mechanism for blocking reads when no majority snapshot is available. I suggest we just never set the committed snapshot to an inconsistent snapshot. We should be able to do something similar to SERVER-30577.

Comment by Judah Schvimer [ 06/Dec/17 ]

daniel.gottlieb, Does this affect 3.6 and need to be backported?
