[SERVER-44603] Consider having tailable readPreference "primary" queries killed on stepdown
Created: 13/Nov/19
Updated: 07/Apr/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: 4.2.0, 4.3.1
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive)
Assignee: Alan Zheng
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-21537 chainingAllowed = false not being enf... Closed
is related to DOCS-13520 Update read preference documentation ... Closed
is related to SERVER-39621 Disabled chaining should enforce sync... Closed
Participants:
Case:

 Description   

Queries with an explicit readPreference: "primary" are currently allowed to survive stepdown. This behavior is reasonable when the results are bounded, i.e. some results were already returned and the remaining results from a getMore are just as consistent as if the node were still a primary.

However, for clients tailing a capped collection (e.g. the oplog), the driver and server can no longer guarantee that a node remains primary for as long as a query opened against it as primary stays open. Applications that need this guarantee must implement it themselves, for example by periodically re-issuing the query or by monitoring the replica set state through a side channel.
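
As an illustration of the side-channel workaround, here is a minimal client-side sketch in Python. It is not part of the ticket: `tail_while_primary`, `fetch_hello`, and `NoLongerPrimary` are hypothetical names. It assumes the application can obtain the node's current `hello` (or legacy `isMaster`) reply through some separate call. The check narrows the window but does not close it, since a node could step down and be re-elected between checks.

```python
from typing import Any, Callable, Iterable, Iterator, Mapping


class NoLongerPrimary(Exception):
    """Hypothetical error: the side-channel check saw a non-primary node."""


def is_primary(hello_reply: Mapping[str, Any]) -> bool:
    # `isWritablePrimary` is the field in `hello` replies; `ismaster` is the
    # field in legacy `isMaster` replies. A missing field counts as "not primary".
    return bool(hello_reply.get("isWritablePrimary",
                                hello_reply.get("ismaster", False)))


def tail_while_primary(
    batches: Iterable[list],
    fetch_hello: Callable[[], Mapping[str, Any]],
) -> Iterator[Any]:
    """Yield documents batch by batch, re-checking the node's role each time.

    `batches` stands in for successive getMore results from a tailable cursor;
    `fetch_hello` stands in for a separate call that returns the node's current
    `hello` reply. Both are assumptions made for this sketch.
    """
    for batch in batches:
        # The batch has already been fetched at this point, so a stepdown that
        # happened before or during the fetch is caught before results are used.
        if not is_primary(fetch_hello()):
            raise NoLongerPrimary("node stepped down; re-target and re-issue the query")
        yield from batch
```

On `NoLongerPrimary`, the caller would rediscover the primary and re-issue the tailing query from its last seen position.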



 Comments   
Comment by Bernard Gorman [ 10/Aug/20 ]

I agree with the original ticket description re: the distinction between a bounded regular query and a tailable cursor, and I can see a fair case that users would desire different behaviour for each. However, I think Arnie is correct that there are also plenty of occasions where a user would prefer a long-running regular query to stay off the new Primary if a node steps up during election, or conversely where they might want the query to migrate to the new Primary on stepdown. But we obviously can't revert to something like the old 4.0 (?) behaviour where we kill all queries on stepdown, as that would be far too disruptive to anything that isn't a change stream (though PM-915 may eliminate this difference to a large extent).

Since this seems like a case where deciding on the most desirable behaviour is a toss-up and is as likely to annoy customers as to help them, why not give them the option to choose the appropriate behaviour?

What if we were to introduce a new parameter in the readPreference spec, something like {strict: <boolean>} or, more explicitly, {reassessAfterElection: <boolean>}? This would default to 'false' to maintain the current behaviour, but if the client sets it to 'true' then we would revalidate the read preference each time we check out a cursor. That way, a find, aggregate or getMore which is running during an election would be allowed to complete, but the following getMore will throw InterruptedDueToReplStateChange if the node's new role no longer satisfies the read preference. Any operations which are resumable would then be re-targeted to the appropriate post-election nodes and re-issued.

That way, customers could choose (on a per-operation basis) whether they want to prioritise cross-election query survival OR keeping workloads on/off particular nodes. The current behaviour would be maintained by default and would therefore not be a Versioned API violation, and the change would be relatively simple - we would not have to build new machinery to proactively seek out and kill operations every time there's an election.
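
The revalidation step proposed above could be modelled roughly as follows. This is only an illustrative Python sketch, not server code: `reassess_after_election` mirrors the suggested `reassessAfterElection` option, which does not exist in any released read preference spec, and `check_out_cursor` and `satisfies` are hypothetical names. Treating the "preferred" and "nearest" modes as satisfied by either role is a simplification.

```python
from dataclasses import dataclass


class InterruptedDueToReplStateChange(Exception):
    """Stand-in for the server error of the same name."""


@dataclass
class ReadPreference:
    mode: str  # "primary", "secondary", "primaryPreferred", ...
    reassess_after_election: bool = False  # hypothetical option from the proposal


def satisfies(mode: str, node_is_primary: bool) -> bool:
    """Return True if a node in the given role can serve this read preference mode."""
    if mode == "primary":
        return node_is_primary
    if mode == "secondary":
        return not node_is_primary
    # Simplification: primaryPreferred, secondaryPreferred and nearest
    # are treated as satisfied by either role.
    return True


def check_out_cursor(pref: ReadPreference, node_is_primary: bool) -> None:
    """Revalidate the read preference when a getMore checks out its cursor.

    With the flag unset (the default), the cursor survives a role change,
    matching current behaviour. With the flag set, a role that no longer
    satisfies the read preference interrupts the operation.
    """
    if pref.reassess_after_election and not satisfies(pref.mode, node_is_primary):
        raise InterruptedDueToReplStateChange(
            f"node role no longer satisfies readPreference {pref.mode!r}"
        )
```

In this model an operation already in progress during the election is untouched, because the check runs only at the next cursor checkout.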

Comment by Andy Schwerin [ 17/Mar/20 ]

Oplog tailing aside, I think the current behavior is correct. For the changestream and oplog tailing case, I'm less certain, because those queries are logically moving through time. I'm still not thrilled with hanging up on stepdown for queries that don't have a clear route to resumption (not restart), but maybe for collections where resumption is possible (oplog/changestream) this change or a similar one could make sense.

Comment by Daniel Gottlieb (Inactive) [ 13/Nov/19 ]

Alternatively, it may be worthwhile to update the documentation.

Default mode. All operations read from the current replica set primary.
