[SERVER-46499] revisit minOpTime logic in serverSelector Created: 28/Feb/20 Updated: 12/Dec/23

| Status: | Open |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.3 Desired |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Lamont Nelson | Assignee: | Backlog - Cluster Scalability |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Cluster Scalability |
| Participants: | |
| Description |
This logic was found to be required for test conformance. Currently, the server selector code does the equivalent of the old RSM implementation, which is: filter the candidate servers to those whose opTime is at least minOpTime, and if no servers qualify, ignore minOpTime and select from all otherwise-eligible servers.

I suspect that, from the client's perspective, this has similar runtime characteristics as just ignoring the minOpTime altogether. We should validate whether this is true, and if so remove this logic.
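To make the fallback concrete, here is a minimal sketch of the selection logic described above. This is not the actual server selector code; OpTime, ServerDescription, and selectServers are hypothetical stand-ins for illustration only.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a replication optime (term, timestamp).
struct OpTime {
    long long term = 0;
    long long timestamp = 0;
    bool operator>=(const OpTime& other) const {
        return term != other.term ? term > other.term : timestamp >= other.timestamp;
    }
};

// Hypothetical stand-in for an SDAM server description.
struct ServerDescription {
    OpTime opTime;
    int pingMillis = 0;
};

// Keep only candidates at or beyond minOpTime; if none qualify, ignore
// minOpTime entirely (the "quirk" discussed in the comments below).
std::vector<ServerDescription> selectServers(const std::vector<ServerDescription>& candidates,
                                             const OpTime& minOpTime) {
    std::vector<ServerDescription> fresh;
    std::copy_if(candidates.begin(), candidates.end(), std::back_inserter(fresh),
                 [&](const ServerDescription& s) { return s.opTime >= minOpTime; });
    return fresh.empty() ? candidates : fresh;
}
```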
| Comments |
| Comment by Lauren Lewis (Inactive) [ 21/Dec/21 ] |
We haven't heard back from you in at least 1 year, so I'm going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.
| Comment by Lamont Nelson [ 19/Aug/20 ] |
Thanks for the analysis; at the time this was written there wasn't enough time to investigate further. I was operating under the assumption that this property was optional and not required for correctness, since that is the current implementation, which makes me question whether it is actually an effective one. I'm not sure either way without measurement, but I suspect that for the types of commands being run on the config server, performance is probably about the same.

However, with your analysis it seems minOpTime (now renamed) is also required for safety in some cases. Since the behavior is optional, this cannot be a safety property unless it's also enforced on the other end, as with _exhaustiveFindOnConfig. So while most of the time the commands will run on an up-to-date server, when the system is under duress any server may be a target (all other things remaining equal). Intuitively, I think that when things aren't operating in the normal manner is precisely when users want stronger guarantees, not weaker. I find this hard to reason about, and suspect there is some subset of commands where this is undesired behavior.

Even if there are no undesired effects currently, I do find it confusing from an API perspective to have this option named minOpTime on read preference because, without knowing these details, one might reasonably think that this is a safety property and act accordingly. Maybe in …
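To illustrate the distinction drawn here between best-effort targeting and an enforced recency guarantee, a rough sketch follows. It reuses the hypothetical OpTime and ServerDescription types from the sketch in the Description; waitUntilOpTime is likewise invented, meant only to evoke the spirit of the server-side readConcern afterOpTime wait, not the real implementation.

```cpp
#include <chrono>
#include <thread>

// Server-side enforcement (sketch): the targeted node blocks until its own
// optime has advanced far enough, so recency holds even if client-side
// targeting picked a stale node. Client-side minOpTime filtering alone
// cannot promise this, because it silently falls back when no candidate
// satisfies it. In the real system, replication advances self.opTime
// concurrently while this loop polls.
bool waitUntilOpTime(ServerDescription& self, const OpTime& afterOpTime,
                     std::chrono::milliseconds timeout) {
    auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!(self.opTime >= afterOpTime)) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;  // a severely lagged node may never catch up
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return true;
}
```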
| Comment by Kevin Pulo [ 12/Aug/20 ] |
I've been looking into this on …

The only users of minOpTime (AFAICT) are ShardRemote::_exhaustiveFindOnConfig() and ShardRemote::_scheduleCommand(). In both cases they are used to target a config server that is "current" according to configOpTime, rather than one which has an optime that is less than configOpTime.

In the case of _scheduleCommand(), there is no other mechanism for ensuring that the command is run on a non-stale configsvr (if one is known). Without minOpTime, the command will just execute on the nearest configsvr, regardless of how stale it might be. This might not be good, and is an actual change in behaviour (I don't know which commands actually use this, or if any of them actually rely on this behaviour).

In the case of _exhaustiveFindOnConfig(), configOpTime is also used for the afterOpTime field of the readConcern. In this sense, the minOpTime is an "optional optimisation", because it is the read concern's afterOpTime which ensures that the correct data is read and returned, i.e. if the query targets a stale configsvr, that node will wait for its optime to advance to afterOpTime (configOpTime) before processing the read.

However, this assumes that the stale configsvr will soon reach configOpTime, which may not be true. Consider the case of three configsvrs, where one is unhealthy and severely lagged or completely stuck (i.e. its optime is not advancing), and this unhealthy node has the lowest ping time. The cluster as a whole is healthy, because the other two configsvrs are maintaining a majority and majority writes. With minOpTime, queries (and commands) will be correctly targeted to one of the two healthy configsvrs. However, without minOpTime, the queries (and commands) will end up going to the unhealthy node, which may never return a result (or it might be very delayed).

Considering that users/admins reasonably expect the cluster to be resilient to a minority of stale configsvrs while whatever the underlying root cause is fixed (e.g. bad hardware, or whatever), in part because that's how it behaves today, removing the minOpTime targeting would add fragility to the cluster. So I think that the minOpTime targeting is still required (including the quirk where it gets ignored if it can't be satisfied).
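Kevin's three-configsvr scenario can be made concrete with the hypothetical selectServers sketch from the Description; the optime values and the nearest-ping tie-break below are invented for illustration and are not the server's actual selection code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

int main() {
    // Three configsvrs: the stuck node is severely lagged but has the
    // lowest ping time, so plain nearest-ping selection favours it.
    ServerDescription stuck{{/*term*/ 1, /*timestamp*/ 100}, /*pingMillis*/ 1};
    ServerDescription healthyA{{1, 500}, 5};
    ServerDescription healthyB{{1, 500}, 7};
    std::vector<ServerDescription> all{stuck, healthyA, healthyB};
    OpTime configOpTime{1, 500};

    // With minOpTime = configOpTime, the stuck node is filtered out
    // before the nearest-ping tie-break ever sees it.
    auto withMin = selectServers(all, configOpTime);
    assert(withMin.size() == 2);

    // Without minOpTime (simulated with a trivial floor), nearest-ping
    // selection lands on the stuck node, and a read gated only by
    // afterOpTime could stall there indefinitely.
    auto withoutMin = selectServers(all, OpTime{0, 0});
    auto nearest = *std::min_element(
        withoutMin.begin(), withoutMin.end(),
        [](const ServerDescription& a, const ServerDescription& b) {
            return a.pingMillis < b.pingMillis;
        });
    assert(nearest.pingMillis == 1);  // the unhealthy node wins
    return 0;
}
```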