[SERVER-49533] Remove blocking work from Balancer's onStepUp methods Created: 15/Jul/20  Updated: 27/Oct/23  Resolved: 15/Nov/21

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-49532 Remove OperationContext argument from... Backlog
is depended on by SERVER-49534 Remove ReplicaSetAwareService::onStep... Backlog
Assigned Teams:
Sharding EMEA
Participants:

 Description   

We should avoid doing blocking work in the stepUp path as much as possible. The balancer, however, does a blocking wait for the previous balancer run to complete, as well as database i/o, as part of stepUp.

Instead, the balancer's onStepUpComplete method should schedule a thread to do this work asynchronously, without blocking stepup.
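
For illustration, here is a minimal standalone sketch (not the actual mongod code) of the idea above, assuming a hypothetical BalancerSketch class; only the onStepUpComplete hook name comes from this ticket, everything else is illustrative.

{code:cpp}
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

class BalancerSketch {
public:
    // Called from the stepUp path; records the term and returns immediately.
    void onStepUpComplete(long long term) {
        _term.store(term);
        // Hand the blocking work (waiting for the previous balancer round to
        // finish, reading balancer settings from disk, etc.) to a worker
        // thread so stepUp itself never waits on it.
        std::thread([this, term] { _recoverAndStart(term); }).detach();
    }

private:
    void _recoverAndStart(long long term) {
        // Placeholder for the blocking recovery work described above.
        std::cout << "balancer recovery running for term " << term << "\n";
    }

    std::atomic<long long> _term{0};
};

int main() {
    BalancerSketch balancer;
    balancer.onStepUpComplete(42);  // returns without blocking
    // Give the detached worker a moment to run before the sketch exits.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return 0;
}
{code}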



 Comments   
Comment by Kaloian Manassiev [ 15/Nov/21 ]

While there is some amount of blocking work under the Balancer stepUp/stepDown, it is only performing local reads of a very small number of documents. Given that the amount of work necessary to fully remove it is large and its impact on stepUp is negligible, we will not do this ticket.

Comment by Spencer Brody (Inactive) [ 03/Aug/20 ]

Siyuan and I discussed this a bit over Slack, but I figured I'd clarify here for posterity.

I don't believe that the addition of the PrimaryOnlyService API eliminates the need for or usefulness of the ReplicaSetAwareService API. The latter is lower level and lighter weight - all it does is let you get notified when a state transition into or out of Primary happens, but there's no extra machinery assisting or dictating how you respond to that fact. PrimaryOnlyService is a subclass of ReplicaSetAwareService that provides more functionality but is also more prescriptive of how it is used. Specifically the PrimaryOnlyService API is for state machines with a single persisted state document for each instance of the service, which need to be resumed where they left off after failover. The Balancer, for example, wouldn't fit nicely into the PrimaryOnlyService model, at least not in its current form.

Therefore I think deciding on the model of how ReplicaSetAwareServices should operate and handle synchronizing stepUp with service initialization work is something we want to decide independently of the PrimaryOnlyService project.
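
To make the contrast concrete, here is a rough, hypothetical sketch of the two shapes described above; apart from onStepUpComplete, none of these names or signatures are the real server API.

{code:cpp}
#include <string>

// Lower level and lighter weight: just notifications when the node moves
// into or out of the primary state, with no machinery dictating the response.
class ReplicaSetAwareServiceSketch {
public:
    virtual ~ReplicaSetAwareServiceSketch() = default;
    virtual void onStepUpComplete(long long term) = 0;  // became primary
    virtual void onStepDown() = 0;                      // no longer primary
};

// Higher level, per the comment above: a state machine with a single
// persisted state document per instance, resumed where it left off after
// a failover.
class PrimaryOnlyServiceSketch : public ReplicaSetAwareServiceSketch {
public:
    // Rebuild an instance from its persisted state document and continue
    // from the recorded step.
    virtual void resumeFromStateDocument(const std::string& stateDocJson) = 0;
};
{code}

The point of the sketch is only that the lower-level interface carries no state-document machinery, which is why something like the Balancer fits it without fitting the higher-level model.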

Comment by Tess Avitabile (Inactive) [ 24/Jul/20 ]

In the implementation of streamable isMaster, we ended up sending 3 isMaster responses as part of stepup, which you can see in this test:

  • After entering drain mode (the 'primary' field exists, but ismaster=false)
  • After the reconfig to bump the term on stepup (the 'primary' field exists, but ismaster=false)
  • After exiting drain mode (ismaster=true)

Clients won't set the state to RSPrimary until ismaster=true (see SDAM), but it might be possible to change SDAM to do that.

When possible, it's preferable to block writes instead of causing retries, since retries add a network roundtrip, but I agree that retryability makes throwing more of a viable option.

Comment by Judah Schvimer [ 23/Jul/20 ]

> Or are you referring to Spencer's suggestion that services should be doing the blocking themselves so that only CRUD that touches these services is impacted?

Yes, this is what I was referring to. I still don't follow why throwing is so problematic in these cases, since most writes are retryable. For services that need to block all writes, using the drain mode mechanism makes sense to me.

Comment by Kaloian Manassiev [ 23/Jul/20 ]

What are these "some writes"? Or are you referring to Spencer's suggestion that services should be doing the blocking themselves so that only CRUD that touches these services is impacted?

We discussed this in the comments earlier, but just for your context so you don't have to read them: this is very difficult to achieve, because by the time the service is accessed it is too late - locks have already been taken, etc., so the only recourse the service has is to throw.

For the two services that I listed - CollectionShardingState and VectorClock - it is all writes that need to be blocked until they recover, because all writes will access them anyway.

Comment by Judah Schvimer [ 23/Jul/20 ]

Does the blocking work that non-replication services do during drain mode need to happen before all writes are available? Or would the alternative of blocking writes outside of drain mode, until those non-replication services do their step-up work, only block some writes? If it's all writes, it makes sense to just be synchronous with drain mode, since that's the point of drain mode. Creating another tool for the same purpose seems confusing and unnecessary. If it's only some writes that would need to block, though, then doing the non-replication step-up work asynchronously and only blocking those writes that are necessary makes sense. I think if you do the latter, coming up with some general framework in the primary-only service that facilitates this asynchronous scheduling and write blocking would be valuable.

As an aside, if we want drivers to discover primaries while they're in drain mode, there's no fundamental reason why they can't, and we could run a project to amend SDAM slightly to send isMaster responses on transition to Primary, before drain mode, and then send another isMaster update when drain mode is complete. This was suggested and deferred during the streamable isMaster design to simplify that already complicated project.

Comment by Kaloian Manassiev [ 23/Jul/20 ]

> So then the question is, as we slowly move those pieces off over time, do we continue to allow them to do blocking i/o, or do we take that as an opportunity to rework them to no longer need to do so?

Actually, now that I think about it more - the Balancer should be easy, like you say, and the TransactionCoordinator, I think, doesn't actually do any i/o under drain mode. So I think this only leaves us with the CollectionShardingState and the VectorClock recovery, which we introduced in master. Both are really quick, since they read a single document each, and changing them would either be difficult to do or have little performance benefit, because everything interacts with them and it would be as if the entire node is stalled anyway.

So my personal preference would be to not impose a restriction on doing database i/o in drain mode. I have yet to see a case where this has been a problem. We used to also do network i/o under drain mode, so the config server not being available was a problem, but as of SERVER-32198 we don't even do that anymore.

judah.schvimer, this ticket is on the Sharding Backlog currently, but I am proposing that we don't do anything about it, and spencer would like to defer that decision to you. It is about disallowing services from doing any i/o under drain mode. My claim is that this is too restrictive.
What does the repl team think (you can just read the paragraph above for context)?

Comment by Spencer Brody (Inactive) [ 22/Jul/20 ]

Drain mode has always been fairly heavyweight, unfortunately. It actually might be slightly better now as it's holding the RSTL lock in X mode rather than the full GlobalLock in X mode like it used to. The RSTL lock was added while I was away, so I'm not actually sure what operations holding it in X mode blocks.

You're right that there are a whole lot of things under the shardingOnTransitionToPrimaryHook that depend on doing blocking i/o, and changing all of them to no longer do so would be a major undertaking. When I filed this ticket I was just looking at the existing ReplicaSetAwareServices and saw that the Balancer was the only one doing blocking work, and that it would be easy to stop the Balancer from doing so. I hadn't really considered the 'shardingOnTransitionToPrimaryHook' at the time.

I see the ReplicaSetAwareService interface as a way to eventually shrink down the ReplicationCoordinatorExternalState, so ideally in the far enough future the entire 'shardingOnTransitionToPrimaryHook' method in the external state would go away, with all of its functionality moved to ReplicaSetAwareServices. So then the question is, as we slowly move those pieces off over time, do we continue to allow them to do blocking i/o, or do we take that as an opportunity to rework them to no longer need to do so?

In my personal view of the hypothetical ideal future, there would be no non-replication blocking work done as part of stepUp, but maybe I'm being unrealistic. It's possible that by filing this ticket I was more expressing what I want the world to be than what we should objectively be working towards. I still agree with the core premise that doing blocking work inside of replication stepUp is risky and a potential layer violation, but whether it's enough of a problem to justify a lot of work to undo it is probably a question best decided by others.

At this point I think I've said all I can on the subject. As I'm not even on the replication team anymore, I don't think it's my place to speak to their priorities. I was working with the ReplicaSetAwareInterface and noticed this, so I decided to file a ticket describing what I see as a potential improvement to sharpen the borders between replication and the rest of the server, but I think it's up to replication team leadership to decide if they A) even agree with my interpretation of what the ideal state would be and B) if so, how big of a priority it actually is to change.

Comment by Kaloian Manassiev [ 22/Jul/20 ]

It should not be preventing step-up, but it should be preventing reads or writes from slipping in, as you mention. My understanding was that this is what drain mode does - the node is able to step up and become primary, but it won't accept user requests. Did something happen recently that made this more heavyweight than it was before?

If I understand correctly, you are suggesting that, instead of using drain mode as a gate for all "services" that need to fulfil this requirement, each service should track its own "recovery" progress and block when a code path accesses it.

Assuming that this is the case, our current command execution path does not lend itself to this, because by the time a service is accessed (for example, the CollectionShardingState or the TransactionCoordinator), there are already locks taken on the op-context, writes might have been done, and so on. This means the only recourse we have is to throw an exception, which requires the caller to retry (StaleShardVersion is an example of this), but in some cases even that's not possible.
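
As a hedged illustration of that last point, here is a standalone sketch (hypothetical names, not server code) of a service accessor that can only fail fast with a retryable error, because blocking would happen while the operation already holds locks.

{code:cpp}
#include <atomic>
#include <stdexcept>

// Stand-in for a retryable error such as the StaleShardVersion example above.
struct RetryableServiceNotReady : std::runtime_error {
    using std::runtime_error::runtime_error;
};

class ShardingStateSketch {
public:
    void markRecovered() { _recovered.store(true); }

    // Called from the command path, by which point locks are already held on
    // the operation, so blocking here is not an option.
    void checkCanServe() const {
        if (!_recovered.load()) {
            throw RetryableServiceNotReady("metadata not yet recovered");
        }
    }

private:
    std::atomic<bool> _recovered{false};
};

int main() {
    ShardingStateSketch state;
    try {
        state.checkCanServe();
    } catch (const RetryableServiceNotReady&) {
        // A real caller would surface a retryable error to the client here.
    }
    state.markRecovered();
    state.checkCanServe();  // now succeeds
    return 0;
}
{code}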

Therefore, since we started using drain mode, which I think was in 3.2, we have used it as a gate to ensure that all services have recovered, with the requirement that each service's recovery must be "really fast".

Moving to more granular recovery would be a major undertaking, IMO.

Comment by Spencer Brody (Inactive) [ 21/Jul/20 ]

In the case you describe about checking for failed migrations for shard version recovery, should that work be preventing stepUp while it runs? If you're worried about a write slipping in before the sharding check is complete, I imagine it would be pretty straightforward to set a bit somewhere on stepUp that instructs the write path to block until the necessary 'onStepUp' sharding work is complete. The stepUp critical section is a very heavy hammer, holding strong global locks that can interfere with many operations (including monitoring/reporting ops), and one that we want to get out of as soon as possible. The sooner we complete stepUp, the sooner drivers can detect the new primary and start routing things appropriately. If there are a few milliseconds where those incoming writes block while some asynchronous stepUp work completes, that seems fine to me, and definitely preferable to holding up the whole stepUp process for that same amount of time.
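
A minimal sketch of that "set a bit on stepUp and have the write path block on it" idea, assuming a plain flag plus condition variable; all names here are hypothetical and this is not the server's actual mechanism.

{code:cpp}
#include <condition_variable>
#include <mutex>
#include <thread>

class StepUpWriteGate {
public:
    // Called when stepUp begins: incoming writes that reach the gate block.
    void beginStepUpWork() {
        std::lock_guard<std::mutex> lk(_mutex);
        _ready = false;
    }

    // Called once the asynchronous onStepUp work has finished.
    void finishStepUpWork() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _ready = true;
        }
        _cv.notify_all();
    }

    // Called from the write path before touching state the stepUp work owns.
    void waitUntilReady() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _ready; });
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _ready{true};
};

int main() {
    StepUpWriteGate gate;
    gate.beginStepUpWork();
    std::thread stepUpWork([&gate] { gate.finishStepUpWork(); });
    gate.waitUntilReady();  // an incoming write waits briefly, then proceeds
    stepUpWork.join();
    return 0;
}
{code}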

I honestly can't think of a single good reason any non-replication service should be able to get in the way of replication completing the transition to primary, and allowing them to do so seems like playing with fire.

Of course, this is all just my opinion at this point, you'd be better off talking to the current replication team to find out if they agree with my interpretation and how important it seems to them.

Comment by Kaloian Manassiev [ 21/Jul/20 ]

spencer, it is not always practical (or possible) to avoid doing database i/o as part of the stepUp path. For the balancer it doesn't matter much, but, for example, before opening up a node to become a primary we need to check whether there are any possibly-failed migrations before allowing any operations to enter. Otherwise we don't know whether a recovery of the shard version needs to be run.

Why is it a blocking requirement to remove the database i/o (other than maybe SERVER-49734)?

Comment by Spencer Brody (Inactive) [ 15/Jul/20 ]

This blocks SERVER-49532 and SERVER-49534, which clean up the ReplicaSetAwareService interface.
