[SERVER-47130] Reduce w:majority commits from 4 to 2 in 2 PC Created: 26/Mar/20  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Henrik Ingo (Inactive) Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 0
Labels: ShardedTxn:FutureOptimizations
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-63971 Change server parameter to default to... Closed
is related to SERVER-37364 Coordinator should return the decisio... Closed
Assigned Teams:
Cluster Scalability
Participants:

Description

As of 4.2 our 2PC implementation requires four w:majority commits per distributed transaction. These are usually the dominating portion of commit latency.

(Below, "persist" means a w:majority commit.)

1. The transaction coordinator shard persists the list of participants in the transaction.
2. All participants persist the prepared state of their part of the transaction.
3. The transaction coordinator persists the commit or abort decision.
4. All participants persist the commit or abort decision, as directed by the coordinator.
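
To make the four majority waits explicit, here is a minimal Python-style sketch of the current flow. The persist_majority, send_prepare, and send_decision helpers are hypothetical stand-ins, each blocking until a w:majority write completes; this is not the actual server implementation:

    def commit_two_phase(coordinator, participants, txn):
        # (1) coordinator durably records the participant list
        coordinator.persist_majority({"txn": txn.id, "participants": participants})

        # (2) every participant majority-commits its prepared state
        votes = [p.send_prepare(txn) for p in participants]

        decision = "commit" if all(v == "ok" for v in votes) else "abort"

        # (3) coordinator durably records the decision
        coordinator.persist_majority({"txn": txn.id, "decision": decision})

        # (4) every participant majority-commits the decision
        for p in participants:
            p.send_decision(txn, decision)
        return decision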

Step 1 can be avoided by folding it into step 2:

As part of the prepare message, the coordinator embeds the list of participants and the identity of the coordinator. (The coordinator is itself one of the participants.) Participants persist this information in the same w:majority write as the prepared state of the transaction itself.
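
For illustration only, the folded prepare message might carry the metadata like this (field names are hypothetical, not the server's actual wire format):

    # Hypothetical folded prepare message; field names are illustrative only.
    prepare_msg = {
        "prepareTransaction": 1,
        "txnNumber": 7,
        "coordinator": "shard0",                        # the coordinator is itself a participant
        "participants": ["shard0", "shard1", "shard2"],
    }
    # Each participant persists this metadata in the same w:majority write as
    # its own prepared state, collapsing steps 1 and 2 into a single commit.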

Recovery in case of failure:

2a) Remember that the original coordinator is itself a replica set / shard. Assuming the original coordinator succeeded in persisting its prepared state (including the participant list), and succeeds in electing a new primary and reconnecting to the rest of the cluster, the new primary can simply resume the role of coordinator and resend the prepare message to all participants. (In this scenario the same shard continues acting as the coordinator until the end of the 2PC protocol.)

2b) If the original coordinator fails before it has persisted its own prepared state, it will have lost the state necessary to continue as coordinator. (In fact, it doesn't even know it is a participant in this 2PC transaction!) Since at least one participant has failed the prepare phase and lost transaction state, the only possible outcome is that the transaction must be aborted. The other participants can learn this as follows: after a timeout, each polls all other participants (with readConcern=majority) for their state, and if at least one other participant has lost the transaction, it must abort too. (It's also possible that a participant enters this state and learns that at least one participant has committed the transaction; in that case it can commit its own transaction as well. See the sketch after 2c.)

2c) If none of the participants have received or persisted the prepare message, then the transaction will eventually time out on each of them.
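
Here is a sketch of the 2b recovery rule, assuming a hypothetical read_txn_state(peer, txn_id) helper that reads a peer's transaction state with readConcern=majority and returns None if the peer has no record of the transaction:

    def recover_after_timeout(self_shard, txn):
        for peer in txn.participants:
            if peer == self_shard:
                continue
            state = read_txn_state(peer, txn.id)  # readConcern=majority
            if state == "committed":
                return "commit"   # a commit happened, so prepare was durable everywhere
            if state is None:
                return "abort"    # a peer lost the transaction; it can never commit
        return "wait"             # all peers still prepared; retry after another timeout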

In step 3 the coordinator first persists the abort or commit decision for itself. (Same as current behavior.) Per SERVER-37364, it can now return the result of the transaction to the client.

In step 4 the coordinator sends the abort/commit decision to all participants. In case of any failure, the coordinator can simply keep resending the message until all participants have acknowledged it.
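
A sketch of this retry loop, assuming a hypothetical send_decision(participant, txn, decision) that returns True once the participant has majority-committed the decision (resending is safe because the decision message is idempotent):

    import time

    def broadcast_decision(participants, txn, decision):
        pending = set(participants)
        while pending:
            for p in list(pending):
                if send_decision(p, txn, decision):  # idempotent; safe to resend
                    pending.discard(p)
            if pending:
                time.sleep(1)  # back off, then retry the unacknowledged participants
        # only now may the coordinator forget the transaction's 2PC state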



Comments
Comment by Andy Schwerin [ 01/May/23 ]

Note that we stepped back from SERVER-37364 in SERVER-63971, because users preferred to get external read-your-writes behavior. Finding a read-your-writes solution here that still cuts out majority writes would be interesting.

Comment by Mira Carey [ 27/Mar/20 ]

This is a great suggestion, and we're interested in pursuing something like this in the future.

At the moment, we're de-emphasizing work that provides constant-factor performance improvements in sharding, in order to focus on correctness and feature gaps. Given that this change would only benefit users running distributed transactions under performance constraints (and wouldn't get us all the way to letting users sanely run distributed transactions on their hot path), we'll have to revisit it in the future.

Thanks very much for the write-up.
