Core Server / SERVER-47130

Reduce w:majority commits from 4 to 2 in 2PC

    • Cluster Scalability

As of 4.2, our 2PC implementation requires four sequential w:majority commits. This is usually the dominant portion of commit latency.

(Below, "persist" means a w:majority commit.)

1. The transaction coordinator shard persists information about which other participants are part of this transaction.
2. All participants persist the prepare phase of their part of the transaction.
3. The transaction coordinator persists the commit or abort decision.
4. All participants persist the commit or abort as directed by the coordinator.
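The four rounds above can be sketched as follows. This is a minimal illustrative model, not server code: `Shard` and the record contents are hypothetical, and each `persist()` stands in for one w:majority commit, with the parallel participant persists in steps 2 and 4 counted as a single round of latency.

```python
class Shard:
    """Hypothetical stand-in for a replica set / shard."""
    def __init__(self, name):
        self.name = name
        self.durable = []  # records that have reached w:majority

    def persist(self, record):
        self.durable.append(record)


def two_phase_commit_4x(coordinator, participants):
    """Current (4.2) flow: returns the number of sequential
    w:majority commit rounds on the commit path."""
    majority_commits = 0

    # 1. Coordinator persists the participant list.
    coordinator.persist({"participants": [p.name for p in participants]})
    majority_commits += 1

    # 2. All participants persist their prepare state (in parallel,
    #    so this is one round of majority-commit latency).
    for p in participants:
        p.persist({"state": "prepared"})
    majority_commits += 1

    # 3. Coordinator persists the decision.
    coordinator.persist({"decision": "commit"})
    majority_commits += 1

    # 4. All participants persist the decision (again in parallel).
    for p in participants:
        p.persist({"state": "committed"})
    majority_commits += 1

    return majority_commits
```

Running it with the coordinator as one of the participants shows the four sequential rounds the ticket wants to cut down to two.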

      Step 1 can be avoided by folding it into step 2:

As part of the prepare message, the coordinator embeds the list of participants and its own identity. (The coordinator is one of the participants.) Each participant persists this information together with the prepare state of the transaction itself.
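A sketch of the folded step, with hypothetical message and state shapes: the prepare message carries the topology, so the participant records it in the same w:majority commit as its prepare state, and step 1's separate persist disappears.

```python
def prepare_message(coordinator_name, participant_names, txn_id):
    """Prepare message that also carries the 2PC topology
    (illustrative field names, not the actual wire format)."""
    return {
        "txn": txn_id,
        "phase": "prepare",
        "coordinator": coordinator_name,  # the coordinator is itself a participant
        "participants": participant_names,
    }


def on_prepare(shard_state, msg):
    """Participant side: one durable write covers both the prepare
    state and the participant list / coordinator identity."""
    shard_state["durable"] = {
        "phase": "prepared",
        "coordinator": msg["coordinator"],
        "participants": msg["participants"],
    }
```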

      Recovery in case of failure:

2a) Remember that the original coordinator is itself a replica set / shard. Assuming the original coordinator had succeeded in persisting its prepare message, and it succeeds in electing a new primary and reconnecting to the rest of the cluster, the new primary can simply resume the role of coordinator and resend the prepare message to all participants. (In this scenario the coordinator continues acting as coordinator until the end of the 2PC protocol.)
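Recovery 2a can be sketched like this, assuming the durable record is what the old primary persisted alongside its own prepare (function and record shapes are hypothetical):

```python
def resume_coordinator(durable_record, send):
    """New primary of the coordinator shard after failover.
    `send(participant_name, message)` delivers one message."""
    if durable_record is None:
        # Nothing reached w:majority before the failure:
        # recovery case 2b applies instead.
        return False
    # The topology survived, so simply resend prepare to everyone
    # and carry on as coordinator.
    for name in durable_record["participants"]:
        send(name, {
            "phase": "prepare",
            "coordinator": durable_record["coordinator"],
            "participants": durable_record["participants"],
        })
    return True
```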

2b) If the original coordinator fails before it has persisted its own prepare message, it has lost the state necessary to continue as coordinator. (In fact, it doesn't even know it is a participant in this 2PC transaction!) Since at least one participant has failed the prepare phase and lost its transaction state, the only possible outcome is that the transaction must be aborted. The other participants learn about this as follows: after a timeout, each polls all other participants (with readConcern=majority) for their state, and if at least one other participant has lost the transaction, it must abort too. (It is also possible that a participant enters this state and learns that at least one participant has committed the transaction; in that case it can commit its transaction as well.)
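The decision rule a timed-out participant applies after polling its peers can be condensed into a few lines. This is a sketch under the assumptions above; `None` models a peer that has lost (never durably persisted) the transaction, and the real system would read peer state with readConcern=majority.

```python
def decide_after_timeout(peer_states):
    """Decide the fate of a prepared transaction after a timeout.

    peer_states: the transaction state reported by each *other*
    participant; None means that peer has lost the transaction.
    """
    if any(s == "committed" for s in peer_states):
        # Someone already committed, so the decision was commit:
        # this participant must commit too.
        return "commit"
    if any(s is None for s in peer_states):
        # A participant lost its prepare state, so a unanimous
        # prepare is impossible: the only safe outcome is abort.
        return "abort"
    # Everyone is still prepared: keep waiting for the coordinator.
    return "wait"
```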

2c) If none of the participants has received or persisted the prepare message, the transaction will eventually time out on each of them.

In step 3 the coordinator first persists the abort or commit decision for itself. (Same as the current behavior.) As per SERVER-37364, it can then return the result of the transaction to the client.

      In step 4 the coordinator sends abort/commit to all participants. In case of any failure, the coordinator can just keep resending the message until all participants have acknowledged it.
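The step-4 retry can be sketched as a simple loop: because the decision is already durable, resending it is always safe, and the coordinator only needs to track which participants have not yet acknowledged. The function and callback names here are illustrative.

```python
def broadcast_decision(decision, participants, send_and_ack, max_rounds=100):
    """Resend `decision` until every participant acknowledges it.

    `send_and_ack(participant, decision)` returns True on ack, False if
    the message was lost or the participant was unreachable. Returns the
    set of participants that never acknowledged (empty on success)."""
    pending = set(participants)
    rounds = 0
    while pending and rounds < max_rounds:
        # Resending to a participant that already applied the decision
        # is harmless: the operation is idempotent.
        acked = {p for p in pending if send_and_ack(p, decision)}
        pending -= acked
        rounds += 1
    return pending
```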

      Backlog: Cluster Scalability
      Reporter: Henrik Ingo (Inactive)