[SERVER-39949] Replace uses of UninterruptibleLockGuard in TransactionParticipant Created: 04/Mar/19  Updated: 06/Dec/22  Resolved: 17/May/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Judah Schvimer Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 0
Labels: prepare_errors, prepare_optional
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-39643 Replace UninterruptibleLockGuard with... Closed
Related
is related to SERVER-38190 killOp while committing a prepared tr... Closed
is related to SERVER-40051 Make committingWithoutPrepare state u... Closed
Assigned Teams:
Replication
Participants:

 Description   

We've added UninterruptibleLockGuards in certain parts of the TransactionParticipant to prevent interruptibility. UninterruptibleLockGuards don't stop all types of interruptions, just ones in locks. This should be done after SERVER-39643 based on the outcome of that discussion.



 Comments   
Comment by Judah Schvimer [ 07/Mar/19 ]

I moved it per your suggestion. I am concerned that we could add another interrupt point and forget to change this. That said, it would lead to a crash, not data corruption, so we can fix it then. Moved it to "prepare_optional".

Comment by Siyuan Zhou [ 07/Mar/19 ]

judah.schvimer, given Dan and Eric's comments above, I think it's correct to use UninterruptibleLockGuard since the only thing that can call checkForInterrupt() for the above two cases is lock acquisition. The case in SERVER-39643 is different since we need to check out a session there, which isn't covered by UninterruptibleLockGuard. It might be a good code cleanup to use runWithoutInterruptionExceptAtGlobalShutdown instead, but I don't think it's urgent. I'd suggest putting this ticket to backlog.

Comment by Eric Milkie [ 06/Mar/19 ]

Yes; on primaries, we take steps to make sure that assumption is true even on one-node replica sets by setting the stable optime carefully (see ReplicationCoordinatorImpl::_chooseStableOpTimeFromCandidates()). It's always true on secondaries because the logic for doing writes is different there.

Comment by Daniel Gottlieb (Inactive) [ 06/Mar/19 ]

A node cannot checkpoint in front of the all_committed (i.e: the stable_timestamp cannot be in front of the all_committed). Making that true in all configurations (single voting primaries, EMRC=off) isn't trivial, but fundamentally, that invariant must hold or else replication recovery wouldn't start from the correct point in the oplog and could miss entries earlier than the "checkpoint timestamp".

Comment by Judah Schvimer [ 06/Mar/19 ]

it will be safe if a node can never take a checkpoint beyond all_committed.

daniel.gottlieb or milkie, this sounds like a fair assumption, do you agree? We only take checkpoints at stable_timestamps which are always behind the all_committed by definition, so I think we're good.

Comment by Siyuan Zhou [ 06/Mar/19 ]

judah.schvimer pointed out the only places that check for interrupts in those two cases are all about lock acquisition, so I think it's at least correct to keep using UninterruptibleLockGuards for now. When we want to make shutdown not blocked by these two oplog writes in the future, we could use runWithoutInterruptionExceptAtGlobalShutdown instead and allow exceptions thrown. They will be caught higher in the stack to return an error status to the client. The data should be consistent anyway as Judah mentioned above.

Comment by Siyuan Zhou [ 06/Mar/19 ]

The one concern is taking a checkpoint that includes the commit in the data without the commit oplog entry having been written. The "Oplog Durability" project should make this safe because the commit will not be in the data if the commit oplog entry hasn't been written.

This will be a problem anyway, with or without shutdown. Do we have an idea to avoid that for steady state when we don't have to pin stable timestamp to the oldest active transaction timestamp? I'm not sure whether the "Oplog Durability" project will save us here because it only guarantees the durable timestamps for both data write and oplog write are the same. If a checkpoint is taken in between, it can still only include the data without the oplog write, unless the stable timestamp is aware of oplog holes. For example, it will be safe if a node can never take a checkpoint beyond all_committed.

Comment by Judah Schvimer [ 06/Mar/19 ]

There are two UninterruptibleLockGuards in the TransactionParticipant. Both are uninterruptible because an interruption would be caught in a try...catch and cause a std::terminate(). I don't think it is safe for these places to be interrupted even at shutdown. If it needs to be safe, then we have to replace the std::terminate() in the catch with something that will leave the node in the correct state.

Marking the opCtx as opCtx->setIsExecutingShutdown seems like the wrong thing to do here, though. I think we need to use runWithoutInterruptionExceptAtGlobalShutdown() as siyuan.zhou says, but catch interrupt exceptions (which will always be due to shutdown in this case).

The two places are:

  1. commitPreparedTransaction: We write the commit oplog entry after committing the storage transaction. If we shutdown too early, we will recover with the transaction in prepare. The one concern is taking a checkpoint that includes the commit in the data without the commit oplog entry having been written. The "Oplog Durability" project should make this safe because the commit will not be in the data if the commit oplog entry hasn't been written.
  2. prepareTransaction abortGuard: If we fail to prepare the transaction but have written a prepare oplog entry, then on restart we will put the transaction back into prepare. if that fails, the abortGuard will fire outside of shutdown uninterruptibly.
Comment by Siyuan Zhou [ 05/Mar/19 ]

After SERVER-39427, using runWithoutInterruptionExceptAtGlobalShutdown() seems more reasonable here, it will stop all interruptions except shutdown.

Generated at Thu Feb 08 04:53:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.