[SERVER-39627] Audit uses of TransactionParticipant::SideTransactionBlock Created: 15/Feb/19  Updated: 17/Jun/19  Resolved: 17/Jun/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: Judah Schvimer
Resolution: Won't Fix Votes: 0
Labels: prepare_basic
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File patch    
Issue Links:
Backports
Related
related to SERVER-40466 Unify checks for inMultiDocumentTrans... Closed
is related to SERVER-36494 Prevent oplog truncation of oplog ent... Closed
Backport Requested:
v4.2
Sprint: Repl 2019-06-03, Repl 2019-06-17, Repl 2019-07-01
Participants:

 Description   

We use SideTransactionBlock to do things outside a transaction, while in a multi-document transaction. E.g., when OpObserverImpl writes the prepare oplog entry or (in SERVER-36494) when TransactionParticipant writes to another collection in the "local" DB.

It seems error-prone that a side transaction block only partly masks the state of the outer transaction. For example, state methods like TransactionParticipant::inMultiDocumentTransaction() reflect the outer transaction.

This can cause mistakes when recursing into the TransactionParticipant while doing an operation in a side transaction via

TransactionParticipant method -> OpObserverImpl method -> TransactionParticipant::addTransactionOperation

Audit uses of SideTransactionBlock and consider extending the SideTransactionBlock to more thoroughly mask evidence of the TransactionParticipant's outer transaction.



 Comments   
Comment by Siyuan Zhou [ 14/Jun/19 ]

Great analysis, judah.schvimer! I agreed the original concern by Jesse should be resolved using UnreplicatedWritesBlock, since missing UnreplicatedWritesBlock when it's needed sounds a more fundamental issue anyway. Doing nothing makes sense to me.

Comment by Andy Schwerin [ 14/Jun/19 ]

SGTM

Comment by Judah Schvimer [ 14/Jun/19 ]

I'm proposing to do nothing in this ticket. There is definitely still a chance of deadlock using the ACR improperly, but I think for now it makes more sense to fix those one by one. If someone wants to advocate for finishing the POC of having ACRs inherit lockers, I'm open to it, but do not want to invest the time right now to prove this is safe given siyuan.zhou's concern. I will close this ticket "Won't Fix" in a few days if I hear nothing else, though we can always reopen it.

Comment by Judah Schvimer [ 12/Jun/19 ]
  1. OplogSlotReserver
    1. Does not re-enter TransactionParticipant.
  2. profile
    1. Creating the system.profile collection if needed is in an UnreplicatedWritesBlock.
    2. Collection creation isn't allowed in transactions so there's nothing else that the OpObserver does on collection creation in a transaction that it does not do normally. This is however a place where we could easily introduce a bug when we add DDL op support in transactions. I don't have a good forward thinking way of preventing this.
    3. We do re-enter the TransactionParticipant for individual system.profile inserts. In OpObserverImpl::onInserts we check if we have a WUOW as a proxy for if we're doing an insert into system.profile in a transaction. We then choose not to add the operation to the transaction based on that proxy.
    4. Sharding also checks for chunks that intersect with inserts in transactions. This is done unnecessarily for system.profile inserts as a result of mistakenly thinking the insert is happening in a transaction, but it doesn't seem dangerous, just unnecessary work. These inserts cannot be part of a chunk migration.
  3. prepare oplog entry
    1. This actually expects to re-enter the participant in the transaction.
    2. Looking at all uses of the participant in OpObserverImpl, TP::onWriteOpCompletedOnPrimary() and TP::addTransactionOperation() are the only functions that change participant state.
      1. TP::onWriteOpCompletedOnPrimary(): This function needs to be called within a transaction and expects a STB.
      2. TP::addTransactionOperation(): This could be modified to prevent accidentally adding operations that occur in STBs to mongodb-transactions, but using UWBs when needed seems less complex, and this change doesn't feel appropriately motivated.

Thus I do not see any work to do now based on this audit.

Comment by Judah Schvimer [ 12/Jun/19 ]

The original problem this ticket was addressing is hinted at at the bottom of this cr on line 2148. An older version of SERVER-36494 tried to add a new usage of SideTransactionBlock (STB) that did a write that was intended to be unreplicated. At first the STB didn't have an UnreplicatedWritesBlock (UWB). When the OpObserver fired for the write in the STB, it attempted to add a new operation to the mongodb transaction that attempted to do the side-transaction write. Since it was not in an UWB, this succeeded and led to a bug where the unreplicated write was put into the transaction oplog entry. While a greater "masking" of transaction state in STB would have fixed this problem, an UWB also would have fixed the problem in a less invasive way. The write was unreplicated and so an UWB was appropriate regardless.

The goal of this ticket is to replace STB with ACR. To fully do this we would need to give ACR similar treatment to SERVER-40498 in terms of the alternative client inheriting the locker of the parent client. This is only necessary for the OplogSlotReserver. That said, as Tess points out, ACRs and STBs not sharing lockers is deadlock prone. Currently there are no known deadlocks in ACR usages, but the OplogSlotReserver would have a deadlock if SERVER-40498 didn't make STB share a locker with its parent.

This implies two separate changes:

  1. Make the ACR share its parent locker
    1. This could be done mostly on its own
  2. Replace STB with ACR
    1. This requires ACR to share a locker, or at least to create a special mechanism for the OplogSlotReserver to address SERVER-40498.

There are three usages of STB today:

  1. Writing the prepare oplog entry
    1. The OpObserver code, as can be seen in this POC currently makes many assumptions that it can reach back into the TransactionParticipant. One example that was not even addressed in the POC is for deciding to write to the config.transactions table. To complete the POC and replace this STB with an ACR, we would need to remove all places in the OpObserver that call back into the TransactionParticipant, which would be a non-trivial refactor.
    2. I don't currently see any further re-entrance problems here, certainly not of the variety this ticket was originally created for. If we do leave this as is, I will need to double check this.
  2. The OplogSlotReserver
    1. all this cares about is having a side-transaction that gets thrown away but needs to share a locker of the parent. STB seems like the right construct for this as it exists today.
  3. Profiling inside of a transaction
    1. Also not addressed in this POC, this currently relies on maintaining information about the parent client. profile() includes the client address, which an ACR would hide. Sharing a locker here is also important since this does take locks, sometimes database X locks if the system.profile collection needs to be created.
    2. Similarly to the prepare oplog entry, this needs slightly further audit for re-entrance problems, though I don't anticipate any.

It seems to me that basically all of the concerns are around making the ACR share its parent locker. Andy is concerned about currentOp, killOp, and killSession. Siyuan is concerned about the fundamental assumptions here. As Tess said though, sharing the locker is important in terms of preventing deadlocks. This seems like the only real motivating purpose of this change right now.

Siyuan's concern is valid and I personally do not know what could go wrong by breaking the 1:1 mapping from client:locker.

To address Andy's concern we'll go through the three concerning commands and how the two changes suggested would impact them:

  1. Replace STB with ACR
    1. currentOp
      1. For a short period of time during commands that get profiled or during prepareTransaction or commitTransaction, currentOp will start showing a different client similar to other ACRs, this is very short lived in all use cases and should be inconsequential.
      2. We could only do this if we're in a multi-document transaction if that were desired.
    2. killOp
      1. In the short period of time where currentOp shows different clients, the opId that users see and try to kill will be incorrect. It is unlikely users will try to kill an operation during this time since users generally try to kill long running operations.
    3. killSession
      1. This will kill the original OperationContext (and not the new one the ACR creates) since that's what checked out the session. It will abort any unprepared transactions on the session like it does already.
      2. The operation will not check for interrupt while in the ACR and so that will not be interrupted, but it may be interrupted once the ACR goes out of scope.
      3. This behavior is already the case in any of the ACR usages around the codebase today and will not become significantly more prevalent with this change.
  2. ACRs inheret lockers
    1. currentOp
      1. Lock stats will now be shared by the ACR and its parent. This is the behavior we actually likely want. Right now we're forgetting some of the locks a given operation acquires when we use an ACR and are effectively giving an inaccurate depiction of what the operation actually did.
    2. killOp
      1. This is unrelated to the locker.
    3. killSession
      1. This is unrelated to the locker.

The biggest change appears to be the lock stats becoming more accurate, which would be a win.

This ends up being a mixed bag. I'd love for some outside input on Siyuan's more general concern. In the meantime I will actually do the brief audit this ticket requested.

Comment by Siyuan Zhou [ 05/Jun/19 ]

I think the existing one to one mapping among Client -> OperationContext -> Locker is a fundamental assumption. SERVER-40498 didn't change this assumption and just let the locker opt out the WUOW as if the locks were acquired outside WUOW. Sharing lockers across clients will break this assumption. I'm wondering about geert.bosch's opinion on this.

Conceptually, it's not clear to me why side transaction has to mask all states of the running operation context as suggested by this ticket. I feel there are more states we want to keep than stash/mask. Locker is probably not the only. The statistics of the operation, session checkout and whether the operation should conflict with PBWM lock are also bound to operation context that I can think of off the top of my head. Client seems too high-level to isolate the resources here.

At the code level, when working on SERVER-40498, I felt the only resource we want to stash is the recovery unit / WT transaction since we may not want to or be able to use the existing recovery unit. However, since WUOW manages the lifetime of recovery unit and two-phase locking, we'd have to "release" the resources. Because of transactions, the lifetimes of recovery unit, locker, two-phase locking are not always the same any longer. If we could have RAII types to manage the lifetime of recovery unit and two-phase locking, separated from the locker itself, side transaction just needs to stash the first two. A WUOW will hold instances of those RAII types. 

Comment by Eric Milkie [ 05/Jun/19 ]

The code above I wrote for a section that intended to be a handler for an op being killed; the reason for using a new Client was to clear out the current killCode attached to the originalOpCtx. If one doesn't do this, no further db locking is possible in an abort handler since the lock would check the kill code and immediately rethrow an Interrupted exception.
So the answer to Andy's question is, this was originally a special case where killOp was being handled with special code. You should be aware of this if you want to use this technique in other places.

Comment by Judah Schvimer [ 05/Jun/19 ]

milkie may have thought about that. I need to do further investigation and testing to find out.

Comment by Andy Schwerin [ 05/Jun/19 ]

What does that do for currentOp, killOp and killSessions?

Comment by Judah Schvimer [ 05/Jun/19 ]

milkie provided the following which works for a given ACR (though doesn't work in the change from SERVER-40498 to opt out of two phase locking for the locks already held):

+    // Make a new operation context in order to clear the kill code, if one is present.
+    auto currentLocks = originalOpCtx->swapLockState(stdx::make_unique<LockerImpl>());
+    auto cleanupClient = originalOpCtx->getServiceContext()->makeClient("Index build cleanup");
+    AlternativeClientRegion acr(cleanupClient);
+    // After this point, cleanupClient is held inside acr and is not accessible until acr falls out
+    // of scope.  The new Alternative Client can only be accessed via cc().
+    auto cleanupOpCtx = cc().makeOperationContext();
+    auto opCtx = cleanupOpCtx.get();
+    opCtx->swapLockState(std::move(currentLocks));
...
+    ON_BLOCK_EXIT([&] {
+        auto currentLocks = opCtx->swapLockState(stdx::make_unique<LockerImpl>());
+        originalOpCtx->swapLockState(std::move(currentLocks));
+    });

Comment by Judah Schvimer [ 31/May/19 ]

I don't think a PoC would be too hard. I'll do it next iteration. I knew this ticket might be bigger than originally expected.

Comment by Andy Schwerin [ 30/May/19 ]

I'm definitely interested in the PoC. We'd need to think about how current-op reporting would look. All the work in the ACR gets assigned to a different client and op context, which might affect metrics.

Comment by Tess Avitabile (Inactive) [ 30/May/19 ]

On the other hand, having an ACR that does not share the locker is prone to self-deadlock. 

Comment by Andy Schwerin [ 30/May/19 ]

I'd look at a PoC, but having an ACR where the alternate client shares OpCtx state (the locker) with the principal client's OpCtx seems error-prone, too. How hard to poc, judah.schvimer?

Comment by Judah Schvimer [ 29/May/19 ]

One concern with the above plan is that it somewhat is a layering violation between the Locker and the Client. The Client would have to release and restore the OperationContext's WUOW similar to in SERVER-40498 without unlocking the locks. This should be straightforward and is likely what users of AlternativeClientRegion want. I suspect this would link just fine since client.cpp and operation_context.cpp are both in the service_context library. Given that this prevents complicated to diagnose/fix deadlocks, I think it is worthwhile.

siyuan.zhou, schwerin, geert.bosch, do any of you have any thoughts on this? I can create a POC if that seems helpful.

Comment by Tess Avitabile (Inactive) [ 29/May/19 ]

judah.schvimer and I discussed the following strategy for SideTransactionBlock. We could get rid of SideTransactionBlock entirely and use AlternativeClientRegion everywhere instead. This would have the advantage that the readConcern on the OperationContext and the TransactionParticipant would both be fresh, so code that checks these constructs does not need to consider whether we are in a SideTransactionBlock.

Due to SERVER-40498, this means AlternativeClientRegion would need to share the locker of the parent OperationContext. This would ensure that the thread does not deadlock with itself, and that the max lock timeout for transactions is respected (since the max lock timeout for transactions should apply as long as the thread can block the transaction's progress). SideTransactionBlock has both of these properties today.

As part of this work, we could eliminate checks of OperationContext::getWriteUnitOfWork(), since this should now return the same result as TransactionParticipant::inMultiDocumentTransaction(). If this leads to linking issues, we could consider adding an OperationContext flag/decoration that indicates whether the operation is a statement of a multi-document transaction (i.e. it has autocommit:false). This might be preferable anyway, since most users probably expect this to be the semantics of TransactionParticipant::inMultiDocumentTransaction(), when in fact that function returns false once the transaction is aborted/committed.

Comment by A. Jesse Jiryu Davis [ 22/Feb/19 ]

I attached a patch atop git version #2c65bbe9, it's a sketch of one direction we could go to fix the reentrance problem.

Comment by Judah Schvimer [ 15/Feb/19 ]

blake.oler, I could imagine your work looking at the transaction statements at commit time being impacted by this. The prepare oplog entry or the config.transactions write may be added to the operations list in the prepare oplog entry SideTransactionBlock. I have not confirmed this, however.

Generated at Thu Feb 08 04:52:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.