[SERVER-56251] Alleviate problems that arise when OperationContext::markKilled is called with a non-Interruption error Created: 21/Apr/21  Updated: 30/Jan/24

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthew Saltz (Inactive) Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 0
Labels: needs-epic, sa-remove-fv-backlog-22, techdebt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-67606 Stop using ErrorCategory::Interruptio... Closed
depends on SERVER-67611 Stop using ErrorCategory::Interruptio... Closed
depends on SERVER-67615 Stop using ErrorCategory::Interruptio... Closed
depends on SERVER-67617 Stop using ErrorCategory::Interruptio... Closed
depends on SERVER-67618 Stop using ErrorCategory::Interruptio... Closed
Related
related to SERVER-55379 Invariant failure _requests.empty() a... Closed
related to SERVER-60685 TransactionCoordinator may interrupt ... Closed
related to SERVER-68808 Catch CancellationError in ReplClient... Closed
related to SERVER-85912 Audit code paths explicitly checking ... Backlog
is related to SERVER-70010 Stop using getKillStatus to check for... Closed
is related to SERVER-55323 Integrate CancelableOperationContext ... Closed
is related to SERVER-78722 CancelableOperationContext::cancel() ... Open
Assigned Teams:
Service Arch
Sprint: Service Arch 2022-06-13, Service Arch 2022-06-27, Service Arch 2022-07-11
Participants:
Linked BF Score: 7

 Description   

Some places in the code rely on catching an exception in ErrorCategory::Interruption to check whether an OperationContext has been interrupted. This is a problem if any callsites ever call OperationContext::markKilled with an error code that isn't in this error category, but there's not currently anything preventing that from happening. Technically, using the Interruption category to check for OperationContext interrupt is also error prone since it's possible that other things could throw an Interruption error, so call sites which need to check for interrupt should probably catch all exceptions and then actually check the OperationContext to see if it's been interrupted.

This ticket should either:

  • Add an invariant to markKilled to make sure all error codes passed to it belong to ErrorCategory::Interruption, and fix the broken call sites, or
  • Fix all places we catch Interruption errors and rely on that to assume the OperationContext has been interrupted to catch all DBExceptions and check the OperationContext itself for interrupt

We determined that the appropriate resolution is to remove the Interruption category. The reason being Interruption has built up a lot of cruft and has lost meaning over time. Currently, several bugs have been logged to address sections of the code base that use this category. Once those are complete, the actual removal of the category should be trivial.



 Comments   
Comment by Gregory Noma [ 12/Sep/22 ]

louis.williams@mongodb.com in general I think that scenario isn't an issue because that issue requires setting the global kill flag.

Comment by Andy Schwerin [ 12/Sep/22 ]

I think SERVER-69496 has a local problem that we would have caught at code
review today.

Comment by Louis Williams [ 12/Sep/22 ]

Even if we remove the error category, what prevents someone from writing code that throws an InterruptedAtShutdown error without marking the OpCtx as killed and callers treating this as an error? See SERVER-69496.

Comment by Gregory Noma [ 02/Sep/22 ]

matt.diener@mongodb.com I just merged in SERVER-67611 which is the last ticket currently listed as a dependency. However, I see that there are still several usages of ErrorCodes::isInterruption which I imagine will also need to be removed prior to this ticket.

Comment by Matt Diener (Inactive) [ 28/Jun/22 ]

There are several pieces of code that pass a non-`Interruption` error into `markKilled`. We could expand the meaning of `Interruption`, but that would further erode its usefulness. Alternatively, we can visit all call sites of `markKilled` which currently do not pass an `Interruption` error and swap out the error code for something that fits the category. Each of these instances present the risk of breaking some code that uses the specific error code, and each change would require investigation to ensure we're not breaking anything.

I am going as far as to argue that those investigations would be more expensive than the investigations associated with removing code that references `ErrorCategory::Interruption`. The first is an investigation into hypothetical places where we check an error code, where the later is an investigation into specific pieces of code that are very likely to be simultaneously encountering false positive and false negative matches of specifically `ErrorCategory::Interruption`.

If it is true that the ErrorCategory is being used as an indicator that the opCtx is killed (it seems to be the case from my investigation), those pieces of code need to be fixed anyway. The opCtx can be killed without encountering an Interruption error (even if we expand its definition), and an Interruption error can be observed in cases where the opCtx is not killed.

We're making the decision to follow Andy's suggestion and kill the ErrorCategory altogether. I am ticketing out work to remove references.

Comment by Andy Schwerin [ 31/Mar/22 ]

I'd like to go a step further, and propose that we remove the error category entirely, to avoid future confusion.

Comment by Andy Schwerin [ 16/Nov/21 ]

I'm just seeing this ticket now. I think the right approach is the second one described in the description:

Fix all places we catch Interruption errors and rely on that to assume the OperationContext has been interrupted to catch all DBExceptions and check the OperationContext itself for interrupt

Looking at the error codes labeled with the Interruption error category, I don't see a fundamental relationship between them. They just seem to be codes that someone decided they would call `markKilled` with in some circumstances.

Comment by Max Hirschhorn [ 14/Oct/21 ]

Wanted to note that the lack of clarity around the valid inputs to OperationContext::markKilled() has been the source of a production crash tracked by SERVER-60685.

Comment by Matthew Saltz (Inactive) [ 28/Apr/21 ]

The reason this ticket came up was that a newly added component was doing markKilled with CallbackCanceled and it caused something in a random place to not catch an exception because it was only expecting Interruption-category codes. So as a quick fix we changed it to call markKilled with "Interrupted" but that's less than ideal, hence these suggestions

Comment by Louis Williams [ 26/Apr/21 ]

I don't think your first suggestion will be possible until we fix some of the index building interrupt behavior. Your second option sounds fine, but it may not be necessary. I have a theory that any recipients of non-interruption error codes are probably already doing this, otherwise, they might not work correctly. This is just based on my observations about the index building code.

Comment by Matthew Saltz (Inactive) [ 22/Apr/21 ]

Thanks for that info. I guess another option for the first bullet point would also be to add any error we use to kill an OperationContext to the Interruption error category - but that seems like it could have other knock-on side effects.

Based on your input it sounds like we should go with the second approach of fixing call sites that use Interruption errors as a way to find out whether the OperationContext was killed. In which case this ticket could possibly turn into doing an audit of those sites and creating new tickets for each case, depending on how many there are and how difficult it seems to be to update the logic. (I think there are around 10-20 based on a grep I did a few weeks ago.) What do you think?

Comment by Louis Williams [ 22/Apr/21 ]

For better or for worse, we use the IndexBuildAborted error code to internally abort index builds.

This allows us to both interrupt the index builder thread and communicate that it was killed by a caller that is taking responsibility for cleaning up its state. As a result, it skips certain cleanup tasks. It would be better to communicate that information through other means, but that is where the code is currently.

Generated at Thu Feb 08 05:38:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.