[SERVER-56251] Alleviate problems that arise when OperationContext::markKilled is called with a non-Interruption error Created: 21/Apr/21 Updated: 30/Jan/24 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Saltz (Inactive) | Assignee: | Backlog - Service Architecture |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | needs-epic, sa-remove-fv-backlog-22, techdebt | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Assigned Teams: |
Service Arch
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Sprint: | Service Arch 2022-06-13, Service Arch 2022-06-27, Service Arch 2022-07-11 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Linked BF Score: | 7 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
Some places in the code rely on catching an exception in ErrorCategory::Interruption to check whether an OperationContext has been interrupted. This is a problem if any callsites ever call OperationContext::markKilled with an error code that isn't in this error category, but there's not currently anything preventing that from happening. Technically, using the Interruption category to check for OperationContext interrupt is also error prone since it's possible that other things could throw an Interruption error, so call sites which need to check for interrupt should probably catch all exceptions and then actually check the OperationContext to see if it's been interrupted. This ticket should either:
— We determined that the appropriate resolution is to remove the Interruption category. The reason being Interruption has built up a lot of cruft and has lost meaning over time. Currently, several bugs have been logged to address sections of the code base that use this category. Once those are complete, the actual removal of the category should be trivial. |
| Comments |
| Comment by Gregory Noma [ 12/Sep/22 ] |
|
louis.williams@mongodb.com in general I think that scenario isn't an issue because that issue requires setting the global kill flag. |
| Comment by Andy Schwerin [ 12/Sep/22 ] |
|
I think |
| Comment by Louis Williams [ 12/Sep/22 ] |
|
Even if we remove the error category, what prevents someone from writing code that throws an InterruptedAtShutdown error without marking the OpCtx as killed and callers treating this as an error? See |
| Comment by Gregory Noma [ 02/Sep/22 ] |
|
matt.diener@mongodb.com I just merged in |
| Comment by Matt Diener (Inactive) [ 28/Jun/22 ] |
|
There are several pieces of code that pass a non-`Interruption` error into `markKilled`. We could expand the meaning of `Interruption`, but that would further erode its usefulness. Alternatively, we can visit all call sites of `markKilled` which currently do not pass an `Interruption` error and swap out the error code for something that fits the category. Each of these instances present the risk of breaking some code that uses the specific error code, and each change would require investigation to ensure we're not breaking anything. I am going as far as to argue that those investigations would be more expensive than the investigations associated with removing code that references `ErrorCategory::Interruption`. The first is an investigation into hypothetical places where we check an error code, where the later is an investigation into specific pieces of code that are very likely to be simultaneously encountering false positive and false negative matches of specifically `ErrorCategory::Interruption`. If it is true that the ErrorCategory is being used as an indicator that the opCtx is killed (it seems to be the case from my investigation), those pieces of code need to be fixed anyway. The opCtx can be killed without encountering an Interruption error (even if we expand its definition), and an Interruption error can be observed in cases where the opCtx is not killed. We're making the decision to follow Andy's suggestion and kill the ErrorCategory altogether. I am ticketing out work to remove references. |
| Comment by Andy Schwerin [ 31/Mar/22 ] |
|
I'd like to go a step further, and propose that we remove the error category entirely, to avoid future confusion. |
| Comment by Andy Schwerin [ 16/Nov/21 ] |
|
I'm just seeing this ticket now. I think the right approach is the second one described in the description:
Looking at the error codes labeled with the Interruption error category, I don't see a fundamental relationship between them. They just seem to be codes that someone decided they would call `markKilled` with in some circumstances. |
| Comment by Max Hirschhorn [ 14/Oct/21 ] |
|
Wanted to note that the lack of clarity around the valid inputs to OperationContext::markKilled() has been the source of a production crash tracked by |
| Comment by Matthew Saltz (Inactive) [ 28/Apr/21 ] |
|
The reason this ticket came up was that a newly added component was doing markKilled with CallbackCanceled and it caused something in a random place to not catch an exception because it was only expecting Interruption-category codes. So as a quick fix we changed it to call markKilled with "Interrupted" but that's less than ideal, hence these suggestions |
| Comment by Louis Williams [ 26/Apr/21 ] |
|
I don't think your first suggestion will be possible until we fix some of the index building interrupt behavior. Your second option sounds fine, but it may not be necessary. I have a theory that any recipients of non-interruption error codes are probably already doing this, otherwise, they might not work correctly. This is just based on my observations about the index building code. |
| Comment by Matthew Saltz (Inactive) [ 22/Apr/21 ] |
|
Thanks for that info. I guess another option for the first bullet point would also be to add any error we use to kill an OperationContext to the Interruption error category - but that seems like it could have other knock-on side effects. Based on your input it sounds like we should go with the second approach of fixing call sites that use Interruption errors as a way to find out whether the OperationContext was killed. In which case this ticket could possibly turn into doing an audit of those sites and creating new tickets for each case, depending on how many there are and how difficult it seems to be to update the logic. (I think there are around 10-20 based on a grep I did a few weeks ago.) What do you think? |
| Comment by Louis Williams [ 22/Apr/21 ] |
|
For better or for worse, we use the IndexBuildAborted error code to internally abort index builds. This allows us to both interrupt the index builder thread and communicate that it was killed by a caller that is taking responsibility for cleaning up its state. As a result, it skips certain cleanup tasks. It would be better to communicate that information through other means, but that is where the code is currently. |