[SERVER-67615] Stop using ErrorCategory::Interruption in Query codebase Created: 28/Jun/22  Updated: 29/Oct/23  Resolved: 26/Jul/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Matt Diener (Inactive) Assignee: Mindaugas Malinauskas
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-56251 Alleviate problems that arise when Op... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: QE 2022-07-25, QE 2022-08-08
Participants:

 Description   

Context

`ErrorCategory::Interruption` has had its definition change over time. When investigating SERVER-56251, we determined that some parts of our codebase use `ErrorCategory::Interruption` as an indicator that the opCtx was killed, while other pieces of code simply use it to indicate that something was interrupted in a specific API.

Over time, bugs have been fixed by expanding this error category to include more errors, which has eroded the category's usefulness.

Pieces of code which catch this error category as an indication that the opCtx was killed are unfortunately incorrect because:

  1. An `Interruption` error can be observed without a killed opCtx (these errors are generic).
  2. A non-`Interruption` error can be observed when the opCtx is killed. The opCtx never guaranteed that a kill surfaces as an `Interruption` error, and investigations show it cannot make that guarantee without expanding the `Interruption` category even further.
  3. Even if the opCtx always threw an `Interruption`-category error, there is nothing stopping some other in-between layer from catching that exception and throwing something else.
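
The three cases above can be illustrated with a small, self-contained sketch. Everything in it (`ToyError`, `ToyOpCtx`, `Category`, and the helper functions) is a hypothetical stand-in invented for illustration, not the real server types; it only shows why inferring "the opCtx was killed" from an error category gives the wrong answer in both directions.

```cpp
#include <stdexcept>

// Hypothetical stand-ins for illustration only; NOT the real MongoDB types.
enum class Category { kInterruption, kOther };

struct ToyError : std::runtime_error {
    ToyError(Category c, const char* what) : std::runtime_error(what), category(c) {}
    Category category;
};

struct ToyOpCtx {
    bool killed = false;
};

// The brittle pattern: infer "opCtx was killed" from the error category.
inline bool guessKilledFromCategory(const ToyError& e) {
    return e.category == Category::kInterruption;
}

// Case 1: some API reports a generic interruption while the opCtx is alive.
inline ToyError apiInterrupted() {
    return ToyError(Category::kInterruption, "queue closed");
}

// Cases 2 and 3: the opCtx is killed, but the error that reaches the caller
// belongs to another category (raised that way, or translated by a layer
// in between).
inline ToyError translatedAfterKill(ToyOpCtx& opCtx) {
    opCtx.killed = true;
    return ToyError(Category::kOther, "operation failed");
}
```

In this sketch the category-based guess reports a kill for `apiInterrupted()` even though no opCtx was killed, and reports no kill for `translatedAfterKill()` even though the opCtx was killed.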

Pieces of code which catch this error category as an indication of their API being interrupted are possibly incorrect because the definition of this error category has changed over time.

The end goal of this and related bugs is to eliminate all uses of `ErrorCategory::Interruption` and then remove the category altogether.

 

Acceptance criteria

Remove references to `ErrorCategory::Interruption` in the files specified below. Understand the intention of the existing usage and use your judgement to re-implement that logic in a more robust way.

 

Files in question

It's possible not all of these are owned by your team. Please reach out to matt.diener@mongodb.com if we should re-assign a subset of this work elsewhere.

src/mongo/db/pipeline:
1) change_stream_expired_pre_image_remover.cpp, performExpiredChangeStreamPreImagesRemovalPass

Solution(s)

Here are some potential fixes:

1) If we are catching the exception and assuming the opCtx was killed, we should instead catch ALL exceptions and check the opCtx directly:

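catch (const DBException& e) {
    ...
    // getKillStatus() returns an ErrorCodes::Error, so compare against ErrorCodes::OK.
    if (opCtx->getKillStatus() != ErrorCodes::OK) {
        // We now know the opCtx is actually killed.
        // We do not know whether this exception was raised by the opCtx.
    }
    throw; // if appropriate
}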
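
As a rough, compilable illustration of this approach, the sketch below classifies a failure by asking the operation context, never by inspecting the exception. `FakeOpCtx`, `FakeDBException`, `PassResult`, and `runPass` are hypothetical names made up for this example; in the server the check would go through the real opCtx instead.

```cpp
#include <stdexcept>

// Hypothetical stand-ins for illustration only; these are NOT the real
// mongo::OperationContext or mongo::DBException.
enum class KillCode { kOK, kInterrupted };

struct FakeOpCtx {
    KillCode killCode = KillCode::kOK;
    bool isKilled() const { return killCode != KillCode::kOK; }
};

struct FakeDBException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class PassResult { kCompleted, kFailedWhileKilled, kFailedWhileAlive };

// Runs `work` and, on failure, decides what happened by querying the
// operation context directly rather than the category of the exception.
template <typename Work>
PassResult runPass(const FakeOpCtx& opCtx, Work&& work) {
    try {
        work();
        return PassResult::kCompleted;
    } catch (const FakeDBException&) {
        // The exception alone does not tell us whether the opCtx was killed,
        // and it may not have been raised by the opCtx at all.
        return opCtx.isKilled() ? PassResult::kFailedWhileKilled
                                : PassResult::kFailedWhileAlive;
    }
}
```

A caller can then stop quietly on `kFailedWhileKilled` and log or rethrow on `kFailedWhileAlive`, whatever the original error code was.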

 

2) If we are using the `Interruption` category for something that has nothing to do with the opCtx, create a new category in `error_codes.yml` that is tied to the component using the category. Be deliberate about exactly which errors belong to that category. The expansion of the `Interruption` category caused that component's behavior to be altered slightly.
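
The idea of a deliberately scoped category can be sketched in plain C++. In the server the category would be declared in `error_codes.yml` as described above; the enum, the code names, and `isQueueInterruption` below are all hypothetical and exist only to show explicit, component-owned membership.

```cpp
// Hypothetical error codes for illustration; the names are made up and do
// not correspond to real server error codes.
enum class ToyCode {
    kOK,
    kQueueClosed,
    kProducerShutDown,
    kExceededTimeLimit,
    kWriteConflict
};

// A hypothetical category owned by a single component. Membership is an
// explicit allow-list. The switch has no default branch, so compilers that
// warn on unhandled enumerators will flag any newly added code and force a
// deliberate decision about whether it belongs here.
constexpr bool isQueueInterruption(ToyCode code) {
    switch (code) {
        case ToyCode::kQueueClosed:
        case ToyCode::kProducerShutDown:
            return true;
        case ToyCode::kOK:
        case ToyCode::kExceededTimeLimit:
        case ToyCode::kWriteConflict:
            return false;
    }
    return false;  // Only reached for values outside the enumeration.
}

static_assert(isQueueInterruption(ToyCode::kQueueClosed), "member of the category");
static_assert(!isQueueInterruption(ToyCode::kExceededTimeLimit), "not a member");
```

Because the component owns the list, widening some unrelated, shared category can no longer change this component's behavior.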

 

3) If we are encountering an assert that an error we have fits this category, investigate what the error category was meant to indicate and find another thing to assert on, or use a new error category if necessary.

 

4) If none of the above apply, use your best judgement and consider reaching out to matt.diener@mongodb.com to discuss ways this can be resolved.



 Comments   
Comment by Githook User [ 25/Jul/22 ]

Author: Mindaugas Malinauskas <mindaugas.malinauskas@mongodb.com>

Message: SERVER-67615 Stop using ErrorCategory::Interruption in the change stream pre-images purging job
Branch: master
https://github.com/mongodb/mongo/commit/0e465a0d7adefc42b9399fcbde011d712785ad76

Comment by Matt Diener (Inactive) [ 29/Jun/22 ]

This bug belongs to a category of bugs which got divided across teams. This is not blocking feature work, but is related to a problem that people have encountered in multiple BFs and HELP tickets (SERVER-60685 links a few of these).

We'd like to see all of these bugs fixed over the next few months so we can take the final step to remove the error category (Jul-Sept).

Comment by Kyle Suarez [ 29/Jun/22 ]

matt.diener@mongodb.com, what is the priority of this ticket? Is it blocking the implementation of some other feature?

CC mindaugas.malinauskas@mongodb.com as it looks like the file in question here is related to pre-images expiration.
