Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 6.1.0-rc0
Affects Version/s: None
Component/s: None
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Sprint:
Repl 2022-08-08
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Context

`ErrorCategory::Interruption` has had its definition change over time. When investigating SERVER-56251, we determined that some parts of our codebase use `ErrorCategory::Interruption` as an indicator that the opCtx was killed, while other pieces of code simply use it to indicate that something was interrupted in a specific API.

The solution to bugs over time has been to expand this error category to include more errors, which has eroded the utility of this category.

Pieces of code which catch this error category as an indication that the opCtx was killed are unfortunately incorrect because:

An Interruption error can be observed without a killed opCtx (these errors are generic).
A non-`Interruption` error can be observed when the opCtx is killed. It never made that guarantee, and investigations are showing it cannot make that guarantee without even further expanding the `Interruption` category.
Even if the opCtx always threw an `Interruption`-category error, there is nothing stopping some other in-between layer from catching that exception and throwing something else.

Pieces of code which catch this error category as an indication of their API being interrupted are possibly incorrect because the definition of this error category has changed over time.

The end goal of this and related bugs is to eliminate all uses of `ErrorCategory::Interruption` and then remove the category altogether.

Acceptance criteria

Remove references to `ErrorCategory::Interruption` in the files specified below. Understand the intention of the existing usage and use your judgement to re-implement that logic in a more robust way.

Files in question

It's possible not all of these are owned by your team. Please reach out to matt.diener@mongodb.com if we should re-assign a subset of this work elsewhere.

# src/mongo/db/repl
1) oplog_batcher.cpp, OplogBatcher::getNextApplierBatch + OplogBatcher::_run

2) repl_client_info.cpp, ReplClientInfo::setLastOpToSystemLastOpTimeIgnoringInterrupt

3) repl_set_commands.cpp, CmdReplSetTest::run

4) replication_coordinator_impl.cpp, ReplicationCoordinatorImpl::processReplSetGetStatus

Solution(s)

Here are some potential fixes:

1) If we are catching the exception and assuming the opCtx is cancelled, we should catch ALL exceptions and check the opCtx directly:

catch (DBException& e) {
    ...
    if (!opCtx->getKillStatus().OK()) {
        // We now know the opCtx is actually killed.
        // We do not know whether this exception was raised by opCtx.
    }
    throw; // if appropriate
}

2) If we are using the `Interruption` category for something that has nothing to do with the opCtx, create a new category in `error_codes.yml` that is tied to the component using the category. Be deliberate about exactly which errors belong to that category. The expansion of the `Interruption` category caused that component's behavior to be altered slightly.

3) If we are encountering an assert that an error we have fits this category, investigate what the error category was meant to indicate and find another thing to assert on, or use a new error category if necessary.

4) If none of the above apply, use your best judgement, consider reaching out to matt.diener@mongodb.com to discuss ways this can be resolved.

is depended on by

SERVER-56251 Alleviate problems that arise when OperationContext::markKilled is called with a non-Interruption error

Backlog

related to

SERVER-68808 Catch CancellationError in ReplClientInfo

Closed

Assignee:: Sean Zimmerman
Reporter:: Matt Diener (Inactive)
Participants:: Githook User, Matt Diener, Sean Zimmerman
Votes:: 0 Vote for this issue
Watchers:: 5 Start watching this issue

Created:: Jun 28 2022 06:06:54 PM UTC
Updated:: Oct 29 2023 09:36:17 PM UTC
Resolved:: Aug 05 2022 03:08:32 PM UTC

Details

Description

Attachments

Issue Links

Activity

People

Dates