Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-67617

Stop using ErrorCategory::Interruption in Replication codebase

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 6.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels:
    • Fully Compatible
    • ALL
    • Repl 2022-08-08


      `ErrorCategory::Interruption` has had its definition change over time. When investigating SERVER-56251, we determined that some parts of our codebase use `ErrorCategory::Interruption` as an indicator that the opCtx was killed, while other pieces of code simply use it to indicate that something was interrupted in a specific API.

      The solution to bugs over time has been to expand this error category to include more errors, which has eroded the utility of this category.

      Pieces of code which catch this error category as an indication that the opCtx was killed are unfortunately incorrect because:

      1. An Interruption error can be observed without a killed opCtx (these errors are generic).
      2. A non-`Interruption` error can be observed when the opCtx is killed. It never made that guarantee, and investigations are showing it cannot make that guarantee without even further expanding the `Interruption` category.
      3. Even if the opCtx always threw an `Interruption`-category error, there is nothing stopping some other in-between layer from catching that exception and throwing something else.

      Pieces of code which catch this error category as an indication of their API being interrupted are possibly incorrect because the definition of this error category has changed over time.

      The end goal of this and related bugs is to eliminate all uses of `ErrorCategory::Interruption` and then remove the category altogether.


      Acceptance criteria

      Remove references to `ErrorCategory::Interruption` in the files specified below. Understand the intention of the existing usage and use your judgement to re-implement that logic in a more robust way.


      Files in question

      It's possible not all of these are owned by your team. Please reach out to matt.diener@mongodb.com if we should re-assign a subset of this work elsewhere.

      # src/mongo/db/repl
      1) oplog_batcher.cpp, OplogBatcher::getNextApplierBatch + OplogBatcher::_run
      2) repl_client_info.cpp, ReplClientInfo::setLastOpToSystemLastOpTimeIgnoringInterrupt
      3) repl_set_commands.cpp, CmdReplSetTest::run
      4) replication_coordinator_impl.cpp, ReplicationCoordinatorImpl::processReplSetGetStatus 


      Here are some potential fixes:

      1) If we are catching the exception and assuming the opCtx is cancelled, we should catch ALL exceptions and check the opCtx directly:

      catch (DBException& e) {
          if (!opCtx->getKillStatus().OK()) {
              // We now know the opCtx is actually killed.
              // We do not know whether this exception was raised by opCtx.
          throw; // if appropriate


      2) If we are using the `Interruption` category for something that has nothing to do with the opCtx, create a new category in `error_codes.yml` that is tied to the component using the category. Be deliberate about exactly which errors belong to that category. The expansion of the `Interruption` category caused that component's behavior to be altered slightly.


      3) If we are encountering an assert that an error we have fits this category, investigate what the error category was meant to indicate and find another thing to assert on, or use a new error category if necessary.


      4) If none of the above apply, use your best judgement, consider reaching out to matt.diener@mongodb.com to discuss ways this can be resolved.

            sean.zimmerman@mongodb.com Sean Zimmerman
            matt.diener@mongodb.com Matt Diener (Inactive)
            0 Vote for this issue
            5 Start watching this issue