Killing an OplogWriter operation using killOp() results in a crash


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Replication
    • ALL
    • I was able to reproduce this bug by writing a test that pauses the OplogWriter on a secondary node and then uses currentOp() and killOp() to kill the OplogWriter operation.
    • Repl 2025-10-13, Repl 2025-10-27

      In the past few months we've had various AFs (e.g. AF-1462) because a customer was running killOp() commands that killed internal replication operations, which leads to a crash of the mongod process.

      We recently tried to improve logging when this happens (see SERVER-101858), but that change doesn't fix the issue: it catches the exception and then calls fassert().

      When reproducing this bug in a test we get logs like this:

      [js_test:killOp_against_repl_threads] d20041| {"t":{"$date":"2025-10-09T23:32:04.950+00:00"},"s":"I",  "c":"COMMAND",  "id":558700,  "ctx":"conn1","msg":"Successful killOp","attr":{"remote":"127.0.0.1:60526","metadata":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"8.3.0-alpha0"},"os":{"type":"Linux","name":"Ubuntu","architecture":"aarch64","version":"22.04"}},"db":"admin","command":{"killOp":1,"op":38914,"lsid":{"id":{"$uuid":"ee5c83c6-52b8-4f0e-9ef4-5b212507a834"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1760052722,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$readPreference":{"mode":"secondaryPreferred"},"$db":"admin"}}}
      ...
      [js_test:killOp_against_repl_threads] d20041| {"t":{"$date":"2025-10-09T23:32:05.004+00:00"},"s":"I",  "c":"REPL",     "id":10185800,"ctx":"OplogWriter-0","msg":"OplogWriter threw a DBException","attr":{"what":"operation was interrupted","exception":"Interrupted: operation was interrupted"}}
      [js_test:killOp_against_repl_threads] d20041| {"t":{"$date":"2025-10-09T23:32:05.004+00:00"},"s":"F",  "c":"ASSERT",   "id":23089,   "ctx":"OplogWriter-0","msg":"Fatal assertion","attr":{"msgid":10185801,"location":"src/mongo/db/repl/oplog_writer.cpp:62:31:auto mongo::repl::OplogWriter::startup()::(anonymous class)::operator()(const executor::TaskExecutor::CallbackArgs &)"}}
      [js_test:killOp_against_repl_threads] d20041| {"t":{"$date":"2025-10-09T23:32:05.004+00:00"},"s":"F",  "c":"ASSERT",   "id":23090,   "ctx":"OplogWriter-0","msg":"\n\n***aborting after fassert() failure\n\n"}
      [js_test:killOp_against_repl_threads] d20041| {"t":{"$date":"2025-10-09T23:32:05.004+00:00"},"s":"F",  "c":"CONTROL",  "id":6384300, "ctx":"OplogWriter-0","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
      

      Our current documentation states the user should not try to kill internal DB operations, but this is not foolproof.

      I suggest we do one of the following options to prevent this crash from happening:

      1. [ Preferred ] Prevent the killOp() command from killing internal Repl operations.
      2. Only allow users with internal privilege (on top of killop) to kill an internal operation.

      The first option is preferred since I could not find any valid reason for a user or an operator to kill an internal Repl operation.
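
      As a rough illustration of option 1, the check could look like the sketch below. It is not the server's implementation (which is C++); it is written in JavaScript only to match the test. The function names, the prefix list, and the assumption that an operation's `desc` field carries the thread name (e.g. "OplogWriter-0", as in the "ctx" field of the logs above) are all hypothetical.

```javascript
// Hypothetical sketch of option 1: reject killOp for internal replication threads.
// Assumption: `op` is shaped like a currentOp() entry, with `opid` and `desc`,
// and `desc` holds the thread name. Only "OplogWriter" is taken from the logs;
// a real list would need to cover every internal replication thread.
const INTERNAL_REPL_THREAD_PREFIXES = ["OplogWriter"];

// Returns true when the operation appears to belong to an internal repl thread.
function isInternalReplOp(op) {
  const desc = typeof op.desc === "string" ? op.desc : "";
  return INTERNAL_REPL_THREAD_PREFIXES.some((prefix) => desc.startsWith(prefix));
}

// Decides whether a killOp request against `op` should proceed.
// Instead of interrupting the thread (which ends in fassert()), refuse up front.
function checkKillOpAllowed(op) {
  if (isInternalReplOp(op)) {
    return {
      ok: false,
      reason: `refusing to kill internal replication operation ${op.opid} (${op.desc})`,
    };
  }
  return { ok: true };
}
```

      With this shape, a request against an operation whose `desc` is "OplogWriter-0" would be refused with an error, while a request against a regular client operation (e.g. `desc` "conn1") would go through unchanged. Option 2 would replace the unconditional refusal with an additional privilege check on the caller.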

            Assignee:
            Pierre Turin
            Reporter:
            Pierre Turin