[SERVER-84771] killOp during index build on SECONDARY raises assertion Created: 11/Jan/24  Updated: 16/Jan/24

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: 5.0.15
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vinicius Grippa Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Storage Execution
Operating System: ALL
Steps To Reproduce:
  • Create a dummy collection (for example, with mgeneratejs)
  • Create an index on the PRIMARY
    • replset [direct: primary] example> db.vinnie.createIndex({"employees.name": 1, "employees.position": 1 , "employees.age": 1})

  • Kill the operation on the SECONDARY
    • replset [direct: secondary] example> db.killOp(99491)
      {
        info: 'attempting to kill op',
        ok: 1,
        '$clusterTime': {
          clusterTime: Timestamp({ t: 1704994979, i: 1 }),
          signature: {
            hash: Binary(Buffer.from("26698c323652adfdb3a899aeab2335235d783b28", "hex"), 0),
            keyId: Long("7322887237878677508")
          }
        },
        operationTime: Timestamp({ t: 1704994979, i: 1 })
      }
      replset [direct: secondary] example>
      example> 
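
The opid passed to db.killOp() above has to be looked up on the secondary first; the ticket does not show that step. A minimal sketch of one way to list in-progress index builds, assuming the currentOp filter pattern that the MongoDB manual gives for active indexing operations. The helper name indexBuildFilter is hypothetical, and the fields reported on a secondary may differ by server version:

```javascript
// Hypothetical helper: builds a currentOp command document that matches
// index builds. The filter shape is an assumption based on the MongoDB
// manual's example for active indexing operations.
function indexBuildFilter() {
  return {
    currentOp: true,
    $or: [
      // Index builds started directly by a createIndexes command.
      { op: "command", "command.createIndexes": { $exists: true } },
      // Index builds running on internal threads (assumed to cover secondaries).
      { op: "none", msg: /^Index Build/ }
    ]
  };
}

// `db` only exists inside mongosh; outside of it this block is skipped.
if (typeof db !== "undefined") {
  const res = db.adminCommand(indexBuildFilter());
  // Print one line per matching operation so the opid can be read off.
  res.inprog.forEach((op) => print(`${op.opid}  ${op.ns}  ${op.msg || ""}`));
}
```

Run against the secondary, this would be expected to list the opid that the killOp step then targets.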

Participants:

 Description   

Running db.killOp() on a SECONDARY to kill an in-progress index build crashes the mongod process with a fatal assertion (fassert). The stack trace logged at the time of the crash:

{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"555656425F25","b":"555652533000","o":"3EF2F25","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.361","s+":"215"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"5556564289A9","b":"555652533000","o":"3EF59A9","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"555656420DC6","b":"555652533000","o":"3EEDDC6","s":"abruptQuit","s+":"66"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"7FE510E99630","b":"7FE510E8A000","o":"F630","s":"_L_unlock_13","s+":"34"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"7FE510AF2387","b":"7FE510ABC000","o":"36387","s":"gsignal","s+":"37"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"7FE510AF3A78","b":"7FE510ABC000","o":"37A78","s":"abort","s+":"148"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"555653916865","b":"555652533000","o":"13E3865","s":"_ZN5mongo35fassertFailedWithStatusWithLocationEiRKNS_6StatusEPKcj","s+":"144"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"555653644087","b":"555652533000","o":"1111087","s":"_ZN5mongo22IndexBuildsCoordinator28_cleanUpTwoPhaseAfterFailureEPNS_16OperationContextERKNS_13CollectionPtrESt10shared_ptrINS_19ReplIndexBuildStateEERKNS0_17IndexBuildOptionsERKNS_6StatusE.cold.2338","s+":"19"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"555654A1ACBE","b":"555652533000","o":"24E7CBE","s":"_ZN5mongo22IndexBuildsCoordinator19_runIndexBuildInnerEPNS_16OperationContextESt10shared_ptrINS_19ReplIndexBuildStateEERKNS0_17IndexBuildOptionsERKN5boost8optionalINS_15ResumeIndexInfoEEE","s+":"5BE"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"555654A1B36B","b":"555652533000","o":"24E836B","s":"_ZN5mongo22IndexBuildsCoordinator14_runIndexBuildEPNS_16OperationContextERKNS_4UUIDERKNS0_17IndexBuildOptionsERKN5boost8optionalINS_15ResumeIndexInfoEEE","s+":"1FB"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"5556547214F1","b":"555652533000","o":"21EE4F1","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_28IndexBuildsCoordinatorMongod16_startIndexBuildEPNS_16OperationContextENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4UUIDERKSt6vectorINS_7BSONObjESaISG_EERKSE_NS_18IndexBuildProtocolENS_22IndexBuildsCoordinator17IndexBuildOptionsERKN5boost8optionalINS_15ResumeIndexInfoEEEEUlT_E5_EEDaOSW_EN12SpecificImpl4callEOS1_","s+":"371"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"5556561DFF65","b":"555652533000","o":"3CACF65","s":"_ZN5mongo10ThreadPool4Impl10_doOneTaskEPSt11unique_lockINS_12latch_detail5LatchEE","s+":"135"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"5556561E17BB","b":"555652533000","o":"3CAE7BB","s":"_ZN5mongo10ThreadPool4Impl13_consumeTasksEv","s+":"8B"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"5556561E2CDC","b":"555652533000","o":"3CAFCDC","s":"_ZN5mongo10ThreadPool4Impl17_workerThreadBodyERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE","s+":"26C"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"5556561E3280","b":"555652533000","o":"3CB0280","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_10ThreadPool4Impl25_startWorkerThread_inlockEvEUlvE2_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"60"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"5556565D12BF","b":"555652533000","o":"409E2BF","s":"execute_native_thread_routine","s+":"F"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"7FE510E91EA5","b":"7FE510E8A000","o":"7EA5","s":"start_thread","s+":"C5"}}}
{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"7FE510BBA9FD","b":"7FE510ABC000","o":"FE9FD","s":"clone","s+":"6D"}}} 
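
The frames above are easier to read as a plain list of symbols. A minimal sketch that pulls the "s" (symbol) field out of each "Frame" log line; the function name frameSymbols is hypothetical, and the two sample lines are copied from the trace above:

```javascript
// Hypothetical helper: extracts the symbol name from every structured
// log line whose msg is "Frame". Lines that are not JSON objects are skipped.
function frameSymbols(logText) {
  return logText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("{"))
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.msg === "Frame" && entry.attr && entry.attr.frame)
    .map((entry) => entry.attr.frame.s);
}

// Two frames copied verbatim from the stack trace in this ticket.
const sample = [
  '{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"555656420DC6","b":"555652533000","o":"3EEDDC6","s":"abruptQuit","s+":"66"}}}',
  '{"t":{"$date":"2024-01-11T12:43:03.584-05:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"IndexBuildsCoordinatorMongod-3","msg":"Frame","attr":{"frame":{"a":"7FE510AF3A78","b":"7FE510ABC000","o":"37A78","s":"abort","s+":"148"}}}'
].join("\n");

// One symbol per line: abruptQuit, then abort.
console.log(frameSymbols(sample).join("\n"));
```

Applied to the full trace, the mangled C++ names (those starting with _ZN) could then be passed to a demangler such as c++filt. Read top to bottom, the trace shows IndexBuildsCoordinator::_cleanUpTwoPhaseAfterFailure calling fassertFailedWithStatusWithLocation, which aborts the process.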

 



 Comments   
Comment by Louis Williams [ 11/Jan/24 ]

vgrippa@gmail.com, yes, I agree, it is not the most elegant behavior to crash when something isn't allowed. Luckily, we have improved this significantly in 7.1 so that the server crashes only in rare circumstances. Unfortunately, the killOp command is extremely forceful, and when interrupted, an operation cannot ignore that signal. But if a secondary stops building an index, the entire replica set will not make progress. This is why we recommend using the dropIndexes command instead, which is coordinated from the primary to ensure the index is safely canceled on all nodes in the replica set.

Even so, in cases where killOp currently crashes the server, one of our only options would be to restart the index build internally, rather than crash, because the secondary cannot independently choose not to build an index. While this would prevent the server from crashing, I fear that this still would not result in the behavior that you want, which is to cancel the index build.

We can keep this ticket open since there are still cases where killOp crashes the server, and we do want to eliminate those cases.

Comment by Vinicius Grippa [ 11/Jan/24 ]

I don't think an assertion and a crash are the best way to signal that an operation is not allowed in the database.

An error message should be thrown, and the operation should not be allowed to continue.

Comment by Louis Williams [ 11/Jan/24 ]

Hi vgrippa@gmail.com, this is expected behavior. Starting in 4.4, index builds are coordinated across all replica set nodes by the primary node. Forcefully killing an index build on a secondary prevents the index build from completing on other nodes, so the server must crash as a result. When the process restarts, the index build will resume until completion.

If you wish to cancel an in-progress index build, please use the dropIndexes command on the primary. killOp can also be used against the primary safely, but not on secondaries.
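
A minimal sketch of the recommended cancellation path, using the collection and key pattern from the steps to reproduce. It assumes the index was created without a custom name, so MongoDB's default name (each field and direction joined with underscores) applies; the helper names defaultIndexName and dropIndexCommand are hypothetical:

```javascript
// Hypothetical helper: derives the default index name from a key pattern,
// e.g. { a: 1, b: -1 } becomes "a_1_b_-1". Assumes no custom name was set.
function defaultIndexName(keyPattern) {
  return Object.entries(keyPattern)
    .map(([field, direction]) => `${field}_${direction}`)
    .join("_");
}

// Hypothetical helper: builds the dropIndexes command document for one index.
function dropIndexCommand(collectionName, keyPattern) {
  return { dropIndexes: collectionName, index: defaultIndexName(keyPattern) };
}

// Collection and keys taken from the createIndex call in this ticket.
const keys = { "employees.name": 1, "employees.position": 1, "employees.age": 1 };
const cmd = dropIndexCommand("vinnie", keys);

// `db` only exists inside mongosh. Run this while connected to the PRIMARY,
// not to a secondary.
if (typeof db !== "undefined") {
  printjson(db.runCommand(cmd));
}
```

Because the command is issued on the primary, the abort is replicated to the secondaries instead of being forced on a single node.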

Comment by Vinicius Grippa [ 11/Jan/24 ]

Reproducible on 6.0.

Generated at Thu Feb 08 06:55:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.