[SERVER-73036] Investigate potential deadlock with index builds Created: 19/Jan/23  Updated: 29/Oct/23  Resolved: 28/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.0.0-rc0

Type: Task Priority: Major - P3
Reporter: Jordi Olivares Provencio Assignee: Josef Ahmad
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File diff.patch    
Issue Links:
Related
related to SERVER-73294 Signal primary node for aborting inde... Closed
related to SERVER-74657 revisit if thread marked as unkillabl... Open
related to SERVER-71198 Assert that unkillable operations tha... Backlog
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2023-04-03
Participants:

 Description   

While working on SERVER-71198, we found that the deadlock described in SERVER-71191 and SERVER-44722 could theoretically also occur during index cleanup in a rollback.

With the attached patch applied, running resmoke with --suite=replica_sets jstests/replsets/rollback_index_build_start_abort.js triggers an invariant failure. The following two lines from the resulting backtrace are relevant:

[js_test:geo_near_bounds_overflow] Fixture status:
...
[j7:prim] | 2023-01-19T14:29:54.969+00:00 I  CONTROL  31445   [IndexBuildsCoordinatorMongod-0] "Frame","attr":{"frame":{"a":"7FA8C4FA0649","b":"7FA8C4F56000","o":"4A649","s":"_ZN5mongo22IndexBuildsCoordinator28_cleanUpTwoPhaseAfterFailureEPNS_16OperationContextERKNS_13CollectionPtrESt10shared_ptrINS_19ReplIndexBuildStateEERKNS0_17IndexBuildOptionsERKNS_6StatusE","C":"mongo::IndexBuildsCoordinator::_cleanUpTwoPhaseAfterFailure(mongo::OperationContext*, mongo::CollectionPtr const&, std::shared_ptr<mongo::ReplIndexBuildState>, mongo::IndexBuildsCoordinator::IndexBuildOptions const&, mongo::Status const&)","s+":"1F9"}}
[j7:prim] | 2023-01-19T14:29:54.969+00:00 I  CONTROL  31445   [IndexBuildsCoordinatorMongod-0] "Frame","attr":{"frame":{"a":"7FA8C4FABA07","b":"7FA8C4F56000","o":"55A07","s":"_ZN5mongo22IndexBuildsCoordinator16_setUpIndexBuildEPNS_16OperationContextERKNS_4UUIDENS_9TimestampERKNS0_17IndexBuildOptionsE.cold","C":"mongo::IndexBuildsCoordinator::_setUpIndexBuild(mongo::OperationContext*, mongo::UUID const&, mongo::Timestamp, mongo::IndexBuildsCoordinator::IndexBuildOptions const&) [clone .cold]","s+":"2A5"}}
...
Symbolization process completed.



 Comments   
Comment by Githook User [ 28/Mar/23 ]

Author: Josef Ahmad <josef.ahmad@mongodb.com> (josefahmad)

Message: SERVER-73036 Make killOp on createIndexes interruptible by stepdown
Branch: master
https://github.com/mongodb/mongo/commit/058896012d3dc2cffeeb07cbae69d2eab8c278a1

Comment by Josef Ahmad [ 22/Mar/23 ]

The bulk of this issue was resolved by SERVER-73294: an index build is now aborted via the voteAbortIndexBuild command (which loops back if the caller is the primary). Unlike the index build coordinator thread, the voteAbortIndexBuild thread is interruptible by stepdown, which eliminates a necessary condition for the deadlock.

That said, the reproducer still reports a second positive when killOp is run against the user's createIndexes command, because that client is marked as uninterruptible by stepdown. SERVER-74657 is a related ticket to revisit whether that client (among others) actually needs to be uninterruptible; I'll settle that question as part of this ticket.
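
For readers unfamiliar with the failure mode, the necessary condition discussed in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server's logic: the `Op` class, its fields, and `stepdown_can_proceed` are hypothetical names, and the model assumes that stepdown needs a lock the operation holds and that the operation cannot finish until the stepdown completes.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Toy stand-in for a server operation (hypothetical, not a MongoDB type)."""
    name: str
    interruptible_by_stepdown: bool
    holds_lock_stepdown_needs: bool = True  # assumption of this model
    waiting_on_stepdown: bool = True        # assumption: op cannot finish until stepdown completes


def stepdown_can_proceed(ops):
    """Return True if stepdown can obtain its lock, interrupting ops where allowed."""
    for op in ops:
        if not op.holds_lock_stepdown_needs:
            continue  # op is not in the way
        if op.interruptible_by_stepdown:
            continue  # stepdown interrupts the op, which then releases the lock
        if op.waiting_on_stepdown:
            return False  # op waits on stepdown and stepdown waits on op
    return True


# Thread names taken from this ticket; the interruptibility flags mirror the comment above.
coordinator = Op("IndexBuildsCoordinatorMongod-0", interruptible_by_stepdown=False)
vote_abort = Op("voteAbortIndexBuild", interruptible_by_stepdown=True)

print(stepdown_can_proceed([coordinator]))  # False
print(stepdown_can_proceed([vote_abort]))   # True
```

In this model, making the blocked operation interruptible is enough to break the cycle, which matches the reasoning for both the voteAbortIndexBuild change and the createIndexes client.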
