[SERVER-70127] Default system operations to be killable by stepdown Created: 30/Sep/22  Updated: 29/Oct/23  Resolved: 26/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Louis Williams Assignee: Jiawei Yang
Resolution: Fixed Votes: 1
Labels: repl-shortlist
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-74657 revisit if thread marked as unkillabl... Open
is depended on by SERVER-74658 revisit if thread marked as unkillabl... Open
is depended on by SERVER-74659 revisit if thread marked as unkillabl... Open
is depended on by SERVER-74656 revisit if thread marked as unkillabl... Closed
is depended on by SERVER-74660 revisit if thread marked as unkillabl... Closed
is depended on by SERVER-74661 revisit if thread marked as unkillabl... Closed
is depended on by SERVER-74662 Query work to revisit if threads curr... Closed
is depended on by SERVER-74953 Explore avoiding stepdowns during the... Closed
Problem/Incident
causes SERVER-75352 Make OplogBatcher's ReplBatcher threa... Closed
Related
related to SERVER-43174 Designate the MigrationDestinationMan... Closed
related to SERVER-58143 shardsvrDropCollectionParticipant sho... Closed
related to SERVER-58775 Mark ConfigsvrSetAllowMigrationsComma... Closed
related to SERVER-59635 Mark ConfigSvrMoveChunkCommand as int... Closed
related to SERVER-60521 Deadlock on stepup due to moveChunk c... Closed
related to SERVER-79026 Failing to cancel the JournalFlusher ... Closed
is related to SERVER-60161 Deadlock between config server stepdo... Closed
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Sprint: Repl 2023-03-06, Repl 2023-03-20, Repl 2023-05-01
Participants:
Linked BF Score: 135

 Description   

There are currently roughly 30 non-test calls to setSystemOperationKillableByStepdown(). Every time we introduce a new thread, there’s a non-obvious requirement to call that function.

Failing to do so crashes the process if the operation hits a prepare conflict. This is a rare occurrence, so we risk not catching these crashes in testing. Beyond the visual clutter, the API makes it easy for developers to create new internal threads that are unkillable when they shouldn't be.

It seems that only a few system operations actually need to be unkillable, and the vast majority of threads should be killable.

We should consider changing the default so that system operations are always killable, and have the limited set of special operations explicitly opt in to being unkillable.
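
The proposed default can be illustrated with a minimal, self-contained C++ sketch. This is not the server's actual code: SystemOperation, setUnkillableByStepdown(), tryKillForStepdown() and killOpsForStepdown() are hypothetical names invented for illustration, and the real change may be structured quite differently. The sketch only shows the shape of the idea, namely that an operation starts out killable and a special operation must opt out explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a system operation. This is NOT the real
// mongo::OperationContext; every name here is an assumption.
class SystemOperation {
public:
    explicit SystemOperation(std::string name) : _name(std::move(name)) {}

    // Explicit opt-out for the few operations that must survive a stepdown.
    void setUnkillableByStepdown() {
        _killableByStepdown = false;
    }

    bool isKillableByStepdown() const {
        return _killableByStepdown;
    }

    bool isKilled() const {
        return _killed;
    }

    const std::string& name() const {
        return _name;
    }

    // Models the stepdown path asking this operation to stop.
    // Returns true if the operation was interrupted.
    bool tryKillForStepdown() {
        if (!_killableByStepdown) {
            return false;
        }
        _killed = true;
        return true;
    }

private:
    std::string _name;
    bool _killableByStepdown = true;  // Proposed default: killable.
    bool _killed = false;
};

// Attempts to interrupt every operation and returns how many were killed.
inline std::size_t killOpsForStepdown(std::vector<SystemOperation>& ops) {
    std::size_t killed = 0;
    for (auto& op : ops) {
        if (op.tryKillForStepdown()) {
            ++killed;
        }
    }
    return killed;
}
```

With this shape, a newly added thread that does nothing special is interrupted by stepdown, and forgetting a call can no longer produce an unkillable operation by accident. The cost is that each genuinely unkillable operation has to be identified and marked, which is what the linked "revisit if thread marked as unkillable" tickets track.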



 Comments   
Comment by Githook User [ 19/May/23 ]

Author:

{'name': 'Josef Ahmad', 'email': 'josef.ahmad@mongodb.com', 'username': 'josefahmad'}

Message: SERVER-77083 Index build stepup async task should tolerate stepdowns

(cherry picked from commit https://github.com/10gen/mongo/commit/846ed8250b4c3322e52c2e51c4a6992c6ce7ba34)

This cherry-pick also marks the stepup async task as killable by stepdown,
because this revision predates SERVER-70127 which defaults system operations
to be killable by stepdown.

Conflicts:
src/mongo/db/index_builds_coordinator.cpp
Branch: v7.0
https://github.com/mongodb/mongo/commit/dee4cd988431e8753bde5929216a74400561ba19

Comment by Githook User [ 26/Apr/23 ]

Author:

{'name': 'Jiawei Yang', 'email': 'jiawei.yang@mongodb.com', 'username': 'YoungYang0820'}

Message: SERVER-70127 change system operations to be killable by default
Branch: master
https://github.com/mongodb/mongo/commit/606e34054ef33e59b78715263b125ff7ebea1394

Comment by Githook User [ 25/Apr/23 ]

Author:

{'name': 'Sviatlana Zuiko', 'email': 'sviatlana.zuiko@mongodb.com', 'username': 'szuiko'}

Message: Revert "SERVER-70127 change system operations to be killable by default"

This reverts commit c35bad3b048e8d885bf0b7517aacd2349ea81d14.
Branch: master
https://github.com/mongodb/mongo/commit/0ed3c5ba08d56e308bf05959932b34d8d1e6040e

Comment by Githook User [ 25/Apr/23 ]

Author:

{'name': 'Jiawei Yang', 'email': 'jiawei.yang@mongodb.com', 'username': 'YoungYang0820'}

Message: SERVER-70127 change system operations to be killable by default
Branch: master
https://github.com/mongodb/mongo/commit/c35bad3b048e8d885bf0b7517aacd2349ea81d14

Comment by Jiawei Yang [ 13/Apr/23 ]

This is reverted for safely shipping 7.0.0-rc0 and will be recommitted after rc0 branch cut.

Comment by Jiawei Yang [ 05/Apr/23 ]

Hi yujin.kang@mongodb.com, thanks for asking. This is planned to be done soon after the rc0 branch cut.

Comment by Githook User [ 30/Mar/23 ]

Author:

{'name': 'Jiawei Yang', 'email': 'jiawei.yang@mongodb.com', 'username': 'YoungYang0820'}

Message: Revert "SERVER-70127 change system operation threads to be killable by default"

This reverts commit 9f2867c9da77e2d64df3852f7d4578f10e6f0817.

Revert "SERVER-75352 OplogBatcher's ReplBatcher thread should be unkillable"

This reverts commit 26266d5b736f90961a328399dea5d299cd407ab2.
Branch: master
https://github.com/mongodb/mongo/commit/b5f1d6bb8c06742cde53f028fd266eff584a2537

Comment by Githook User [ 13/Mar/23 ]

Author:

{'name': 'Jiawei Yang', 'email': 'jiawei.yang@mongodb.com', 'username': 'YoungYang0820'}

Message: SERVER-70127 change system operation threads to be killable by default
Branch: master
https://github.com/mongodb/mongo/commit/9f2867c9da77e2d64df3852f7d4578f10e6f0817

Generated at Thu Feb 08 06:15:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.