[SERVER-69496] InterruptedAtShutdown can be thrown without the operation context being marked as killed Created: 07/Sep/22  Updated: 27/Oct/23  Resolved: 03/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Gregory Noma Assignee: Matt Diener (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-70010 Stop using getKillStatus to check for... Closed
Related
related to SERVER-68808 Catch CancellationError in ReplClient... Closed
related to SERVER-69505 Temporarily allow InterruptedAtShutdo... Closed
related to SERVER-69593 Temporarily allow InterruptedAtShutdo... Closed
related to SERVER-69839 Complete TODO listed in SERVER-69496 Closed
related to SERVER-69504 Leave index build phase intact upon e... Closed
Operating System: ALL
Sprint: Service Arch 2022-09-19, Service Arch 2022-10-03, Service Arch 2022-10-17
Participants:
Linked BF Score: 151

 Description   

During shutdown, we first set the global kill flag and then mark each operation context as killed. In OperationContext::checkForInterruptNoAssert, we return InterruptedAtShutdown if the global kill flag is set. However, we do not mark the operation context as killed from here. This means that if an operation explicitly checks for interrupts and we get an interleaving where this occurs between the two steps in ServiceContext::setKillAllOperations, that operation will get an InterruptedAtShutdown exception with the operation context not (yet) marked as killed.



 Comments   
Comment by Matt Diener (Inactive) [ 03/Oct/22 ]

SERVER-70010 is fixed. As much as this conflation between Killed and Interrupted is likely to cause confusion again in the future, and as complex as the distinction may be. Fixing it is not a short-term endeavor.

Instead of this task, SERVER-70010 should keep us in good shape while SA takes this feedback into account and carries forward existing initiatives to take a 2nd look at interruption semantics, error handling semantics, and complexity around the OperationContext.

Comment by Matt Diener (Inactive) [ 29/Sep/22 ]

Update: sent out 3 PRs for SERVER-70010. Once these are closed, this bug can be closed as "Works as Designed". Code using killStatus because of SERVER-56251 will no longer exist.

Comment by Matt Diener (Inactive) [ 27/Sep/22 ]

Moving this back to open, but keeping it open until the completion of SERVER-70010. Then I will close as "Works as Designed"

Comment by Louis Williams [ 21/Sep/22 ]

Does this mean that the current requirement for every catch() that checks isKillPending must also check whether the thrown exception was InterruptedAtShutdown? I feel like this is quite risky for managing exceptions in the server and determining whether or not an operation was interrupted. We already had to handle some test fallout from this requirement (e.g. SERVER-69593 and SERVER-69505) from the changes we made in our codebase. I fear that the new expectations around interruptions are more complicated than previous API of checking for the Interruption error category and risk uncaught exceptions escaping threads.

Comment by Max Hirschhorn [ 20/Sep/22 ]

If we are asking whether work associated with an opCtx should stop, we should use `checkForInterruptNoAssert` to get a Status. If we care about whether the opCtx is killed or explicitly want to know the reason it has been killed, use `getKillStatus()`.

Usage of both may be required for some use cases, as `checkForInterruptNoAssert` doesn't guarantee it'll return the kill Status.

matt.diener@mongodb.com, can you clarify about the instructions and possible solutions in the tickets linked to SERVER-56251 (e.g. SERVER-67617)? Calling OperationContext::getKillStatus() is what had been proposed there.

Comment by Matt Diener (Inactive) [ 20/Sep/22 ]

The design is unfortunately confusing but there is a distinction between Interrupted and Killed in the opCtx.

If we are asking whether work associated with an opCtx should stop, we should use `checkForInterruptNoAssert` to get a Status. If we care about whether the opCtx is killed or explicitly want to know the reason it has been killed, use `getKillStatus()`.

Usage of both may be required for some use cases, as `checkForInterruptNoAssert` doesn't guarantee it'll return the kill Status.

Generated at Thu Feb 08 06:13:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.