[SERVER-71946] Async RPC expects all executor shutdown errors to be in the "cancellation" category, but those from ScopedTaskExecutor may not be Created: 07/Dec/22 Updated: 29/Oct/23 Resolved: 03/Feb/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 6.3.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | George Wangensteen | Assignee: | Alex Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Assigned Teams: |
Service Arch
|
||||
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Sprint: | Service Arch 2022-12-26, Service Arch 2023-01-09, Service Arch 2023-01-23, Service Arch 2023-02-06 | ||||
| Participants: | |||||
| Linked BF Score: | 10 | ||||
| Description |
|
The async_rpc library should not expose any errors other than RemoteCommandExecutionError, with additional information contained in the ErrorExtraInfo for that type. However, executor shutdown can prevent earlier continuations that should correctly process the error from running, so at the API boundary we re-write such executor shutdown errors before exiting the library. Because we rely on the internals of the library to append correct information to the EEI for non-executor-shutdown errors, we try to expect (in debug mode) that all errors recieved at this boundary are either RemoteCommandExecutionError, or executor shutdown errors. We best-effort check if an error is an executor-shutdown error by checking if it's in the Cancellation error category. However, ScopedTaskExecutor can use arbitrary error codes to reject work at shutdown, so it violates the assumption that "executor shutdown errors are in the Cancellation category." One is example is it uses the (non-cancellation) InterruptedDueToReplStateChange when it is used in PrimaryOnlyService.
To fix this, we should consider either relaxing or removing the async_rpc expectation on executor-shutdown errors, or add InterruptedDueToReplStateChange to the "cancellation" error category and ensure future error codes used for ScopedTaskExecutorShutdown are in this category as well. |
| Comments |
| Comment by Githook User [ 03/Feb/23 ] | |||
|
Author: {'name': 'Alex Li', 'email': 'alex.li@mongodb.com', 'username': 'lia394126'}Message: | |||
| Comment by Alex Li [ 31/Jan/23 ] | |||
|
Changed kExecutorShutdownStatus (code here) inside PrimaryOnlyService from InterruptedDueToReplStateChange to CallbackCanceled in order to pass an assertion placed in async_rpc::sendCommand that guarantees all non RemoteCommandExecutionErrors are CancellationErrors. An invariant was also placed on ScopedTaskExecutor to enforce CancellationError on shutdown. The change is supported by the fact that the current usage of InterruptedDueToReplStateChange in PrimaryOnlyService violates the OutOfLineExecutor::schedule contract that non OK statuses must be cancellation errors. Also, the kExecutorShutdownStatus change should not introduce any problems, because no PrimaryOnlyService can reliably predict whether it would get CallbackCanceled or InterruptedDueToReplStateChange at executor shutdown, so support for CallbackCanceled should already be in place. | |||
| Comment by Max Hirschhorn [ 18/Jan/23 ] | |||
|
I like Alex's proposed change to ScopedTaskExecutor to verify the supplied shutdownStatus is in the CancellationError category. The usage of InterruptedDueToReplStateChange by kExecutorShutdownStatus in primary_only_service.cpp is currently a violation of the TaskExecutor::schedule() interface contract.
I also buy into, from TaskExecutor::schedule()'s perspective, any error code is safe to add to the CancellationError category because the interface is about OK versus non-OK Statuses. (Interestingly, ShardServerCatalogCacheLoader actually bothers to care a non-OK Status is in the CancellationError category.) What I would be nervous about for growing the definition of the CancellationError category is ExecutorFutures mask the origin of the error for whether it came from TaskExecutor::schedule() or the callback itself. A relatively small number of commands internally run work on executors where an error Status in the CancellationError category has tended to be a "local error" and other errors are "remote errors". This naturally isn't a precise definition but I worry about changes to that assumption. Given that PrimaryOnlyServices rely on CancellationToken and future_util::withCancellation() always uses CallbackCanceled, I claim no PrimaryOnlyService can reliably predict whether it would get CallbackCanceled or InterruptedDueToReplStateChange at executor shutdown. And that any robust service communicating with another server must retry on both errors to learn the definitive result (e.g. | |||
| Comment by George Wangensteen [ 13/Jan/23 ] | |||
|
alex.li@mongodb.com and I spoke about this a bit today. max.hirschhorn@mongodb.com , if I understand your concern correctly, you're worried that: -> We will add errors to the Cancellation category because it suits our needs here (and others may do so in the future) -> These changes may violate others' expectations for what error codes can be used to mark cancellation/what the ErrorCategory::CancellationError means -> As a result, new or existing code based on those expectations will break - just as we saw bugs arise when people's/code's expectations for ErrorCategory::Interruption did not meet what codes were in the category after people (over time) added codes to the category to fix bugs, leading to SERVER-56251.
For context, the assertion that's failing here is meant to check that an error is originating from executor shutdown. The reason for this assertion is that we expect the API to re-write other errors into a particular code as appropriate; if that re-writing is unable to happen, it must because the executor refused to run the code that does the re-writing. We want to make sure that only executor-shutdown errors reach this point without being re-written; otherwise, we risk erroneously rewriting other errors here after we have potentially lost some context about them, when we should have done that rewriting earlier. I used "CancellationError" because I thought (based on comments in taskexecutor and others) that all executor-shutdown errors were cancellation error.
I think the concern about bloating CancellationError probably outweighs the risk of doing lossy rewriting of other errors due to some bug where those errors are escaping. So I think I'd vote to just remove the assertion, accept that we can't currently tell if an error is an executor-shutdown error, and maybe file a ticket to create a new category of "ExecutorShutdownError" and add a TODO to the code. But some other options would be to: -> Accept that InterruptedDueToReplStateChange effectively is a cancellation error because it's used to cancel work; ensure other errors used by ScopedTaskExecutor in the future are already in this category, and add InterruptedDueToReplStateChange to the CancellationCategory (essentially Alex' patch. My one concern here would be is there an issue with mixing Cancellation and Retryable errors ?) -> Add a special case to the assert saying InterruptedDueToReplStateChange is allowed because of ScopedTaskExecutor. This makes the BF go away, leaves the error category alone, but makes the code less obvious and potentially increases the maintenance burden very marginally (if people add new STE with custom shutdown codes and use them with this API, those codes will need to be special cased as well).
max.hirschhorn@mongodb.com , can you confirm I understood you correctly? Which of these solutions sounds the least concerning to you given that vs. the other concerns? Thanks for bringing this up! | |||
| Comment by Max Hirschhorn [ 12/Jan/23 ] | |||
|
Are we at risk of ErrorCategory::CancellationError being like ErrorCategory::Interruption (SERVER-56251, |