[SERVER-33522] Possible to call TaskExecutor::signalEvent twice during shutdown Created: 27/Feb/18  Updated: 29/Oct/23  Resolved: 18/May/22

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Ian Boros Assignee: Amirsaman Memaripour
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-25497 Fix sharded query path to handle shut... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Service Arch 2022-05-30
Participants:

 Description   

When a mongos shuts down, it only attempts to join with client threads when ASAN is enabled, and even then, it does so with a timeout before exiting the process. Before this happens, it calls shutdownAndJoin on the TaskExecutorPool. Therefore, client threads may still be running while the ThreadPoolTaskExecutor is in a call to join(). If join() completes (and as part of completing, signals all of the unsignaled events) just before a client thread tries to signal an event, the client thread will signal the event for a second time and trip an invariant(). I believe this is a bug in TaskExecutor rather than a misuse of it.

One way to solve this would be to make signalEvent() a no-op when the TaskExecutor is in shutdown. This way we guarantee every event is signaled exactly once: Either it is signaled before shutdown, or it is signaled as part of shutdown, and all subsequent calls to signalEvent() don't do anything.
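A minimal sketch of this first option, assuming a toy executor rather than the real ThreadPoolTaskExecutor. The names ToyEvent, ToyExecutor, and signalCount are hypothetical and exist only for illustration; the point is that the shutdown flag and the signaling share one mutex, so a late signalEvent() either runs before shutdown or is skipped.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for an executor event; not MongoDB's real event type.
struct ToyEvent {
    int signalCount = 0;  // How many times the event has been signaled.
};

// Hypothetical toy executor: signalEvent() becomes a no-op once shutdown starts.
class ToyExecutor {
public:
    std::shared_ptr<ToyEvent> makeEvent() {
        std::lock_guard<std::mutex> lk(_mutex);
        auto event = std::make_shared<ToyEvent>();
        _events.push_back(event);
        return event;
    }

    // Returns true if this call signaled the event, false if it was skipped.
    bool signalEvent(const std::shared_ptr<ToyEvent>& event) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_inShutdown) {
            // Shutdown has already signaled every outstanding event.
            return false;
        }
        assert(event->signalCount == 0);  // Models the invariant: one signal only.
        ++event->signalCount;
        return true;
    }

    // Marks the executor as shut down and signals every still-unsignaled event.
    void shutdownAndJoin() {
        std::lock_guard<std::mutex> lk(_mutex);
        _inShutdown = true;
        for (auto& event : _events) {
            if (event->signalCount == 0) {
                ++event->signalCount;
            }
        }
    }

private:
    std::mutex _mutex;
    bool _inShutdown = false;
    std::vector<std::shared_ptr<ToyEvent>> _events;
};
```

In this sketch a client that calls signalEvent() after shutdownAndJoin() gets false back instead of tripping the assertion, and each event still ends up signaled exactly once.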

Another way of solving this would be to change the order of shutdown, so that we join with all client threads before shutting down the TaskExecutor. Right now, we don't even attempt to join with client threads unless we're running under ASAN, and even then, we do so with a timeout, so this would be a significant change.
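The second option can be sketched in the same spirit. This is an illustration under assumptions, not mongos's actual shutdown code: CountedEvent, StrictExecutor, and orderedShutdown are hypothetical names, and std::thread stands in for client threads. Here signal() has no shutdown check of its own, so correctness depends entirely on joining every client before shutdown() runs.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical event that counts how many times it has been signaled.
struct CountedEvent {
    std::atomic<int> signalCount{0};
};

// Hypothetical executor whose shutdown signals every still-unsignaled event.
// signal() does not check for shutdown, so callers must finish before shutdown().
class StrictExecutor {
public:
    explicit StrictExecutor(std::size_t numEvents) : _events(numEvents) {}

    void signal(std::size_t i) {
        _events[i].signalCount.fetch_add(1);
    }

    void shutdown() {
        for (auto& event : _events) {
            if (event.signalCount.load() == 0) {
                event.signalCount.fetch_add(1);
            }
        }
    }

    int signalCount(std::size_t i) const {
        return _events[i].signalCount.load();
    }

private:
    std::vector<CountedEvent> _events;
};

// Returns true if every event ended up signaled exactly once.
bool orderedShutdown(std::size_t numClients) {
    StrictExecutor executor(numClients);
    std::vector<std::thread> clients;

    // Even-numbered clients signal their own event; odd ones leave it to shutdown.
    for (std::size_t i = 0; i < numClients; ++i) {
        clients.emplace_back([&executor, i] {
            if (i % 2 == 0) {
                executor.signal(i);
            }
        });
    }

    // Step 1: join every client thread so no signal() can race with shutdown.
    for (auto& client : clients) {
        client.join();
    }

    // Step 2: only now shut the executor down.
    executor.shutdown();

    for (std::size_t i = 0; i < numClients; ++i) {
        if (executor.signalCount(i) != 1) {
            return false;
        }
    }
    return true;
}
```

Swapping steps 1 and 2 in this sketch reintroduces the race described above, which is why this option requires an unconditional join rather than a join with a timeout.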

I believe this problem is the cause of SERVER-25497.

AC: Choose one of the two proposed solutions (or a potential third?).



 Comments   
Comment by Githook User [ 18/May/22 ]

Author: Amirsaman Memaripour (samanca) <amirsaman.memaripour@mongodb.com>

Message: SERVER-33522 Skip `TaskExecutor::signalEvent` when shutdown is in progress
Branch: master
https://github.com/mongodb/mongo/commit/48610db5accc050ca732f5533bd99bd2bebcf59f

Comment by Max Hirschhorn [ 08/Apr/22 ]

Reopening this ticket because server crashes at shutdown are undesirable and lead to scary-looking backtraces in server logs of our end users.

Comment by Lauren Lewis (Inactive) [ 24/Feb/22 ]

We haven’t heard back from you for at least one calendar year, so this issue is being closed. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by Ruoxin Xu [ 15/Oct/20 ]

matthew.tretin The patch for SERVER-25497 does not change TaskExecutor. Instead, it changes AsyncResultsMerger (ARM) to use std::promise/future rather than the TaskExecutor 'event' mechanism. That avoids the race when the underlying TaskExecutor and the ARM are shut down simultaneously, and it should fix the problem in SERVER-25497. This ticket, IIUC, is about the potential multiple calls to TaskExecutor::signalEvent(), which is possibly a bug in TaskExecutor.
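A rough sketch of the general idea, not the actual SERVER-25497 patch: ToyMerger and its members are hypothetical. The component owns a std::promise and fulfills it itself, so completion never passes through an executor event that executor shutdown could also signal.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>

// Hypothetical component that reports "all outstanding requests finished"
// through its own promise rather than through an executor event.
class ToyMerger {
public:
    explicit ToyMerger(int outstanding) : _outstanding(outstanding) {}

    // Called once per finished request, possibly from a callback thread.
    void onRequestDone() {
        std::lock_guard<std::mutex> lk(_mutex);
        --_outstanding;
        _fulfillIfDone();
    }

    // Requests cleanup; the returned future becomes ready once no requests
    // remain. Must be called at most once, since a promise yields one future.
    std::future<void> kill() {
        std::lock_guard<std::mutex> lk(_mutex);
        _killed = true;
        std::future<void> done = _done.get_future();
        _fulfillIfDone();
        return done;
    }

private:
    // Caller must hold _mutex. The _fulfilled flag guarantees a single set_value().
    void _fulfillIfDone() {
        if (_killed && _outstanding == 0 && !_fulfilled) {
            _fulfilled = true;
            _done.set_value();
        }
    }

    std::mutex _mutex;
    int _outstanding;
    bool _killed = false;
    bool _fulfilled = false;
    std::promise<void> _done;
};
```

Because only ToyMerger ever touches the promise, an executor shutting down at the same time has nothing of its own to signal a second time.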

Comment by Matthew Tretin (Inactive) [ 12/Oct/20 ]

ruoxin.xu Do you think your CR for SERVER-25497 will take care of it? 

Comment by Andy Schwerin [ 28/Feb/18 ]

Oh, yeah, when I built the TaskExecutor, I didn't really consider Events that would be signaled from outside of Callbacks while working out the shutdown logic.
