[SERVER-48490] Destroy the ServiceContext on clean shutdown Created: 29/May/20  Updated: 08/Jan/24  Resolved: 22/Jun/21

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: 4.4.0-rc7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Backlog - Service Architecture
Resolution: Duplicate Votes: 0
Labels: thread-sanitizer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-49759 Destruct AuthorizationManager on shut... Closed
Related
related to SERVER-49762 Destroy the ServiceContext on clean s... Closed
related to SERVER-57984 Complete TODO listed in SERVER-48490 Closed
Assigned Teams:
Service Arch
Operating System: ALL
Backport Requested:
v5.0
Participants:
Linked BF Score: 11

 Description   

The MongoD, MongoS and DBTest processes create a ServiceContext at initialisation time, but never destroy it. This means that many internal services, which hang off the ServiceContext and use its construction/destruction actions, never actually get shut down.

In the case of non-SAN builds, this is expected and valuable (because fully unwinding all client stacks takes a long time, and end users don't benefit from waiting), but SAN builds already unwind stacks, so also destroying the ServiceContext is completely acceptable.



 Comments   
Comment by Githook User [ 12/Aug/21 ]

Author:

{'name': 'George Wangensteen', 'email': 'george.wangensteen@mongodb.com', 'username': 'gewa24'}

Message: SERVER-57984 Update references to SERVER-48490 to SERVER-52413 which marks completion of global clean shutdown work
Branch: master
https://github.com/mongodb/mongo/commit/d9f51262487d11e03272bb403efae2b6aff114d7

Comment by Ratika Gandhi [ 22/Jun/21 ]

We will take care of this in the larger project for global clean shutdown (PM-774), so we are closing this ticket.

Comment by Githook User [ 17/Aug/20 ]

Author:

{'name': 'Andrew Morrow', 'email': 'acm@mongodb.com', 'username': 'acmorrow'}

Message: SERVER-48490 Suppress thread_leak errors under TSAN
Branch: master
https://github.com/mongodb/mongo/commit/99e85a18ec8a12c5988fe03ad83e0f064cc6dd6f
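For context on the commit above: rather than making shutdown clean, it suppresses TSAN's thread-leak reports. TSAN offers two standard knobs for this (the exact mechanism used by the commit isn't shown here, and the pattern below is illustrative):

```shell
# Option 1: disable thread-leak reporting entirely via a runtime flag.
TSAN_OPTIONS="report_thread_leaks=0" ./mongod --version

# Option 2: a finer-grained suppressions file. "thread:" rules match
# thread-leak reports whose creation stack contains the given substring.
cat > tsan.suppressions <<'EOF'
thread:ServiceExecutor
EOF
TSAN_OPTIONS="suppressions=$(pwd)/tsan.suppressions" ./mongod --version
```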

Comment by Ryan Egesdahl (Inactive) [ 29/Jul/20 ]

daniel.gottlieb Sorry, I should have been more clear - my point wasn't about performance, but that we're relying on undefined behavior. I mentioned performance because the choices in this test are to either accept the undefined behavior for the sake of test performance or to do it the "right way" for the test; either choice has tradeoffs. I agree with acm that we should not be doing this in the codebase in any case, for the reasons he stated. I think the real point of this test is to demonstrate why that is, and to get us thinking about how we can make progress towards undoing the history that got us here and what (if anything) on our list of priorities requires it.

Comment by Andrew Morrow (Inactive) [ 29/Jul/20 ]

I'll add that while there may be good technical arguments in favor of doing a fast-path shutdown in production, it should be possible to do a clean shutdown by calling exit, or, better, returning from main. The fact that the mongodb codebase is unable to do so is a defect, and our long reliance on invoking _exit to work around that defect has only allowed that defect to metastasize unfettered throughout the codebase. It is now pervasive. Any work that moves us in the direction of being closer to achieving a clean exit is laudable. We will not, for instance, be able to serially instantiate multiple server instances within one address space without first being able to repeatedly create, run, and destroy such servers without terminating the process. Similarly, we will not be able to concurrently instantiate multiple independent server instances in an address space without first decoupling them from global mutable state.

Comment by Andrew Morrow (Inactive) [ 29/Jul/20 ]

daniel.gottlieb - It isn't clearly stated here, but the scope of this is to do a more complete shutdown when running under one of the sanitizers. In those cases, the cost of shutdown should not be an issue.

Comment by Daniel Gottlieb (Inactive) [ 29/Jul/20 ]

Presumably if we're going to clean up service contexts and their decorations on shutdown, we should be doing the same for WT?

Production releases tell WT to leak memory on clean shutdown and have done so for as long as I can remember. To free memory, WT has to do a lot of pointer chasing. It might be easier to measure the effects of flipping that configuration on a system that's paged in a lot of data (10s of gigabytes worth of 1KB documents) to get a measurement. I believe that setting exists because shutdown was found to be slow, but if someone wants to make a case that it's not slow they should certainly feel free to remeasure.

Comment by Ryan Egesdahl (Inactive) [ 29/Jul/20 ]

mira.carey@mongodb.com You should be aware that in the case of Linux (and really any POSIX system), processes are required to clean up their messes before they exit. If we just do a quick exit, there's no guarantee on when the OS will reap resources. For Linux, that's how overcommit is designed to work, in fact - it's supposed to be lazy, and the memory will likely only be reclaimed once enough pressure is placed on the system, at which point it may start swapping pages out to disk before dropping the LRU ones. Opportunistic memory reclaim has been a topic of conversation within the Linux kernel community for years now, but it has never really made much headway: https://lwn.net/Articles/785633/

Do we have data on the claim that we save significant amounts of time and/or other resources during restarts with this approach? From a systems perspective, I strongly recommend against continuing the drop-all-and-die practice we've been doing unless there is some serious savings to be had, and even then I would consider that to be a problem to fix. This is undefined behavior in the *NIX world. I don't know about the Windows world, but I can't help but think it is there as well.

Comment by Mira Carey [ 29/May/20 ]

I'm not sure we really want to do this. We've long had a stance of taking advantage of quickExit to shut down the server (assuming it's going to be faster to restart a process by letting the OS reclaim memory and file handles, rather than doing it ourselves), and destroying the ServiceContext during a normal exit implies waiting for much, much longer than we do today. Specifically, we'll have to wait for every outstanding operation context to be destroyed, which will require us to unwind all user stacks.

One area I might consider this for would be address sanitizer builds, which I believe do at least attempt to shut down the service executor (which waits on unwinding all user stacks). Or perhaps under a special flag.

I'm just a little skeptical of the value in doing this, for our end users, in all production binaries all of the time.

Comment by Kaloian Manassiev [ 29/May/20 ]

I am going to take a first stab at investigating how to improve the shutdown procedure.

Generated at Thu Feb 08 05:17:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.