[SERVER-48490] Destroy the ServiceContext on clean shutdown Created: 29/May/20 Updated: 08/Jan/24 Resolved: 22/Jun/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Internal Code |
| Affects Version/s: | 4.4.0-rc7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Backlog - Service Architecture |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | thread-sanitizer | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Assigned Teams: |
Service Arch
|
||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||
| Backport Requested: |
v5.0
|
||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||
| Linked BF Score: | 11 | ||||||||||||||||||||||||||||
| Description |
|
The MongoD, MongoS and DBTest processes create a ServiceContext at initialisation time, but never destroy it. This means a lot of internal services, which hang off the ServiceContext and use the Construction/Destruction actions do not actually get shut down. In the case of non-SAN builds, this is expected and valuable (because the time it takes to fully unwind all client stacks is long, and end users don't benefit from waiting), but for SAN builds we already unwind stacks, so also destroying the service context is completely acceptable. |
| Comments |
| Comment by Githook User [ 12/Aug/21 ] |
|
Author: {'name': 'George Wangensteen', 'email': 'george.wangensteen@mongodb.com', 'username': 'gewa24'}Message: |
| Comment by Ratika Gandhi [ 22/Jun/21 ] |
|
We will take care of this in the larger project for global clean shutdown (PM-774) so we are closing this ticket. |
| Comment by Githook User [ 17/Aug/20 ] |
|
Author: {'name': 'Andrew Morrow', 'email': 'acm@mongodb.com', 'username': 'acmorrow'}Message: |
| Comment by Ryan Egesdahl (Inactive) [ 29/Jul/20 ] |
|
daniel.gottlieb Sorry, I should have been more clear - my point wasn't about performance but to point out the fact that we're relying on undefined behavior. I mentioned performance because the choices in this test are to either accept the undefined behavior for the sake of test performance or to do it the "right way" for the test. Either choice has tradeoffs. I agree with acm that we should not be doing this in the codebase in any case for the reasons he stated, and I think the real point of this test is to demonstrate why that is the case and to get us thinking about how we can make progress towards undoing the history that got us here and what (if anything) on our list of priorities requires it. |
| Comment by Andrew Morrow (Inactive) [ 29/Jul/20 ] |
|
I'll add that while there may be good technical arguments in favor of doing a fast-path shutdown in production, it should be possible to do a clean shutdown by calling exit, or, better, returning from main. The fact that the mongodb codebase is unable to do so is a defect, and our long reliance on invoking _exit to work around that defect has only allowed that defect to metastasize unfettered throughout the codebase. It is now pervasive. Any work that moves us in the direction of being closer to achieving a clean exit is laudable. We will not, for instance, be able to serially instantiate multiple server instances within one address space without first being able to repeatedly create, run, and destroy such servers without terminating the process. Similarly, we will not be able to concurrently instantiate multiple independent server instances in an address space without first decoupling them from global mutable state. |
| Comment by Andrew Morrow (Inactive) [ 29/Jul/20 ] |
|
daniel.gottlieb - It isn't clearly stated here, but the scope of this is to do a more complete shutdown when running under one of the sanitizers. In those cases, the cost of shutdown should not be an issue. |
| Comment by Daniel Gottlieb (Inactive) [ 29/Jul/20 ] |
|
Presumably if we're going to cleanup service contexts and their decorations on shutdown, we should be doing the same for WT? Production releases tell WT to leak memory on clean shutdown and have done so for as long as I can remember. To free memory, WT has to do a lot of pointer chasing. It might be easier to measure the effects of flipping that configuration on a system that's paged in a lot of data (10s of gigabytes worth of 1KB documents) to get a measurement. I believe that setting exists because shutdown was found to be slow, but if someone wants to make a case that it's not slow they should certainly feel free to remeasure. |
| Comment by Ryan Egesdahl (Inactive) [ 29/Jul/20 ] |
|
mira.carey@mongodb.com You should be aware that in the case of Linux (and really any POSIX system), processes are required to clean up their messes before they exit. If we just do a quick exit, there's no guarantee on when the OS will reap resources. For Linux, that's how overcommit is designed to work, in fact - it's supposed to be lazy, and the memory will likely only be reclaimed once enough pressure is placed on the system, at which point it may start swapping pages out to disk before dropping the LRU ones. Opportunistic memory reclaim has been a topic of conversation within the Linux kernel community for years now, but it has never really made much headway: https://lwn.net/Articles/785633/ Do we have data on the claim that we save significant amounts of time and/or other resources during restarts with this approach? From a systems perspective, I strongly recommend against continuing the drop-all-and-die practice we've been doing unless there is some serious savings to be had, and even then I would consider that to be a problem to fix. This is undefined behavior in the *NIX world. I don't know about the Windows world, but I can't help but think it is there as well. |
| Comment by Mira Carey [ 29/May/20 ] |
|
I'm not sure we really want to do this. We've long had a stance of taking advantage of quickExit to shutdown the server (assuming it's going to be faster to restart a process by letting the os reclaim memory and file handles, rather than doing it ourselves), and destroying the ServiceContext during a normal exit implies waiting for much much longer than we do today. Specifically, we'll have to wait for every outstanding operation context to be destroyed, which will require us to unwind all user stacks. One area I might consider this for would be in address sanitizer builds, which I believe do at least attempt to shutdown the service executor (which waits on unwinding all user stacks). Or perhaps under a special flag. I'm just a little skeptical of the value in doing this, for our end users, in all production binaries all of the time |
| Comment by Kaloian Manassiev [ 29/May/20 ] |
|
I am going to take a first stab at investigating how to improve the Shutdown procedure. |