[SERVER-13811] Deal better/Fail more gracefully when mongoD runs out of disk space Created: 01/May/14 Updated: 13/Dec/22
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Stability, Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Osmar Olivo | Assignee: | Brian Lane |
| Resolution: | Unresolved | Votes: | 8 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
Currently, when a mongod process runs out of disk space and fails to preallocate a file or write to the journal, it terminates the server process. This is a difficult position to recover from, because a remove operation, the obvious way to reclaim space, will itself fail. Operations that write to disk temporarily, such as external sorts or temporary aggregation results, also run into problems.

A more graceful approach would be to let us cap mongod's disk utilization at some threshold short of a full disk, so that cleanup and stabilization of the system are still possible. Something like "stop accepting writes (other than removes) if less than 10% (or some number of GB) of disk space is available" or "if preallocation of the final data file fails due to lack of space (2GB), stop accepting writes aside from removes" would be much more graceful. This would of course mean that $out and external sorts should fail as well, but it would spare us all the other issues associated with a full disk.

There are edge cases to consider. If a secondary hits this threshold, it can no longer replicate, so it should be marked as down or unavailable with respect to the quorum (which I believe already happens). But then how do we process cleanup if it can't replicate the removes? In situations where a secondary runs out of disk before a primary, we will just have to increase capacity or do a full resync. For the general case, though, this would be a huge win, whether or not the threshold is configurable.
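As an illustration of the kind of guard the description proposes (this is not MongoDB's actual implementation or API), here is a minimal C++17 sketch that checks free space on the dbPath volume before admitting an operation. The names kMinFreeBytes, OpKind, and allowOperation are hypothetical, invented for this example; the ticket only specifies the behavior: below some floor, keep serving reads and removes but reject inserts and updates.

```cpp
#include <cstdint>
#include <filesystem>
#include <string>
#include <system_error>

namespace sketch {

// Hypothetical floor; the ticket suggests "10% or some number of GB".
constexpr std::uintmax_t kMinFreeBytes = 2ULL * 1024 * 1024 * 1024;  // 2 GB

enum class OpKind { kInsert, kUpdate, kRemove, kRead };

// Decide whether an operation should be accepted, given the free space
// currently available on the volume that holds dbPath.
bool allowOperation(const std::string& dbPath, OpKind op) {
    std::error_code ec;
    const std::filesystem::space_info info = std::filesystem::space(dbPath, ec);
    if (ec || info.available < kMinFreeBytes) {
        // Below the floor (or unable to check): keep serving reads and
        // removes so operators can reclaim space, but refuse inserts and
        // updates. Per the ticket, $out and external sorts would fail the
        // same way, since they also write to disk.
        return op == OpKind::kRead || op == OpKind::kRemove;
    }
    return true;  // Enough headroom: accept everything.
}

}  // namespace sketch
```

A percentage-based variant could compare info.available against a fraction of info.capacity instead of a fixed byte count, which matches the "less than 10%" phrasing above.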
| Comments |
| Comment by Steven Vannelli [ 10/May/22 ] |
Moving this ticket to the Backlog and removing the "Backlog" fixVersion as per our latest policy for using fixVersions.
| Comment by Kyle Mertz [ 24/Sep/18 ] |
I agree with this. The system shouldn't stop working when you run out of space. Writes should obviously fail, but reads and deletes should still be accepted; causing an outage when one isn't necessary is bad design. Consider how Oracle handles its archiver error when the disk fills up: read and delete operations are still allowed, while inserts and updates fail.
| Comment by Geert Bosch [ 01/Dec/15 ] |
Example backtrace of running out of disk space with WiredTiger as the storage engine.