[SERVER-9552] when replica set member has full disk, step down to (sec|rec)? Created: 03/May/13 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Diagnostics, Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | John Morales | Assignee: | Backlog - Replication Team |
| Resolution: | Unresolved | Votes: | 10 |
| Labels: | elections | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | 64-bit Linux, server 2.4.x, replica set |
| Issue Links: | |
| Assigned Teams: | Replication |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
When a replica set member runs into disk space issues, it does not shut down, nor, in the case of a primary, step down. Instead, the server periodically checks whether enough space has been freed to continue. As a result, the only indications of a problem are user asserts raised by additional writes on the primary, and likely some replication lag on secondaries. From the perspective of rs.status() and db.serverStatus(), everything looks fine (apart from those introduced asserts and lag). Some options mentioned in discussion with the server team:
Bonus: it would be great if there were an explicit state/status change that MMS could pick up and report. The last option should work for that. |
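To illustrate the idea behind the step-down option, here is a minimal sketch of the decision logic an external watchdog (or the server itself) might use. This is a hypothetical illustration, not actual mongod behavior; the dbpath, threshold value, and function names are all assumptions.

```python
import shutil

# Assumed threshold: step down once less than 1 GiB is free (hypothetical value).
STEP_DOWN_THRESHOLD_BYTES = 1 * 1024 ** 3

def should_step_down(free_bytes, threshold=STEP_DOWN_THRESHOLD_BYTES):
    """Return True when free space on the data volume has dropped below the threshold."""
    return free_bytes < threshold

def check_dbpath(dbpath="/var/lib/mongodb"):
    """Check the volume holding the dbpath (path is an assumed default)."""
    usage = shutil.disk_usage(dbpath)
    return should_step_down(usage.free)
```

A cron job or monitoring agent could run a check like this and, when it returns True for a primary, issue `rs.stepDown()` via the shell, giving MMS an explicit state change to report.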
| Comments |
| Comment by Henrik Ingo (Inactive) [ 08/Aug/17 ] |
|
Is this a duplicate of |
| Comment by Anne Moroney [ 05/May/14 ] |
|
1. Isn't it cheap and easy to "Add warning message to [startup] warning log"? Can't we please at least have that much of a solution in a patch release? There could be a warning in the regular log as well, perhaps throttled to one per ten minutes or so if necessary (I don't know the code). 2. This would also be possible to fix if there were an option to exclude databases or collections from balancing, right? (see http://www.codejuggle.dj/facts-to-know-about-mongodb/ ) |
| Comment by Ian Bentley [ 01/Oct/13 ] |
|
Failover also doesn't occur when an EBS volume is unmounted from a running AWS instance. This can cause a replica set to have a primary that accepts no writes, but won't step down. |
| Comment by Daniel Watrous [ 30/May/13 ] |
|
This just bit us again in the form of intermittent failures. Reads succeeded except when they were routed to the secondary that was out of disk. Hopefully this implementation also removes such a node from secondary status, since no writes can be replicated to it when it is out of disk space. |