[SERVER-9552] when replica set member has full disk, step down to (sec|rec)? Created: 03/May/13  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Diagnostics, Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: John Morales Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 10
Labels: elections
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

64-bit Linux, server 2.4.x, replica set


Issue Links:
Depends
Duplicate
is duplicated by SERVER-10634 Failover doesn't occur on disk full a... Closed
Related
related to SERVER-3759 filesystem ops may cause termination ... Closed
related to SERVER-22971 Operations on some sharded collection... Closed
related to SERVER-17230 Replica set Primary should step down ... In Progress
is related to SERVER-14139 Disk failure on one node can (eventua... Closed
Assigned Teams:
Replication
Participants:
Case:

 Description   

When a replica set member runs out of disk space, it neither shuts down nor, if it is the primary, steps down. Instead, the server periodically checks whether enough space has been freed for it to continue.

As a result, the only indication of a problem is that additional writes cause user asserts on the primary, and replication lag likely builds up on secondaries. From the perspective of rs.status() and db.serverStatus(), everything looks fine (apart from the asserts/lag that are introduced).

Some options mentioned in discussion with server team:

  • Have the replica set member step down (if primary)
  • Have the replica set member enter maintenance state (until disk space is available)
  • Add a warning message to the [startup]warning log

Bonus: it would be great if there were an explicit state/status change that could be picked up and reported by MMS. The last option should work for that.
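
None of these options exist in the server today. Until one does, an external watchdog can approximate the first option. The following is a minimal sketch, assuming pymongo, a hypothetical dbPath of /var/lib/mongodb, and an arbitrary 5 GB free-space threshold; it illustrates the idea and is not server behaviour.

    import shutil
    import time

    from pymongo import MongoClient

    DBPATH = "/var/lib/mongodb"      # hypothetical dbPath of the local mongod
    MIN_FREE_BYTES = 5 * 1024 ** 3   # hypothetical threshold: 5 GB

    client = MongoClient("mongodb://localhost:27017")

    while True:
        free = shutil.disk_usage(DBPATH).free
        if free < MIN_FREE_BYTES:
            # isMaster works on both old and new servers (renamed "hello" in 5.0).
            if client.admin.command("isMaster").get("ismaster"):
                try:
                    # Ask the primary to step down for 5 minutes so a healthy
                    # secondary can be elected.
                    client.admin.command("replSetStepDown", 300)
                except Exception:
                    # mongod drops connections when the step-down succeeds,
                    # so a network error here is expected.
                    pass
        time.sleep(60)
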



 Comments   
Comment by Henrik Ingo (Inactive) [ 08/Aug/17 ]

Is this a duplicate of SERVER-29947? If not, why not? (Admittedly, I just looked at these on a headline basis.)

Comment by Anne Moroney [ 05/May/14 ]

1. Isn't it cheap and easy to "Add warning message to [startup]warning log"? Can't we please at least have that much of a solution in a patch upgrade? There could be a warning in the regular log as well, perhaps throttled to one per ten minutes or so if that is necessary (I don't know the code).

2. This would also be possible to fix if there were an option to exclude databases or collections from balancing, right? (See http://www.codejuggle.dj/facts-to-know-about-mongodb/.)
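
For what it's worth, the throttling described in point 1 is cheap to express. The sketch below is a generic Python illustration of "at most one warning per ten minutes", not mongod's actual logging code; names such as warn_low_disk are made up for the example.

    import logging
    import time

    log = logging.getLogger("disk-monitor")
    _last_warn = 0.0
    WARN_INTERVAL = 600  # seconds; "one per ten minutes" as suggested above

    def warn_low_disk(free_bytes):
        """Emit a low-disk warning, but at most once per WARN_INTERVAL."""
        global _last_warn
        now = time.time()
        if now - _last_warn >= WARN_INTERVAL:
            log.warning("replica set member low on disk: %d bytes free", free_bytes)
            _last_warn = now
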

Comment by Ian Bentley [ 01/Oct/13 ]

Failover also doesn't occur when an EBS volume is unmounted from a running AWS instance. This can cause a replica set to have a primary that accepts no writes, but won't step down.

Comment by Daniel Watrous [ 30/May/13 ]

This just bit us again in the form of intermittent failures: reads succeeded unless they were routed to the secondary that was out of disk. Hopefully this implementation also removes the node from secondary status, since no writes can be replicated to it when it is out of disk space.
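
The "remove the node from secondary status" behaviour requested here roughly corresponds to the maintenance-state option from the description. A hand-driven sketch with pymongo follows; the host name is hypothetical, and triggering this automatically on disk pressure is the sketch's assumption, not something the server does today.

    from pymongo import MongoClient

    # Connect directly to the out-of-disk secondary (hypothetical host name).
    member = MongoClient("mongodb://secondary-host:27017", directConnection=True)

    # Put the member into maintenance mode (RECOVERING), so drivers stop
    # routing secondary reads to it.
    member.admin.command("replSetMaintenance", True)

    # ... free up disk space ...

    # Return the member to SECONDARY state once space is available.
    member.admin.command("replSetMaintenance", False)
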
