- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Diagnostics, Replication
- Environment: 64-bit Linux, server 2.4.x, replica set
- Replication
When a replica set member runs out of disk space, it neither shuts down nor, if it is the primary, steps down. Instead, the server periodically checks whether enough space has been freed to continue.
As a result, the only indications of a problem are that additional writes cause user asserts on the primary and that secondaries are likely to develop some replication lag. From the perspective of rs.status() and db.serverStatus(), however, everything looks fine (apart from any resulting asserts or lag).
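Until the server reports this condition explicitly, a monitoring script could watch for the symptom described above. The following is a minimal sketch, assuming two successive db.serverStatus() results are available as Python dicts and that the `asserts.user` counter is the signal of interest. The function names and the threshold of 10 are hypothetical, chosen only for illustration.

```python
def user_assert_delta(prev_status, curr_status):
    """Return how many user asserts occurred between two serverStatus samples."""
    prev = prev_status["asserts"]["user"]
    curr = curr_status["asserts"]["user"]
    # The counter can go backwards after a restart or rollover;
    # in that case, treat the current value as a fresh count.
    return curr - prev if curr >= prev else curr


def looks_like_silent_write_failure(prev_status, curr_status, threshold=10):
    """Flag a sample pair whose user-assert growth meets a (hypothetical) threshold."""
    return user_assert_delta(prev_status, curr_status) >= threshold
```

A rising user-assert count alone does not prove a full disk, so a check like this would only be a hint to look at free space on the affected member.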
Some options mentioned in discussion with the server team:
- Have the replica set member step down (if primary)
- Have the replica set member enter maintenance status (until disk space is available)
- Add a warning message to the [startup]warning log
Bonus: it would be great if there were an explicit state/status change that could be picked up and reported by MMS. The last option should work for that.
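One way the options above might be combined is sketched below. This is not server code: it is an illustrative policy written as a standalone Python function, where the 1 GiB threshold, the action names, and the function names are all assumptions made for the example. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil

# Hypothetical threshold: treat less than 1 GiB free as "disk space issue".
MIN_FREE_BYTES = 1 * 1024**3


def free_bytes(path):
    """Return the free space, in bytes, on the filesystem containing path."""
    return shutil.disk_usage(path).free


def proposed_actions(free, is_primary, min_free=MIN_FREE_BYTES):
    """Map a free-space reading to the actions discussed in this ticket."""
    if free >= min_free:
        return []  # Enough space: no state change needed.
    # Always surface the problem in a log that tooling can read.
    actions = ["log_warning"]
    # A primary steps down; any other member enters maintenance status.
    actions.append("step_down" if is_primary else "enter_maintenance")
    return actions
```

For example, `proposed_actions(free_bytes("/data/db"), is_primary=True)` would return an empty list on a healthy member; the `/data/db` path here is only a placeholder for the member's data directory.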
- is duplicated by
  - SERVER-10634 Failover doesn't occur on disk full and other non-crash errors (Closed)
- is related to
  - SERVER-14139 Disk failure on one node can (eventually) block a whole cluster (Closed)
- related to
  - SERVER-3759 filesystem ops may cause termination when no space left on device (Closed)
  - SERVER-22971 Operations on some sharded collections fail with bogus error (Closed)
  - SERVER-17230 Replica set Primary should step down if Out of file descriptors (In Progress)