[SERVER-10634] Failover doesn't occur on disk full and other non-crash errors
Created: 28/Aug/13  Updated: 10/Dec/14  Resolved: 29/Aug/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.6
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Henrik Ingo (Inactive)
Assignee: Unassigned
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Amazon Linux, official MongoDB.org packages


Issue Links:
Duplicate
duplicates SERVER-9552 when replica set member has full disk... Backlog
Operating System: ALL
Steps To Reproduce:

Set up a 3-node replica set. Use separate servers, or at least a different data disk for each node.

Using the following script, insert data into a collection:

#!/bin/bash

# Insert one timestamped document per iteration until interrupted.
while true
do
    TIME=$(date)
    echo "db.test.insert({ time: '$TIME' })" | mongo test
done

Use a tool like dd if=/dev/zero of=/tmp/eatspace bs=1024 count=1024 to fill the disk. The values shown write only 1 MiB, so increase count until the disk holding the data files is full. Note that even after the disk is full, MongoDB will continue to successfully insert more data until the last 2 GB data file becomes full.
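One way to size the dd run is to read the free space from df first. The following is a minimal sketch, not part of the original report: free_kb and fill_disk_command are hypothetical helper names, and the directory passed in is assumed to be on the same filesystem as the mongod data files. The script only prints the dd command so that it can be reviewed before it is run.

```shell
#!/bin/bash
# Sketch: size the dd run from the free space on the target filesystem.
# free_kb and fill_disk_command are hypothetical helper names.

# Print the space available (in 1024-byte blocks) on the filesystem holding $1.
free_kb() {
    df -Pk "$1" | awk 'NR==2 { print $4 }'
}

# Print (do not run) a dd command that would consume the available space.
fill_disk_command() {
    dir="$1"
    echo "dd if=/dev/zero of=$dir/eatspace bs=1024 count=$(free_kb "$dir")"
}

fill_disk_command /tmp
```

Note that df reports the space available to unprivileged users, so a run as root may leave some reserved blocks unused.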

What actually happens:

Observe the following errors from the insert:

MongoDB shell version: 2.4.6
connecting to: test
Can't take a write lock while out of disk space
bye

And in the log:

Wed Aug 28 11:08:07.306 [FileAllocator] allocating new datafile /var/lib/mongo/test.2, filling with zeroes...
Wed Aug 28 11:08:07.307 [FileAllocator] FileAllocator: posix_fallocate failed: errno:28 No space left on device falling back
Wed Aug 28 11:08:07.307 [FileAllocator] error: failed to allocate new file: /var/lib/mongo/test.2 size: 268435456 failure creating new datafile; lseek failed for fd 22 with errno: errno:2 No such file or directory. will try again in 10 seconds

What should happen

When failing to allocate a new datafile, the primary should step down and allow another node to become primary. In addition, it should go into a state where it cannot become primary again (for example, if it has a high priority) until the problem has been fixed.

Workarounds

On noticing the failure, the DBA must call rs.stepDown() or shut down the failing mongod process. rs.stepDown() could also be called automatically from an application that receives a disk-full or similar error message. In addition, it might make sense to set the node to hidden or priority=0 until the problem is fixed.
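The automatic variant of this workaround could look roughly like the sketch below. It is an illustration under assumptions, not a tested procedure: is_disk_full and step_down_if_disk_full are hypothetical helper names, the error text is copied from the shell output shown above, and MONGO_CMD is a hypothetical override that defaults to a mongo shell connected to the local mongod.

```shell
#!/bin/bash
# Sketch: step the primary down when an insert reports the disk-full error.
# DISK_FULL_MSG is copied from the shell output in this report;
# is_disk_full and step_down_if_disk_full are hypothetical helper names.

DISK_FULL_MSG="Can't take a write lock while out of disk space"

# Return 0 if the given shell output contains the disk-full message.
is_disk_full() {
    case "$1" in
        *"$DISK_FULL_MSG"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Send rs.stepDown() to mongod if the output shows the disk-full error.
# MONGO_CMD can be overridden, for example to point at a specific host.
step_down_if_disk_full() {
    if is_disk_full "$1"; then
        echo "rs.stepDown()" | ${MONGO_CMD:-mongo admin}
    fi
}
```

A caller would capture the output of an insert, for example OUTPUT=$(echo "db.test.insert({ time: 'now' })" | mongo test 2>&1), and then run step_down_if_disk_full "$OUTPUT". Matching on the message text is fragile, since the wording may differ between server versions.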

Participants:

 Description   

Summary:

Given a replica set with 3 or more nodes, if the PRIMARY node is shut down, crashes, or becomes unavailable due to network issues, the other nodes will elect a new PRIMARY and automatic failover occurs within seconds.

However, in other error situations where the mongod process remains alive and continues to respond to heartbeats, failover will not happen, but write operations to the PRIMARY will fail, rendering the cluster unusable and de facto unavailable (for writes).

An example of such an error situation is a disk error such as a full disk.



 Comments   
Comment by Daniel Pasette (Inactive) [ 29/Aug/13 ]

duplicate of SERVER-9552

Generated at Thu Feb 08 03:23:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.