[SERVER-10634] Failover doesn't occur on disk full and other non-crash errors Created: 28/Aug/13 Updated: 10/Dec/14 Resolved: 29/Aug/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Henrik Ingo (Inactive) | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Amazon Linux, official MongoDB.org packages |
||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Steps To Reproduce: | Set up a 3-node replica set. Use separate servers, or at least a different data disk for each node. Using the following script, insert data into a collection:
Use a tool like dd if=/dev/zero of=/tmp/eatspace bs=1024 count=1024 to fill the disk (as written this writes only 1 MiB, so increase count until the disk is full). Note that even after the disk is full, MongoDB will continue to insert data successfully until the last 2 GB data file becomes full.
What actually happens: Observe the following errors from the insert:
And in the log:
What should happen: When it fails to allocate a new data file, the primary should step down and allow another node to become primary. In addition, it should enter a state in which it cannot become primary again (for example, if it has a high priority) until the problem has been fixed.
Workarounds: On noticing the failure, the DBA must call rs.stepDown() or shut down the failing mongod process. rs.stepDown() could also be called automatically from an application that receives a disk-full or similar error message. In addition, it might make sense to set the node to hidden or priority=0 until the problem is fixed. |
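The application-side workaround (calling a step-down automatically when a write fails with a disk-full style error) could look roughly like the sketch below. It is only an illustration under stated assumptions: the ticket does not quote the exact error text, so the marker strings are placeholders, and `guarded_write`, `is_disk_full_error` and the `step_down` callback are hypothetical names, not part of MongoDB or any driver.

```python
from typing import Callable, Iterable

# Assumption: substrings that identify a disk-full style failure. The ticket
# does not show the real error message, so these are illustrative placeholders
# ("No space left on device" is the usual Linux ENOSPC text).
DISK_ERROR_MARKERS = ("no space left on device", "out of disk space")


def is_disk_full_error(message: str,
                       markers: Iterable[str] = DISK_ERROR_MARKERS) -> bool:
    """Return True if the error message looks like a disk-full failure."""
    text = message.lower()
    return any(marker in text for marker in markers)


def guarded_write(write: Callable[[], None],
                  step_down: Callable[[], None]) -> bool:
    """Run write(); if it fails with a disk-full style error, call step_down().

    Returns True when the write succeeded and False when a step-down was
    requested. Any other exception is re-raised unchanged.
    """
    try:
        write()
        return True
    except Exception as exc:
        if not is_disk_full_error(str(exc)):
            raise
        # Ask the failing primary to give up its role so another node
        # can be elected.
        step_down()
        return False
```

With a driver such as PyMongo, `step_down` might be something like `lambda: client.admin.command("replSetStepDown", 60)`, where `client` is a connection to the current primary. The server may drop client connections when it steps down, so a real caller would probably also need to tolerate a connection error from that call.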
||||||||
| Participants: | |||||||||
| Description |
|
Summary: Given a replica set with 3 or more nodes, if the PRIMARY node is shut down, crashes, or becomes unavailable due to network issues, the other nodes will proceed to elect a new PRIMARY and automatic failover occurs within seconds. However, in other error situations where the mongod process remains alive and continues to respond to heartbeats, failover will not happen, but write operations to the PRIMARY will fail, rendering the cluster unusable and de facto unavailable (for writes). An example of such an error situation is a disk error such as a full disk. |
| Comments |
| Comment by Daniel Pasette (Inactive) [ 29/Aug/13 ] |
|
duplicate of SERVER-9552 |