[SERVER-14139] Disk failure on one node can (eventually) block a whole cluster Created: 03/Jun/14  Updated: 06/Dec/22  Resolved: 17/Jul/17

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Andrew Ryder (Inactive) Assignee: Backlog - Replication Team
Resolution: Duplicate Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-29947 Implement Storage Node Watchdog Closed
is duplicated by SERVER-15417 Arbiter didn't elect primary if OS is... Closed
is duplicated by SERVER-28422 Cluster stuck because replication hea... Closed
Related
related to SERVER-29980 Built-in hang detection diagnostics a... Closed
related to SERVER-9552 when replica set member has full disk... Backlog
Assigned Teams:
Replication
Operating System: ALL

 Description   

If a disk failure occurs in such a way as to block IO without returning (admittedly a rare occurrence), the affected mongod will never give up waiting for the IO to complete. Heartbeats are returned as normal, so other nodes will continue to trust it even though it is permanently dysfunctional.

A replica set or a sharded cluster can eventually be locked up until the single faulty node is identified and terminated.



 Comments   
Comment by Kelsey Schubert [ 17/Jul/17 ]

Hi all,

The work Ramón referenced in his previous comment has been completed under SERVER-29947, Implement Storage Node Watchdog. Therefore, I will be closing this ticket as a duplicate. Please review SERVER-29947 for details about the functionality provided to resolve this issue.

Kind regards,
Thomas

Comment by Ramon Fernandez Marina [ 23/Mar/17 ]

Under the same conditions as above (primary on an NFS drive mounted with -o hard,fg, NFS server killed while slow inserts are happening) but using 3.4.2 and PV1, I get the same outcome: the primary blocks and the replica set becomes unusable (no election is triggered).

I'll try with -o soft,bg and see what happens. Andy mentions oplog fetches, but here the secondaries are all caught up, so maybe we can try with secondaries that are behind.

EDIT: When I try with -o soft,bg the replica set ends up electing a new primary, but with -o hard,fg it doesn't seem to. Will take a closer look at the logs to compare.

EDIT #2: I think this is all explained by nfs(5):

soft / hard: determines the recovery behavior of the NFS client after an NFS request times out. If neither option is specified (or if the hard option is specified), NFS requests are retried indefinitely. If the soft option is specified, then the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application.

timeo: the time in deciseconds (tenths of a second) the NFS client waits for a response before it retries an NFS request. For NFS over TCP the default timeo value is 600 (60 seconds). The NFS client performs linear back-off: After each retransmission the timeout is increased by timeo up to the maximum of 600 seconds.

retrans: the number of times the NFS client retries a request before it attempts further recovery action. If the retrans option is not specified, the NFS client tries each request three times.

For my setup (NFS over TCP, everything else the default) I would have expected to see an election in 6 minutes (3 retries at 60, 120 and 180 secs) but it happened in 10 according to the logs. I'm going to say this is a rounding error and not investigate further because NFS is not what's under scrutiny here (see below).
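For reference, the expected timeout under those defaults can be sketched as follows (a rough back-of-the-envelope check using only the nfs(5) values quoted above, not an exact model of the NFS client):

# Rough estimate of when a soft mount would fail a request
# (NFS over TCP defaults: timeo=600 deciseconds, retrans=3).
timeo_s = 60
retrans = 3
# Linear back-off: each retry waits one timeo longer than the previous attempt.
waits = [timeo_s * (attempt + 1) for attempt in range(retrans)]
print(waits, sum(waits) / 60.0)  # [60, 120, 180] -> about 6 minutes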

While NFS is just an artifact to reproduce this behavior easily, for the purposes of this ticket what matters is whether a write request will block indefinitely (an event from which a primary can never recover, and which will prevent the replica set from accepting writes indefinitely unless the node is somehow taken out of the replica set) or whether the storage layer will eventually fail the request and return an error (which only cripples writes until that error is returned).

This confirms that neither PV1 nor WiredTiger address the issue, so I think we should keep the ticket open to consider becoming more resilient against these insidious storage layer problems.

Regards,
Ramón.

Comment by Geert Bosch [ 23/Mar/17 ]

There are a few situations in which locks can be held indefinitely:

  • A system call (such as one performing I/O) blocks indefinitely
  • A fsyncLock command is issued

In the latter case we change the lock granting protocol to "compatibleFirst", so that reads are not blocked behind writes. For ordinary reads we don't change this mode, however, so it is possible for an exclusive write request to get blocked behind a read, and then for all other reads to get blocked behind that write. In this specific scenario issuing fsyncLock should still be possible, as it requests a mode that is compatible with the read; that would unblock all readers.

If a global write (say, a database creation) blocks, there really is nothing we can do if the write cannot be aborted. This is not really storage engine specific. Of course the storage engine could throw a WriteConflictException or, better, some new exception that indicates a write failure, but that would probably be a fair bit of work to implement.
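For illustration, issuing fsyncLock from a driver looks roughly like this (a minimal sketch assuming pymongo and a mongod reachable on localhost; whether the command can actually be processed depends on the scenario above):

from pymongo import MongoClient

# Minimal sketch (assumes pymongo and mongod on localhost:27017).
# fsyncLock switches lock granting to "compatibleFirst", which can let
# readers queued behind a blocked exclusive write request proceed.
client = MongoClient("localhost", 27017)
client.admin.command("fsync", lock=True)   # fsyncLock
# ... readers queued behind the blocked write may now proceed ...
client.admin.command("fsyncUnlock")        # release the lock (MongoDB 3.2+)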

Comment by Ramon Fernandez Marina [ 22/Mar/17 ]

Thanks for commenting on this ticket, victorgp. I understand how uninterruptible sleep can be problematic in a replica set. I'd like to extend Andy's answer above on the complexity of the issue. The "right way(TM)" to address these issues in storage clusters is to use a technique called fencing:

Fencing is typically done automatically, by cluster infrastructure such as shared disk file systems, in order to protect processes from other active nodes modifying the resources during node failures. [...] Fencing is required because it is impossible to distinguish between a real failure and a temporary hang.

One common technique users can implement is STONITH, which requires specific knowledge of the systems to be isolated.

A similar option is to use a "watchdog" for each node that, upon failure detection, kills that node (as opposed to other nodes, as with STONITH). The pseudocode above outlines such a solution, and spells out that the update call has to be made on a connection with a timeout:

// set the socket timeout on the session to the maximum timeout your application can
// tolerate in a failure - do not set it too low, because the timeout may also be caused
// by normal usage spikes in your application.
set session socket timeout to timeout period

If the update call doesn't return, the watchdog knows to kill the mongod process (or to take it out of the replica set, for example by blocking network connectivity with iptables).

One could also implement a watchdog that uses fork(), where the child process runs the update call and the parent process waits for the child to return within a period of time; if the child process becomes blocked, the parent can kill the mongod.
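A minimal sketch of that second variant, using a child process for the update (hypothetical code, assuming pymongo; the URI, PID and timeout are supplied by the operator, and multiprocessing stands in for a raw fork()):

#!/usr/bin/env python
# Hypothetical sketch of the fork-style watchdog: a child process runs the
# upsert and the parent kills mongod if the child does not return in time.
import os
import signal
import sys
import time
from datetime import datetime
from multiprocessing import Process

import pymongo

def try_update(uri):
    # Runs in the child: upsert a heartbeat document into the local db,
    # which is not replicated (same idea as the pseudocode above).
    client = pymongo.MongoClient(uri)
    client.local.watchdog.update_one(
        {"_id": "watchdog"},
        {"$set": {"lastUpdate": datetime.utcnow()}},
        upsert=True,
    )

def watchdog(uri, mongod_pid, timeout_s):
    while True:
        child = Process(target=try_update, args=(uri,))
        child.start()
        child.join(timeout_s)
        if child.is_alive():
            # The update never returned: assume mongod is blocked on IO and
            # kill it so the rest of the replica set can elect a new primary.
            child.terminate()
            os.kill(mongod_pid, signal.SIGKILL)
            sys.exit(1)
        time.sleep(timeout_s)

if __name__ == '__main__':
    # usage: watchdog.py <mongod uri> <mongod pid> <timeout seconds>
    watchdog(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))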

Implementing a similar solution inside the mongod process can be tricky: mongod cannot step down from primary while a thread holds a write lock, which is somewhat likely if that thread is in uninterruptible sleep, so other potential solutions need to be investigated. Until one is implemented, the workaround is to roll out a monitoring system tailored to each user's systems and needs.

Regards,
Ramón.

Comment by VictorGP [ 22/Mar/17 ]

How can an issue that can block the whole cluster have been open since 2014 with no solution yet? Do you see how important this could be for a company that hits this issue? Customers' confidence in MongoDB will drop, as is happening right now with us.

We at ThousandEyes are having a similar issue in our cluster (at this moment 15 bare-metal replica sets of 2 members + arbiter each) and we have already suffered a few DB outages because of it.
This happens during normal operations; we are not having disk failures, but somehow we are getting huge IO wait that blocks the server's IO completely and makes the mongod process hang. We are still investigating the root cause.

In the ticket I opened, SERVER-28422, I show a way to easily reproduce the problem using NFS.

The script above won't work because:

error = collection.update({_id: "watchdog"}, { lastUpdate: current time }, { upsert: true })

will never return if the member is blocked on IO.

However, I understand the complexity of this issue: whatever monitoring/heartbeat solution you use will have the problem of threads blocked in uninterruptible I/O sleep. Somehow this should be detected and acted upon, even if that means crashing the whole member, for example by taking down the process (SIGTERM or SIGKILL), triggering a 'panic' that removes the member from the replica set and provokes an automatic failover, or just intentionally removing the member from the replica set.

It is better to do that than to have the whole cluster stuck.

Comment by Jonathan Reams [ 08/Jan/15 ]

There are two workarounds I found for this.

The first is to put the journal on a different (local) volume than the databases. For example, if you have a server running mongod where the dbpath is located on a FibreChannel or iSCSI volume, you would place the MongoDB journal on direct-attached storage so that if there were a disruption on the FC/iSCSI volume, the journal would still be accessible. With this setup the heartbeat thread keeps answering heartbeats as if the node were healthy, but secondaries are unable to query the oplog, so they mark the server as unhealthy and fail over. A caveat to this approach is that it requires writes to trigger the failover. This has been tested in both 2.6 and the 2.8 RCs.

The other workaround is similar to the Python script earlier in this ticket. It doesn't have the caveat of the journal workaround and is the most fool-proof solution. It can be described in pseudocode as:

// connect to the local mongod to be monitored
// typically this will be localhost
// this script uses the PID number returned by the serverStatus command
// to kill mongod, so it is unsafe to run this on another host
session = new connection to mongod
 
// set the socket timeout on the session to the maximum timeout your application can
// tolerate in a failure - do not set it too low, because the timeout may also be caused
// by normal usage spikes in your application.
set session socket timeout to timeout period
 
// some drivers won't allow you to run the serverStatus command
// without setting the read preferences to slaveOkay (e.g. mgo)
set session read preferences to slaveOkay
 
for {
	// First run the server status command to determine whether this mongod
	// is a primary and get its PID
	session.runCommand "serverStatus"
	if error {
		sleep for timeout period
		restart loop
	}
 
	// This monitoring loop only applies to primaries.
	// If the mongod is a secondary, skip this loop
	if serverStatus["repl"]["ismaster"] is false {
		sleep for timeout period
		restart loop
	}
 
	// Get the PID from the server status info - this is what will
	// be killed if the server is down
	mongodpid = serverStatus["pid"]
 
	// The name of the collection is totally arbitrary, and we
	// use the local db so the watchdog data doesn't get replicated
	collection = session.DB("local").C("watchdog")
 
	// Do an upsert against the watchdog collection, the contents of the update
	// are not important - here we set the document to be the current time
	error = collection.update({_id: "watchdog"}, { lastUpdate: current time }, { upsert: true })
	if error {
		// Kill the PID taken from the serverStatus command
		// On POSIX systems, use -9 to kill it immediately
		// On Windows systems, terminate the process.
		// The process may still be in the process table after killing it, but
		// it will not be responding to heartbeats and an election should occur
		kill -9 mongodpid
		restart loop
	}
	sleep for timeout period
}
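
For reference, a rough Python translation of the pseudocode above (a sketch only, assuming pymongo; the timeout, field names, and collection name follow the pseudocode and should be tuned for your deployment):

#!/usr/bin/env python
# Rough Python sketch of the watchdog pseudocode above (assumes pymongo;
# run it against the local mongod only, since it kills the PID reported
# by serverStatus).
import os
import signal
import sys
import time
from datetime import datetime

import pymongo
from pymongo.errors import PyMongoError

# Maximum stall the application can tolerate; do not set it too low,
# because the timeout may also be caused by normal usage spikes.
TIMEOUT_S = 30

def watchdog():
    # The socket timeout makes the upsert below fail with an error
    # instead of hanging forever when mongod is blocked on IO.
    client = pymongo.MongoClient("localhost", 27017,
                                 socketTimeoutMS=TIMEOUT_S * 1000)
    while True:
        try:
            status = client.admin.command("serverStatus")
        except PyMongoError:
            time.sleep(TIMEOUT_S)
            continue

        # This monitoring loop only applies to primaries.
        if not status.get("repl", {}).get("ismaster", False):
            time.sleep(TIMEOUT_S)
            continue

        mongod_pid = int(status["pid"])
        try:
            # Upsert against local.watchdog; the contents are unimportant.
            client.local.watchdog.update_one(
                {"_id": "watchdog"},
                {"$set": {"lastUpdate": datetime.utcnow()}},
                upsert=True,
            )
        except PyMongoError:
            # The upsert failed or timed out: kill mongod (POSIX; on
            # Windows, terminate the process instead) so an election occurs.
            os.kill(mongod_pid, signal.SIGKILL)
        time.sleep(TIMEOUT_S)

if __name__ == '__main__':
    watchdog()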

Comment by Ramon Fernandez Marina [ 18/Nov/14 ]

Further discussion with schwerin shed light on a tentative workaround outside mongod, so I decided to give it a spin (thanks Andy for the pointers). I configured a 2.6.5 3-node replica set as follows:

  • node1 with dbpath on a hard-mounted NFS filesystem
  • node2 with dbpath in local storage, and an NFS server to host node1's dbpath
  • node3 with dbpath in local storage

I started node1 first so it became the primary, added node2 and node3, inserted some data, and verified that it replicated correctly. Then came the fun part:

  • I launched some slow inserts on the primary:

    for (i=0; i<1000; i++) {db.foo.insert({x:i}); sleep(1000)}
    

  • On the primary (node1) I ran a monitor program (below; the logic is explained in the comments)
  • Stopped the NFS server on node2

Soon enough the writes on node1 stopped, and the monitor script sent a SIGKILL to mongod. I tried this twice and got two different results:

  • once the mongod process went into uninterruptible sleep and could not be killed (so the monitor program hung)
  • once the mongod process went into zombie state and remained there

In both cases the replica set successfully elected a new primary, so what I think this experiment shows is that:

  • there's an (arguably hacky) way to monitor a replica set, namely by inserting no-ops in the oplog. A monitor script would have to be run on every node that can be elected primary
  • there may be no other acceptable recovery than rebooting the offending node, but at least a cluster with failed storage may continue to work (even if writes to the broken primary get lost).

#!/usr/bin/env python
# monitor.py: monitor mongod's health and kill unhealthy primaries
#
# DISCLAIMER: this script is an ugly hack as it stands, and nobody in their
# right mind should use it in production without shaping it up first and
# doing thorough testing
 
from multiprocessing import Process, Value
import sys
import time
import os
import pymongo
import signal
 
def check_status(uri, flag):
    try:
        # Connect to mongod and insert a no-op in the oplog; if that works the replicaset should be healthy
        connection = pymongo.MongoClient(uri)
        # Non-primaries are always healthy as far as this script is concerned
        if not connection.is_primary:
            flag.value=1
            sys.exit(0)
 
        db = connection.admin
        res = db.command({'appendOplogNote': 1, 'data': {'healthCheck': True}})
        # 'ok' is returned as a number (1.0 on success), so check it numerically
        if res.get('ok', 0) != 1:
            sys.exit(1)
        else:
            flag.value = 1
    except:
        sys.exit(1)
    sys.exit(0)
 
def monitor():
    # arguments: mongod URI, check interval in seconds, mongod PID
    uri = sys.argv[1]
    interval = int(sys.argv[2])
    mongopid = int(sys.argv[3])
 
    # replSet status flag: (1=true, 0=false)
    ok = Value('i', 0) 
 
    while True:
        # Set status to not-ok and spawn the check_status process
        ok.value = 0
        p = Process(target=check_status, args=(uri, ok))
        p.start()
 
        # Sleep for a while; when we come back the status flag should be
        # set to 1, otherwise kill mongod
        time.sleep(interval)
        if ok.value == 0:
            print "Time to kill mongod pid: %i " % mongopid
            os.kill(mongopid, signal.SIGKILL)
            sys.exit(1)
 
if __name__ == '__main__':
    monitor()

Comment by Niraj Londhe [ 11/Nov/14 ]

Hi Ramon,

Monitoring from outside mongod will have multiple dependencies, and there could be a chance of a false failover. Do we have any other alternative?
We need to provide a workaround to our customer until a permanent fix is available.
False failures would be critical for us.

Regards,
Amit

Comment by Ramon Fernandez Marina [ 06/Nov/14 ]

The best approach in the short term may be to write a script that does the necessary monitoring from outside mongod, and possibly reboots any node that has processes in D state.
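
As a starting point for such a script, on Linux the state of each mongod thread can be read from /proc (a hypothetical sketch; what to do when threads are stuck, whether alerting, killing, or rebooting, is left to the operator):

import glob

def threads_in_d_state(pid):
    # Return the /proc stat paths of threads in uninterruptible sleep ("D").
    stuck = []
    for stat_path in glob.glob("/proc/%d/task/*/stat" % pid):
        with open(stat_path) as f:
            data = f.read()
        # The state letter is the first field after the ")" that closes the
        # comm field (which may itself contain spaces).
        state = data.rsplit(")", 1)[1].split()[0]
        if state == "D":
            stuck.append(stat_path)
    return stuck

# Example: if threads_in_d_state(mongod_pid) stays non-empty across several
# checks, the node is likely blocked on IO and should be failed over.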

Comment by Andy Schwerin [ 03/Nov/14 ]

This is a tricky problem because a thread on the primary might go into uninterruptible I/O sleep almost at any time due to disk trouble. Perhaps a watchdog timer in the storage layer could be constructed, but it will take some research. We don't want to have primaries step down due to short bursts of high disk load.
