[SERVER-14139] Disk failure on one node can (eventually) block a whole cluster Created: 03/Jun/14 Updated: 06/Dec/22 Resolved: 17/Jul/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Andrew Ryder (Inactive) | Assignee: | Backlog - Replication Team |
| Resolution: | Duplicate | Votes: | 5 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
If a disk failure occurs in such a way that IO blocks without returning (admittedly a rare occurrence), the affected mongod will never give up waiting for the IO to complete. Heartbeats are still returned as normal, so other nodes continue to trust the node even though it is permanently dysfunctional. A replica set or a sharded cluster can therefore end up locked until the single faulty node is identified and terminated. |
| Comments |
| Comment by Kelsey Schubert [ 17/Jul/17 ] |
|
Hi all,

The work Ramón referenced in his previous comment has been completed, so this ticket is being resolved as a duplicate of that work.

Kind regards, |
| Comment by Ramon Fernandez Marina [ 23/Mar/17 ] |
|
Under the same conditions as above (primary on an NFS drive mounted with -o hard,fg, NFS server killed while slow inserts are happening), but using 3.4.2 and PV1, I get the same outcome: the primary blocks and the replica set becomes unusable (no election is triggered). I'll try with -o soft,bg and see what happens. Andy mentions oplog fetches, but here the secondaries are all caught up, so maybe we can try with secondaries that are behind.

EDIT: When I try with -o soft,bg the replica set ends up electing a new primary, but with -o hard,fg it doesn't seem to. Will take a closer look at the logs to compare.

EDIT #2: I think this is all explained by nfs(5): with a hard mount the client retries a stalled request indefinitely, while with a soft mount the request is eventually failed after the retries controlled by the timeo and retrans options.

For my setup (NFS over TCP, everything else the default) I would have expected to see an election in 6 minutes (3 retries at 60, 120 and 180 secs), but it happened after 10 minutes according to the logs. I'm going to call this a rounding error and not investigate further, because NFS is not what's under scrutiny here (see below).

While NFS is just an artifact to easily reproduce this behavior, for the purpose of this ticket it determines whether a write request blocks indefinitely (an event from which a primary can never recover, and which, unless the node is somehow taken out of the replica set, will prevent the replica set from accepting writes indefinitely) or whether the storage layer eventually fails the request and returns an error (which only cripples writes until that error is returned). This confirms that neither PV1 nor WiredTiger addresses the issue, so I think we should keep the ticket open to consider becoming more resilient against these insidious storage-layer problems.

Regards, |
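For anyone trying to reproduce this, the "slow inserts happening" part of the setup can be as simple as a timestamped insert loop. The sketch below is only illustrative (pymongo, the replica set name "rs0", the collection name, and the interval are my assumptions, not the exact harness used here):

```python
# Illustrative load generator for the reproduction above: keep inserting
# and report whether each write returns an error or simply never returns.
# Assumes pymongo and a replica set "rs0" on localhost:27017.
import time
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0",
                     serverSelectionTimeoutMS=10000)
coll = client.test.slow_inserts

i = 0
while True:
    try:
        coll.insert_one({"_id": i, "ts": time.time()})
        print("insert %d ok" % i)
    except PyMongoError as exc:
        # With -o soft the storage layer eventually fails the request and
        # we land here; with -o hard the insert_one call can block forever.
        print("insert %d failed: %s" % (i, exc))
    i += 1
    time.sleep(1)
```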
| Comment by Geert Bosch [ 23/Mar/17 ] |
|
There are a few situations in which locks can be held indefinitely:
In this last case, we change the lock granting protocol to "compatibleFirst", so that reads are not blocked behind writes. However, for ordinary reads we don't change this mode, so it is possible that an exclusive write request gets blocked behind a read, and then all other reads get blocked behind that write. In this specific scenario, issuing fsyncLock should still be possible, as it requests a mode that is compatible with the read; that would unblock all readers.

If a global write (say, a database creation) blocks, there really is nothing we can do if the write cannot be aborted. This is not really storage-engine specific. Of course the storage engine could throw a WriteConflictException or, better, some new exception that indicates write failure, but that would probably be a fair bit of work to implement. |
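For reference, fsyncLock and fsyncUnlock are just the fsync and fsyncUnlock admin commands; a minimal sketch of issuing them from a client in the scenario Geert describes (pymongo, localhost, and the timeout value are assumptions) could look like this:

```python
# Sketch: request fsyncLock on the affected node. Per the comment above,
# this asks for a lock mode compatible with the blocked read, so the
# request may still be grantable and unblock queued readers.
from pymongo import MongoClient

client = MongoClient("localhost", 27017, serverSelectionTimeoutMS=10000)

# Equivalent to db.fsyncLock() in the shell: {fsync: 1, lock: true}.
client.admin.command("fsync", lock=True)

# ... inspect the node, decide whether to fence it off or kill it ...

# Equivalent to db.fsyncUnlock() in the shell.
client.admin.command("fsyncUnlock")
```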
| Comment by Ramon Fernandez Marina [ 22/Mar/17 ] |
|
Thanks for commenting on this ticket, victorgp; I understand how uninterruptible sleep can be problematic in a replica set. I'd like to extend Andy's answer above on the complexity of the issue. The "right way(TM)" to address these issues in storage clusters is a technique called fencing: isolate the faulty node so the rest of the cluster can move on.

One common technique users can implement is STONITH ("shoot the other node in the head"), which requires specific knowledge of the systems to be isolated. A similar option is to use a "watchdog" for each node that, upon failure detection, kills that node itself (rather than other nodes, as with STONITH). The pseudocode above outlines such a solution, and spells out that the update call has to be made on a connection with a timeout.
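A minimal sketch of such a timed probe write, assuming pymongo (the timeout values and the database/collection names are illustrative, not taken from the original pseudocode):

```python
# The key point: the probe write is made on a connection with timeouts,
# so the watchdog's own call returns even when mongod is stuck on IO.
import time
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient(
    "localhost", 27017,
    connectTimeoutMS=5000,
    serverSelectionTimeoutMS=5000,
    socketTimeoutMS=10000,        # give up if the reply never arrives
)

def probe_write():
    """Return True only if a small write completes within the timeouts."""
    try:
        client.watchdog.probe.update_one(
            {"_id": "heartbeat"}, {"$set": {"ts": time.time()}}, upsert=True)
        return True
    except PyMongoError:
        return False
```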
When the update call times out or doesn't return, the watchdog knows to kill the mongod process (or to take it out of the replica set by blocking network connectivity with iptables, for example). One could also implement a watchdog that uses fork(), where the child process runs the update call and the parent process waits for the child to return within a period of time; if the child process becomes blocked, the parent can kill the mongod.

Implementing a similar solution inside the mongod process can be tricky: mongod cannot step down from primary while a thread holds a write lock, which is somewhat likely if that thread is in uninterruptible sleep, so other potential solutions need to be investigated. Until one is implemented, the workaround is to roll out a monitoring system tailored to each user's systems and needs.

Regards, |
| Comment by VictorGP [ 22/Mar/17 ] |
|
How can an issue that can block the whole cluster have been open since 2014 with no solution yet? Do you see how important this could be for a company that hits this issue? Customers' confidence in MongoDB will drop, as is happening right now with us. We at ThousandEyes are having a similar issue in our cluster (at the moment, 15 bare-metal replica sets of 2 members + an arbiter each) and we have already suffered a few DB outages because of this; I opened a ticket about it here.

That script above won't work, because the call it makes will never return if the member is blocked by IO.

However, I understand the complexity of this issue: whatever monitoring/heartbeat solution you use will run into the problem of threads blocked in uninterruptible I/O sleep. Somehow this should be detected and acted upon, even by crashing the whole member (taking the process down with SIGTERM or SIGKILL), triggering a 'panic' that removes the member from the replica set and provokes an automatic failover, or simply removing the member from the replica set intentionally. It is better to do that than to have the whole cluster stuck. |
| Comment by Jonathan Reams [ 08/Jan/15 ] |
|
There are two workarounds I found for this.

The first is to put the journal on a different (local) volume than the databases. For example, if you have a server running mongod where the dbpath is located on a FibreChannel or iSCSI volume, you would place the mongodb journal on direct-attached storage, so that if there were a disruption on the FC/iSCSI volume the journal would still be accessible. With this setup the heartbeat thread continues to answer heartbeats (erroneously reporting the node as healthy), but secondaries become unable to query the oplog, mark the server as unhealthy, and fail over. A caveat to this approach is that it requires writes to trigger the failover. This has been tested in both 2.6 and the 2.8 RCs.

The other workaround is similar to the python script earlier in this ticket. It doesn't have the caveat of the journal workaround and is the most foolproof solution. It can be described in pseudocode as: periodically perform a small write against the local mongod on a connection with a timeout, and kill mongod if the write does not complete in time. A rough sketch follows.
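This is only a rough rendering of that idea, not Jonathan's original pseudocode; pymongo, pkill, and the interval/timeout values are assumptions:

```python
# Watchdog sketch: periodically probe the local mongod with a timed write
# and SIGKILL it if the write hangs, so the replica set can fail over.
# Only acts when the node believes it is primary, to avoid killing a
# healthy secondary that merely rejects writes.
import subprocess
import time
from pymongo import MongoClient
from pymongo.errors import PyMongoError

CHECK_INTERVAL = 30        # seconds between probes
TIMEOUT_MS = 15000         # how long any single call may take

def kill_local_mongod():
    # Crude but effective: the other members stop hearing from this node
    # and elect a new primary. Fencing via iptables is an alternative.
    subprocess.call(["pkill", "-KILL", "-x", "mongod"])

while True:
    try:
        client = MongoClient("localhost", 27017,
                             connectTimeoutMS=TIMEOUT_MS,
                             serverSelectionTimeoutMS=TIMEOUT_MS,
                             socketTimeoutMS=TIMEOUT_MS)
        # isMaster is answered from memory, so it should work even while
        # disk IO is stuck (which is exactly the problem in this ticket).
        is_primary = client.admin.command("isMaster").get("ismaster", False)
    except PyMongoError:
        is_primary = False   # can't even reach mongod; leave that to other monitoring
    if is_primary:
        try:
            client.watchdog.probe.update_one(
                {"_id": "heartbeat"}, {"$set": {"ts": time.time()}},
                upsert=True)
        except PyMongoError:
            # The timed write failed or timed out on the primary: assume it
            # is wedged on IO and take it out of the replica set.
            kill_local_mongod()
    time.sleep(CHECK_INTERVAL)
```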
|
| Comment by Ramon Fernandez Marina [ 18/Nov/14 ] |
|
Further discussion with schwerin shed light on a tentative workaround outside mongod, so I decided to give it a spin (thanks Andy for the pointers). I configured a 2.6.5 3-node replica set as follows:
I started node1 first so it became the primary, added node2 and node3, inserted some data, and verified that it replicated correctly. Then came the fun part:
Soon enough the writes on node1 stopped, and the monitor script sent a SIGKILL to mongod. I tried this twice and got two different results:
In both cases the replica set successfully elected a new primary, so I think this experiment shows that an external monitor which kills a blocked mongod does allow the rest of the replica set to recover.
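For completeness, confirming from a client that a new primary was elected can be done with replSetGetStatus; a small sketch, where the replica set name "rs0" and the port are assumptions (node2/node3 are the hosts from the setup above):

```python
# Sketch: after node1's mongod is killed, ask the surviving members which
# node is now PRIMARY. Assumes pymongo; "rs0" and port 27017 are guesses.
from pymongo import MongoClient

client = MongoClient("mongodb://node2:27017,node3:27017/?replicaSet=rs0",
                     serverSelectionTimeoutMS=10000)
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    print(member["name"], member["stateStr"])   # expect exactly one PRIMARY
```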
|
| Comment by Niraj Londhe [ 11/Nov/14 ] |
|
Hi Ramon,

Monitoring from outside mongod will have multiple dependencies, and there could be a chance of a false fail-over. Do we have any alternative way?

Regards, |
| Comment by Ramon Fernandez Marina [ 06/Nov/14 ] |
|
The best approach in the short term may be to write a script that does the necessary monitoring from outside mongod, and possibly reboots any node that has processes in D state. |
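A minimal sketch of what the "processes in D state" check could look like on Linux, scanning /proc for mongod threads in uninterruptible sleep. The reboot/kill reaction and any thresholds are deliberately left out, and a single momentary D-state sample is normal under heavy IO, so repeated samples would be needed in practice:

```python
# Sketch: find mongod threads stuck in uninterruptible sleep ("D") by
# scanning /proc on Linux. What to do with the result (reboot the node,
# SIGKILL mongod, fence it off) is left out on purpose.
import glob

def mongod_pids():
    """PIDs whose process name is mongod."""
    pids = []
    for comm_path in glob.glob("/proc/[0-9]*/comm"):
        try:
            with open(comm_path) as f:
                if f.read().strip() == "mongod":
                    pids.append(comm_path.split("/")[2])
        except IOError:
            continue                      # process exited while scanning
    return pids

def d_state_threads(pid):
    """Thread IDs of `pid` currently in D state."""
    stuck = []
    for stat_path in glob.glob("/proc/%s/task/[0-9]*/stat" % pid):
        try:
            with open(stat_path) as f:
                # stat format is "tid (comm) state ..."; take the field
                # right after the closing parenthesis of the thread name.
                state = f.read().rsplit(")", 1)[1].split()[0]
        except (IOError, IndexError):
            continue
        if state == "D":
            stuck.append(stat_path.split("/")[4])
    return stuck

if __name__ == "__main__":
    for pid in mongod_pids():
        stuck = d_state_threads(pid)
        if stuck:
            print("mongod pid %s has threads in D state: %s" % (pid, stuck))
```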
| Comment by Andy Schwerin [ 03/Nov/14 ] |
|
This is a tricky problem because a thread on the primary might go into uninterruptible I/O sleep almost at any time due to disk trouble. Perhaps a watchdog timer in the storage layer could be constructed, but it will take some research. We don't want to have primaries step down due to short bursts of high disk load. |