  1. Core Server
  2. SERVER-14139

Disk failure on one node can (eventually) block a whole cluster


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical - P2
    • Replication, Storage
    • Replication
    • ALL

    Description

      If a disk failure blocks I/O without ever returning (admittedly a rare occurrence), the affected mongod never gives up waiting for the I/O to complete. It still answers heartbeats as normal, so the other nodes continue to trust it even though it is permanently dysfunctional.

      As a result, a replica set or a sharded cluster can eventually lock up, and it stays locked until the single faulty node is identified and terminated.
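
      To illustrate the failure mode, here is a minimal sketch (not mongod code) of a heartbeat reply that depends on a storage probe with a deadline. The names `disk_is_responsive`, `heartbeat_reply`, and the 5-second default timeout are hypothetical and chosen only for illustration. The probe runs in a daemon thread because a thread stuck in a hung `fsync` cannot be cancelled; the caller simply stops waiting for it.

```python
import os
import tempfile
import threading


def _write_and_sync(directory: str, done: threading.Event) -> None:
    """Write a small file and fsync it; set `done` only if the I/O returns."""
    fd, path = tempfile.mkstemp(dir=directory, prefix="io_probe_")
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # on a hung disk this call would block indefinitely
    finally:
        os.close(fd)
        os.unlink(path)
    done.set()


def disk_is_responsive(directory: str, timeout_s: float = 5.0,
                       probe=_write_and_sync) -> bool:
    """Return True if the probe finishes within `timeout_s` seconds."""
    done = threading.Event()
    # Daemon thread: if the I/O never returns, the thread is abandoned
    # instead of keeping the process alive or blocking the caller.
    worker = threading.Thread(target=probe, args=(directory, done), daemon=True)
    worker.start()
    return done.wait(timeout_s)


def heartbeat_reply(directory: str, timeout_s: float = 5.0,
                    probe=_write_and_sync) -> dict:
    """Build a heartbeat reply that reflects storage liveness (hypothetical format)."""
    ok = disk_is_responsive(directory, timeout_s, probe)
    return {"ok": ok, "reason": None if ok else "storage I/O probe timed out"}
```

      With a check like this, a node whose disk has stopped responding would report itself as unhealthy instead of answering heartbeats as if nothing were wrong. The `probe` parameter exists only so that a hung disk can be simulated in tests.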

          People

            backlog-server-repl Backlog - Replication Team
            andrew.ryder@mongodb.com Andrew Ryder (Inactive)
            Votes: 5
            Watchers: 47
