Core Server / SERVER-14139

Disk failure on one node can (eventually) block a whole cluster


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Duplicate
    • None
    • None
    • Replication, Storage
    • None
    • Replication
    • ALL

    Description

      If a disk fails in a way that blocks I/O without ever returning (admittedly a rare occurrence), the affected mongod never gives up waiting for the I/O to complete. It continues to answer heartbeats as normal, so the other nodes keep trusting it even though it is permanently dysfunctional.

      As a result, a replica set or a sharded cluster can eventually lock up, and it stays locked up until the single faulty node is identified and terminated.
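
One way to picture a mitigation is a heartbeat that depends on a bounded storage probe, so that a node whose disk has stopped responding no longer reports itself as healthy. The following is only a minimal sketch under stated assumptions, not mongod's actual heartbeat or watchdog code: the function names `disk_is_responsive` and `heartbeat_reply`, the reply fields, and the 5-second timeout are all hypothetical and chosen for illustration.

```python
import os
import tempfile
import threading


def disk_is_responsive(directory, timeout_seconds=5.0):
    """Return True only if a small write + fsync in `directory` finishes in time.

    Hypothetical helper: the probe runs in a separate thread so the caller
    is never blocked longer than `timeout_seconds`, even if the I/O hangs.
    """
    result = {"ok": False}
    finished = threading.Event()

    def probe():
        try:
            fd, path = tempfile.mkstemp(dir=directory, prefix="io_probe_")
            try:
                os.write(fd, b"probe")
                os.fsync(fd)  # force the write down to the device
            finally:
                os.close(fd)
                os.unlink(path)
            result["ok"] = True
        except OSError:
            pass  # an I/O error also counts as "not responsive"
        finally:
            finished.set()

    # Daemon thread: a probe stuck on hung I/O must not keep the process alive.
    threading.Thread(target=probe, daemon=True).start()
    return finished.wait(timeout_seconds) and result["ok"]


def heartbeat_reply(directory, timeout_seconds=5.0):
    """Build a heartbeat reply that reflects storage health (hypothetical format)."""
    ok = disk_is_responsive(directory, timeout_seconds)
    return {"ok": ok, "state": "healthy" if ok else "storage-unresponsive"}
```

With a check of this kind, the other members would see the failure in the heartbeat itself rather than continuing to trust the node. The design choice that matters is that the probe runs off the heartbeat thread and is given a deadline; a probe run inline would simply hang along with the disk.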

            People

              backlog-server-repl Backlog - Replication Team
              andrew.ryder@mongodb.com Andrew Ryder (Inactive)
              Votes: 5
              Watchers: 47

              Dates

                Created:
                Updated:
                Resolved: