Core Server / SERVER-15353

MongoDB crash left one shard unable to recover

Details

    • Type: Question
    • Resolution: Done
    • Priority: Major - P3

    Description

      Setup: 8-shard cluster, each shard a replica set consisting of a primary, 2 secondaries, a hidden secondary (for backups), and an arbiter.
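
      For reference, each shard's replica set config looks roughly like this (hostnames are illustrative; the set name and port are taken from the log lines below):

      {
          "_id" : "rs1",
          "version" : 1,
          "members" : [
              { "_id" : 0, "host" : "node0.example:10001" },
              { "_id" : 1, "host" : "node1.example:10001" },
              { "_id" : 2, "host" : "node2.example:10001" },
              { "_id" : 3, "host" : "node3.example:10001", "priority" : 0, "hidden" : true },
              { "_id" : 4, "host" : "node4.example:10001", "arbiterOnly" : true }
          ]
      }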

      We were in the process of resyncing two nodes on one of our shards (to release disk space back to the operating system) when the remaining data-bearing secondaries simultaneously crashed (they ran out of disk space).

      One of the nodes that was in the process of resyncing appears to have finished - it was in the RECOVERING state and had reached the same level of disk usage as the other nodes in the shard. I immediately backed up the data directory of this node.

      I've tried redeploying the shard using the salvaged data directory from this node, but the replica set won't elect a primary - all nodes stay in the STARTUP2 state with "initial sync need a member to be primary or secondary to do our initial sync". I can start the nodes as standalones and access the data - I need this shard to re-form a replica set so the cluster can function again. I'm not worried about data inconsistency, as most of the data is "relatively" volatile.
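
      (By "standalone" I mean restarting mongod without the --replSet option, roughly as follows - the dbpath is illustrative, the port is from our logs:)

      mongod --port 10001 --dbpath /var/lib/mongodb/rs1
      mongo --port 10001
      > db.getSiblingDB("local").system.replset.find()   // shows the replica set config stored in the local database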

      I'm seeing log lines such as these:

      Sep 22 23:54:08 terra mongod.10001[1716]: Mon Sep 22 23:54:08.684 [rsSync] replSet initial sync pending
      Sep 22 23:54:08 terra mongod.10001[1716]: Mon Sep 22 23:54:08.684 [rsSync] replSet initial sync need a member to be primary or secondary to do our initial sync

      Sep 22 23:53:40 terra mongos.27017[30808]: Mon Sep 22 23:53:40.603 [ReplicaSetMonitorWatcher] warning: No primary detected for set rs1
      Sep 22 23:53:40 terra mongos.27017[30808]: Mon Sep 22 23:53:40.603 [ReplicaSetMonitorWatcher] All nodes for set rs1 are down. This has happened for 7 checks in a row. Polling will stop after 23 more failed checks

      Is there any way to force the replica set to reform with the data that is available?
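
      (One approach I'm considering, pieced together from the documentation - hostnames and dbpath are illustrative, the port and set name come from the logs above: rebuild the set from the salvaged node by starting it standalone, dropping the old replication metadata, re-initiating, and letting the other nodes initial-sync from it. Would this be safe?)

      # 1. start the salvaged node without --replSet, then clear the old replication metadata
      mongod --port 10001 --dbpath /var/lib/mongodb/rs1
      mongo --port 10001
      > use local
      > db.dropDatabase()        // drops the oplog and the stored replica set config

      # 2. restart with --replSet and initiate a fresh single-member set
      mongod --port 10001 --dbpath /var/lib/mongodb/rs1 --replSet rs1
      mongo --port 10001
      > rs.initiate()

      # 3. once this node becomes PRIMARY, add the other (wiped) members back so they initial sync from it
      > rs.add("node1.example:10001")
      > rs.addArb("node4.example:10001")

      (The other option I've seen mentioned is a forced reconfig - rs.reconfig(cfg, { force: true }) - against the surviving node, but I'm not sure it applies while the node is stuck in STARTUP2.)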

People

    Assignee: Unassigned
    Reporter: quickdry21 (Eric Coutu)
    Votes: 0
    Watchers: 2