Details
Type: Question
Resolution: Done
Priority: Major - P3
Description
Setup: an 8-shard cluster; each shard is a replica set consisting of a primary, two secondaries, a hidden secondary (for backups), and an arbiter.
We were in the process of resyncing two nodes on one of our shards (to release disk space back to the operating system) when the remaining data-bearing secondaries simultaneously crashed (they ran out of disk space).
One of the nodes that was resyncing appears to have finished - it was in the RECOVERING state and had reached the same level of disk usage as the other nodes in the shard. I immediately backed up this node's data directory.
I've tried redeploying the shard using the salvaged data directory from this node, but the replica set won't elect a primary - all nodes stay in the STARTUP2 state, logging "initial sync need a member to be primary or secondary to do our initial sync". I can start the nodes as standalones and access the data, but I need this shard to reform as a replica set so the cluster can function again. I'm not worried about data inconsistency, as most of the data is "relatively" volatile.
I'm seeing log lines such as these:
Sep 22 23:54:08 terra mongod.10001[1716]: Mon Sep 22 23:54:08.684 [rsSync] replSet initial sync pending
Sep 22 23:54:08 terra mongod.10001[1716]: Mon Sep 22 23:54:08.684 [rsSync] replSet initial sync need a member to be primary or secondary to do our initial sync
Sep 22 23:53:40 terra mongos.27017[30808]: Mon Sep 22 23:53:40.603 [ReplicaSetMonitorWatcher] warning: No primary detected for set rs1
Sep 22 23:53:40 terra mongos.27017[30808]: Mon Sep 22 23:53:40.603 [ReplicaSetMonitorWatcher] All nodes for set rs1 are down. This has happened for 7 checks in a row. Polling will stop after 23 more failed checks
Is there any way to force the replica set to reform with the data that is available?
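For anyone in a similar position, the usual escape hatch is a forced reconfiguration: restart the salvaged node with its normal --replSet option, connect to it directly with the mongo shell, and reconfigure the set to contain only the surviving member(s), passing {force: true} so that no primary is required for the reconfig to succeed. A hedged sketch follows (the member index and any hostnames are placeholders, not taken from this report - inspect rs.conf() on your own node first):

```javascript
// Run in the mongo shell, connected directly to the salvaged node.
// WARNING: {force: true} bypasses the normal safety checks --
// back up the data directory first, as described above.

// Start from the node's current view of the configuration.
cfg = rs.conf();

// Keep only the member(s) whose data you trust. The index 0 here is
// a placeholder; check cfg.members to find the right entry.
cfg.members = [cfg.members[0]];

// Force the reconfiguration even though no primary exists.
rs.reconfig(cfg, {force: true});

// The node should step up to PRIMARY shortly; verify with:
rs.status();
```

Once the forced member is primary, the other nodes can be re-added with rs.add() and will initial-sync from it in the normal way.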