[SERVER-15353] MongoDB crash left one shard unable to recover Created: 23/Sep/14  Updated: 23/Sep/14  Resolved: 23/Sep/14

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Eric Coutu Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

Setup: an 8-shard cluster, each shard a replica set consisting of a primary, 2 secondaries, a hidden secondary (for backups), and an arbiter.
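
For reference, a rough sketch of what each shard's replica set configuration looks like in the mongo shell (the set name rs1 and terra:10001 come from the logs below; the other host names are placeholders):

rs.initiate({
    _id: "rs1",
    members: [
        { _id: 0, host: "terra:10001" },                                                 // primary
        { _id: 1, host: "secondary1.example.net:10001" },                                // secondary
        { _id: 2, host: "secondary2.example.net:10001" },                                // secondary
        { _id: 3, host: "hidden-backup.example.net:10001", hidden: true, priority: 0 },  // hidden secondary for backups
        { _id: 4, host: "arbiter.example.net:10001", arbiterOnly: true }                 // arbiter
    ]
})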

We were in the process of resyncing two nodes on one of our shards (to release disk space to the operating system) when the remaining data-replicating secondaries simultaneously crashed after running out of disk space.

One of the nodes that was in the process of resyncing appears to have finished - it was in the RECOVERING state and had reached the same level of disk usage as the other nodes in the shard. I immediately backed up the data directory of this node.

I've tried redeploying the shard using the salvaged data directory from this node, but the replica set won't elect a primary - all nodes stay in the STARTUP2 state, logging "initial sync need a member to be primary or secondary to do our initial sync". I can start the nodes as standalones and access the data; I need this shard to re-form a replica set so the cluster can function again. I'm not worried about data inconsistency, as most of the data is "relatively" volatile.
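
To illustrate the symptom, a quick check from the mongo shell on any member (port 10001 is taken from the log lines below; the loop is only for readability):

// every member reports stateStr STARTUP2 and none reports PRIMARY or SECONDARY,
// so the resyncing nodes have no member to initial-sync from
rs.status().members.forEach(function (m) { print(m.name + " : " + m.stateStr); });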

We're seeing log lines such as these:

Sep 22 23:54:08 terra mongod.10001[1716]: Mon Sep 22 23:54:08.684 [rsSync] replSet initial sync pending
Sep 22 23:54:08 terra mongod.10001[1716]: Mon Sep 22 23:54:08.684 [rsSync] replSet initial sync need a member to be primary or secondary to do our initial sync

Sep 22 23:53:40 terra mongos.27017[30808]: Mon Sep 22 23:53:40.603 [ReplicaSetMonitorWatcher] warning: No primary detected for set rs1
Sep 22 23:53:40 terra mongos.27017[30808]: Mon Sep 22 23:53:40.603 [ReplicaSetMonitorWatcher] All nodes for set rs1 are down. This has happened for 7 checks in a row. Polling will stop after 23 more failed checks

Is there any way to force the replica set to reform with the data that is available?



 Comments   
Comment by Ramon Fernandez Marina [ 23/Sep/14 ]

Hi eric.coutu@sweetiq.com, glad to hear you were able to recover your replica set. Note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server and tools. For MongoDB-related support discussion, please post on the mongodb-user group or on Stack Overflow with the mongodb tag, where your question will reach a larger audience; a question like this, which calls for more discussion, is best suited to the mongodb-user group.

Regards,
Ramón.

Comment by Eric Coutu [ 23/Sep/14 ]

Fixed it. For anyone stuck in STARTUP2 limbo: wipe out the local.* files on every instance in the replica set, start the mongod processes again, then recreate the replica set.
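
A rough sketch of that procedure in the mongo shell, assuming the set name rs1 and port 10001 from the logs above and placeholder host names:

// 1. Stop every mongod in the set, delete the local.* files from each node's dbpath,
//    then restart the processes with their usual --replSet rs1 option.

// 2. Connect to the node that holds the salvaged data and recreate the set
//    with that node as the only member:
rs.initiate({ _id: "rs1", members: [ { _id: 0, host: "terra:10001" } ] })

// 3. Once it becomes PRIMARY, add the other members back; the empty nodes
//    will perform a fresh initial sync from it:
rs.add("secondary1.example.net:10001")
rs.add("secondary2.example.net:10001")
rs.add("hidden-backup.example.net:10001")   // reapply hidden: true, priority: 0 via rs.reconfig()
rs.addArb("arbiter.example.net:10001")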
