[SERVER-40118] Allow users to initiate the replica set with the specified term (election id) Created: 14/Mar/19  Updated: 27/Jun/19  Resolved: 27/Jun/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Linda Qin Assignee: Mira Carey
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-41871 Provide a mechanism to remove a shard... Closed
Participants:

 Description   

Background

In the situation where a shard is down and has lost all its data, recovering this shard with minimal impact on the cluster requires ensuring that the term (electionId) and configuration version for the recovered shard are the same as, or higher than, the electionId/configuration version cached (for this shard) in the Replica Set Monitor on the mongos/CSRS/other shards. Otherwise the operation may fail with the following error (set XXX is the name of the down shard):

"Could not find host matching read preference { mode: \"primary\" } for set XXX"

Issue

Currently, when we initiate the replica set, we can specify the configuration version in the config document. However, there is no way to specify the (initial) term (electionId) for the replica set.
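For illustration, a minimal sketch of initiating the recreated shard from the mongo shell. Per the above, a version field can be supplied in the config document, but there is no corresponding field to seed the term/electionId (set name XXX and host names are made up):

    // Hypothetical example: initiate the recreated shard with an elevated config version.
    rs.initiate({
        _id: "XXX",                        // replica set name of the down shard
        version: 100,                      // configuration version can be specified...
        members: [
            { _id: 0, host: "shardXXX-node0:27018" },
            { _id: 1, host: "shardXXX-node1:27018" },
            { _id: 2, host: "shardXXX-node2:27018" }
        ]
        // term: 42                        // ...but no such field exists for the term/electionId
    });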

As such, there are a few workarounds for the above issue:

  • One workaround is to shut down the replica set, update the term in the local.replset.election collection, and then restart the shard (see the sketch after this list). However, for a shard using the in-memory storage engine this is not feasible, as the data (including the local database) is lost when the shard is restarted.
  • Another workaround is to restart the whole cluster. This is quite painful, especially for large sharded clusters. Also, for a sharded cluster that uses the in-memory storage engine, we can't simply stop all the members of the cluster at the same time, or the data on those shards will be lost. So we would need to restart the mongos/CSRS/shard members in a rolling fashion, which requires a lot of effort.
  • A third workaround is to repeatedly step down the primary of the shard until the new term (electionId) matches the term from before the shard went down. If the term for this shard was high before it went down, this workaround may not be feasible.
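For the first workaround, a rough sketch of what updating the term might look like, assuming the member is first restarted as a standalone (the field name and the procedure are illustrative only and not a supported interface):

    // On the member, after restarting it as a standalone (without --replSet):
    // bump the persisted term above the value cached elsewhere in the cluster.
    var local = db.getSiblingDB("local");
    local.replset.election.updateOne(
        {},                                        // the single last-vote document
        { $set: { term: NumberLong(50) } }         // assumed: a term higher than the old one
    );
    // Then restart the member with --replSet so it rejoins the shard.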

As described above, these workarounds are either not feasible or require a lot of effort. It would be nice if we could specify the term/electionId when initiating the replica set.



 Comments   
Comment by Ratika Gandhi [ 27/Jun/19 ]

Will be addressed by SERVER-41871.

Comment by Linda Qin [ 17/May/19 ]

To clarify, we can still access the data on other shards, we just can't return queries that include the down shard in the target, right?

Yes, that's correct. Targeted queries that don't hit the down shard can still return results.

Putting aside the in-memory problems for now, would it be best to have a way to force the shard to be removed and solve the allowPartialResults problem?

Yup. I think this would be very nice.

Comment by Alyson Cabral (Inactive) [ 17/May/19 ]

Thanks for your detailed response, Linda! To clarify, we can still access the data on other shards, we just can't return queries that include the down shard in the target, right? 

It seems to me like you'd rarely want to keep the metadata around for the lost data. That could lead a shard to believe it has much more data than it actually has in practice. Putting aside the in-memory problems for now, would it be best to have a way to force the shard to be removed and solve the allowPartialResults problem? 

Comment by Linda Qin [ 17/May/19 ]

This could happen with any storage engine, but with the in-memory storage engine it's harder to recover. For the other engines that persist the replica set configuration in the local database, we could hack the term in the local.replset.election collection to work around this issue, but this is impossible for the in-memory storage engine.

Does this problem exist if you remove the shard and add a new shard? What is the benefit of keeping a shard that has no backups?

This problem should not exist if we remove the shard and add a new shard. However, in the situation discussed here (a shard is down and has lost all its data), it looks to me like it's impossible to remove this shard without starting it up, as this shard still owns chunks and removing it would require moving all of those chunks to the other shards. We could hack the config database to re-assign the chunks on this down shard to the other shards, but I assume that would require restarting the whole cluster - for a sharded cluster with the in-memory storage engine, this would mean losing all the data on the other shards too...
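For context, the kind of config-database hack referred to here would be something along the following lines (purely illustrative and unsupported; the shard names are made up, and the routing information cached on the mongos/shards would still need to be refreshed, hence the restart):

    // Connected to the CSRS primary: re-assign chunk ownership away from the lost shard.
    var cfg = db.getSiblingDB("config");
    cfg.chunks.updateMany(
        { shard: "XXX" },                  // chunks currently assigned to the down shard
        { $set: { shard: "shard0001" } }   // assumed: some surviving shard
    );
    // Databases whose primary shard was the down shard would need the same treatment.
    cfg.databases.updateMany(
        { primary: "XXX" },
        { $set: { primary: "shard0001" } }
    );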

The benefit of keeping the shard that has no backups is that the customer can still access the data on the other shards, without needing to re-build the whole cluster (which includes sharding the collections, pre-splitting the chunks, etc.).

Also note that allowPartialResults currently does not work when a sharded cluster is backed by replica sets (SERVER-31511). When a shard is down, the customer won't be able to access the data on the other shards, so we would like to bring the down shard back as soon as possible.
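For reference, allowPartialResults is the find option that is supposed to let a query return whatever the reachable shards can provide instead of failing when one targeted shard is down; a minimal example via the find command (collection name is made up):

    // Ask mongos to return partial results from the shards that are still up.
    db.runCommand({
        find: "orders",                    // hypothetical sharded collection
        filter: {},
        allowPartialResults: true
    });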

Comment by Andy Schwerin [ 16/May/19 ]

I think it does have backups. This is about putting the shard back together from a backup.

Edit: Wait, that's a different ticket, I guess? This does seem to say "all data lost". I think there's a related case where you're restoring some stale data because it's all you've got.

Comment by Gregory McKeon (Inactive) [ 16/May/19 ]

Does this problem exist if you remove the shard and add a new shard? What is the benefit of keeping a shard that has no backups? kevin.pulo

Comment by Kevin Pulo [ 23/Apr/19 ]

Sharded clusters in general. The only requirement is that all members of the shard are irrevocably lost (with no backups), and the shard is then re-created from scratch with rs.initiate(). This situation is more likely with in-memory, but can happen for any engine.

Comment by Judah Schvimer [ 22/Apr/19 ]

linda.qin, is this only possible with the in-memory storage engine, or sharded clusters in general?
