Type: Improvement
Resolution: Won't Fix
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Replication
Labels: None

Confidence Status: None
Work Order: 3
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Background

When a shard goes down and loses all its data, recovering it with minimal impact on the cluster requires that the term (electionId) and configuration version of the recovered shard be the same as or higher than the electionId and configuration version cached for this shard in the Replication Monitor on the mongos, the CSRS, and the other shards. Otherwise, operations may fail with the following error (where XXX is the name of the down shard):

"Could not find host matching read preference { mode: \"primary\" } for set XXX"

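The condition above can be illustrated with a small model. This is a simplified sketch of the rule as this ticket describes it, not the server's monitor code: the `PrimaryView` type, the `is_accepted` function, and the sample numbers are all hypothetical, and the model assumes that both the configuration version and the term must be the same or higher than the cached values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimaryView:
    """Hypothetical record of what a monitor remembers about a set's primary."""
    config_version: int
    term: int


def is_accepted(cached: PrimaryView, reported: PrimaryView) -> bool:
    """Return True if the reported primary is not older than the cached view.

    Assumption: both values must be the same or higher than the cached ones,
    as described in the Background section of this ticket.
    """
    return (reported.config_version >= cached.config_version
            and reported.term >= cached.term)


# Hypothetical values: the monitor cached term 42 before the shard was lost.
cached = PrimaryView(config_version=5, term=42)
fresh_shard = PrimaryView(config_version=5, term=1)   # re-initiated, term restarts
caught_up = PrimaryView(config_version=5, term=42)    # term raised to the old value

print(is_accepted(cached, fresh_shard))  # False
print(is_accepted(cached, caught_up))    # True
```

In this model, the re-initiated shard is rejected because its term restarts from a low value, which corresponds to the "Could not find host matching read preference" failure seen by callers.
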
Issue

Currently, when we initiate a replica set, we can specify the configuration version in the replica set configuration. However, there is no way to specify the initial term (electionId) for the replica set.

As a result, the issue above can only be addressed with workarounds:

One workaround is to shut down the replica set, update the term in the local.replset.election collection, and then restart the shard. However, this is not feasible for a shard that uses the In-Memory storage engine, because the data (including the local database) is lost when the shard is restarted.
Another workaround is to restart the whole cluster. This is painful, especially for large sharded clusters. In addition, for a sharded cluster that uses the In-Memory storage engine, we cannot stop all the members of the cluster at the same time, because the data on those shards would be lost. We would therefore need to restart the mongos, CSRS, and shard members in a rolling fashion, which requires a lot of effort.
The third workaround is to step down the primary on the shard repeatedly until the new term (electionId) matches the term the shard had before it went down. If that term was high, this workaround might not be feasible.
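To give a sense of why the step-down workaround may not be feasible, here is a rough back-of-the-envelope sketch. The function names and all the numbers are hypothetical, and it assumes that each step-down leads to exactly one election that raises the term by one; in practice an election can raise the term by more, so this is an upper bound on the number of step-downs.

```python
def elections_needed(current_term: int, target_term: int) -> int:
    """Number of elections needed to raise current_term to target_term.

    Assumption: each step-down triggers one election that raises the term by one.
    """
    return max(0, target_term - current_term)


def estimated_seconds(current_term: int, target_term: int,
                      seconds_per_election: float) -> float:
    """Rough total time if every step-down and election cycle takes the same time."""
    return elections_needed(current_term, target_term) * seconds_per_election


# Hypothetical numbers: a freshly initiated shard at term 1, a cached term of
# 50000, and 12 seconds per step-down and election cycle.
total = estimated_seconds(1, 50_000, 12.0)
print(f"{elections_needed(1, 50_000)} elections, about {total / 86_400:.1f} days")
```

With these made-up numbers the shard would need tens of thousands of elections, which is roughly a week of continuous step-downs.
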

In short, these workarounds are either not feasible or require a lot of effort. It would be helpful to be able to specify the term/electionId when initiating the replica set.

is related to

SERVER-41871 Provide a mechanism to remove a shard and also abandon its chunks

Closed

Assignee: Mira Carey
Reporter: Linda Qin
Participants: Alyson Cabral, Andy Schwerin, Gregory McKeon, Judah Schvimer, Kevin Pulo, Linda Qin, Mira Carey, Ratika Gandhi
Votes: 0
Watchers: 14

Created: Mar 14 2019 03:22:35 AM UTC
Updated: Jun 27 2019 03:43:10 PM UTC
Resolved: Jun 27 2019 03:43:10 PM UTC
