[SERVER-33247] Repair replicated and sharded collection UUIDs Created: 09/Feb/18  Updated: 09/Jul/18  Resolved: 29/Jun/18

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Xiangyu Yao (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Sprint: Storage NYC 2018-07-02
Participants:

 Description   

In v3.6, --repair partially handles missing UUIDs in some collections by placing the node in the kUpgradingTo36 FCV mode.

In v4.0, however, UUIDs should always be present, so there is no mode we can put the node in where it is acceptable for UUIDs to be missing. Therefore, we shall fix up the UUIDs on --repair if any are missing. SERVER-33151 makes startup require all collections to have UUIDs. We can easily generate UUIDs on startup for non-replicated collections without complication (this is done by SERVER-33246). However, replicated collections require a cross-replica-set repair solution, and sharded collections require a cross-cluster repair solution. That is what this ticket is for.

The sharding catalog also persists UUIDs in the config.collections collection on config servers and the config.cache.collections collection on shards. These may have to be considered as well.
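
For illustration only, a minimal sketch (using pymongo; the hostnames and the namespace are placeholders, not anything from this ticket) of the kind of cross-cluster consistency check this implies: comparing the UUID the sharding catalog records in config.collections with the UUID the owning shard reports via listCollections.

    # Sketch: compare the sharding catalog's recorded UUID for a namespace with the
    # UUID the shard itself reports. Hostnames and namespace are hypothetical.
    from pymongo import MongoClient

    config_svr = MongoClient("configsvr.example.net:27019")
    shard = MongoClient("shard0.example.net:27018")

    ns = "test.sharded_coll"
    db_name, coll_name = ns.split(".", 1)

    # UUID persisted by the sharding catalog (config.collections is keyed by namespace).
    catalog_doc = config_svr["config"]["collections"].find_one({"_id": ns})
    catalog_uuid = catalog_doc.get("uuid") if catalog_doc else None

    # UUID the shard reports for the collection via listCollections.
    info = next(shard[db_name].list_collections(filter={"name": coll_name}), {})
    shard_uuid = info.get("info", {}).get("uuid")

    print("config.collections UUID:", catalog_uuid)
    print("shard listCollections UUID:", shard_uuid)
    print("consistent" if catalog_uuid == shard_uuid else "MISMATCH")

A mismatch, or a missing UUID on either side, is the cross-cluster inconsistency in question; config.cache.collections on each shard could be checked the same way.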



 Comments   
Comment by Dianna Hohensee (Inactive) [ 09/Jul/18 ]

Sounds good. Thanks for the explanation.

Comment by Xiangyu Yao (Inactive) [ 06/Jul/18 ]

> how would the UUID be fixed in standalone mode?

It would require the server to downgrade to v3.6 and use v3.6 mechanism to fix the UUIDs.

> And can an initial sync be provoked when the node is up to date data-wise, and will it get a UUID for a namespace it already has?

The initial sync approach requires users to manually wipe all the data on the affected node and re-sync from a node that has the UUIDs.

> Can the UUID be dictated, so a sharded cluster can be repaired? Writes to the config server's sharding metadata would be necessary as well.

I think it would be a huge issue if there is any UUID inconsistency between shards, between config servers, or between shards and config servers. Given that there could be all sorts of different scenarios and the rarity of the problem, a manual fix is better than a one-command fix.

The overall idea is that this scenario should be rare, and if it happens, manual fixes should be involved. Even fixing UUIDs on a standalone with "--repair" would be dangerous because the usage is not what users expect: users might see it as a way to fix a node that is missing some UUIDs, but if the other nodes in the replica set have the UUIDs, an initial sync is actually the right solution.
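
To make the last point concrete, a rough sketch (pymongo, with placeholder member hostnames and namespace) of checking whether other members of the replica set still have a UUID for a namespace; if they do, re-syncing the inconsistent member is the right fix rather than --repair:

    # Sketch: ask each replica set member which UUID, if any, it reports for a
    # namespace. Hostnames and namespace are hypothetical.
    from pymongo import MongoClient

    members = ["rs0-a.example.net:27017", "rs0-b.example.net:27017", "rs0-c.example.net:27017"]
    ns = "test.mycoll"
    db_name, coll_name = ns.split(".", 1)

    for host in members:
        # directConnection avoids being routed to the primary (pymongo 4.x option)
        client = MongoClient(host, directConnection=True)
        info = next(client[db_name].list_collections(filter={"name": coll_name}), {})
        print(host, info.get("info", {}).get("uuid"))
    # If some members report a UUID and one does not, initial sync the odd one out.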

Comment by Dianna Hohensee (Inactive) [ 02/Jul/18 ]

xiangyu.yao, how would the UUID be fixed in standalone mode? And can an initial sync be provoked when the node is up to date data-wise, and will it get a UUID for a namespace it already has? Can the UUID be dictated, so a sharded cluster can be repaired? Writes to the config server's sharding metadata would be necessary as well.

Should this manual recovery process be documented, and an appropriate error message given to users when UUIDs are missing? The mongod server won't start up without all UUIDs, I believe, so the fix is somewhat complicated.

Comment by Xiangyu Yao (Inactive) [ 29/Jun/18 ]

In the distributed environment, if a node has UUIDs missing, it should do an initial sync rather than a repair. If none of the nodes in the replica set has the UUID for a collection, we should fix one node in standalone mode and let the other nodes initial sync from it.
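
As a rough illustration of that first step, a sketch (pymongo, placeholder host) that lists which collections on the node to be fixed are missing UUIDs:

    # Sketch: report collections on one node that have no UUID in their
    # listCollections metadata. Host is hypothetical.
    from pymongo import MongoClient

    client = MongoClient("localhost:27017")
    for db_name in client.list_database_names():
        if db_name == "local":
            continue  # non-replicated collections are handled at startup (SERVER-33246)
        for info in client[db_name].list_collections():
            if "uuid" not in info.get("info", {}):
                print(f"{db_name}.{info['name']} has no UUID")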
I'm closing this ticket as Won't Fix.
