-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Catalog
-
None
-
Catalog and Routing
-
3
Currently, the sharding bootstrap and initialization logic is scattered throughout the codebase and there are at least 4 ad-hoc created implementations of some state related to whether random disjoint parts of the sharding subsystem have reached some state of initialization.
These are:
- Grid::isInitialized()
- ShardingState::enabled()
- ShardingReady::isReady()
- ReplicaSetEndpointShardingState::supportsReplicaSetEndpoint()
All these represent complexity that leads to cognitive load and inability to reason about the state of the system. This ticket is to revamp the whole sharding bootstrap and initialization process and converge on a single "thing" to rule them all.
Ideally we should have a single object that represents it all and the recovery should consist of a single "local-data based recovery phase" and an "asynchronous phase". Something like this:
- Adding a shard should be a w:all kind of operation that is not considered successful until all nodes of a replica set have acknowledged that the node belongs to a sharded cluster. This is not an availability problem, because it is a one-time thing in the lifetime of a replica set - at the time it is added to a sharded cluster. Furthermore, we emulate this kind of behaviour, because rollback of the shard identity document will crash the node.
- The shard identity (the role of the node, the config server identity and the shard identity) should boot immediately after local recovery has completed (that is the knowledge of whether the node belongs to a sharded cluster)
- The rest of the recovery should run asynchronosly with the respective services such as the DDL coordinators pending on whatever asynchronous recovery needs to run (if any, for example if the config server needs to be contacted).
In a simplified flow it would look something like that:
– Invoke some method runShardingStateRecovery
– Look for some documents on disk and the queryable backup mode parameter
– Set-up the sharding services on the Grid
– Set-up the sharding services that are config/shard specific
– Call SS::recoveryCompleted()
- is duplicated by
-
SERVER-83604 Unify sharding initialization for config and shard role
- Closed
-
SERVER-83753 Complete TODO listed in SERVER-83326
- Closed
-
SERVER-84270 Get rid of all Grid::*initialized* methods
- Closed
-
SERVER-84407 Complete TODO listed in SERVER-84334
- Closed
- is related to
-
SERVER-83326 Investigate the ability to enable the sharding status of a node at runtime
- Closed