Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Catalog
Labels:
None

Assigned Teams:

Catalog and Routing
Story Points:
3
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Currently, the sharding bootstrap and initialization logic is scattered throughout the codebase, and there are at least 4 ad-hoc implementations of state that tracks whether some disjoint part of the sharding subsystem has reached a particular stage of initialization.

These are:

All of these add complexity, which increases cognitive load and makes it hard to reason about the state of the system. This ticket is to revamp the whole sharding bootstrap and initialization process and converge on a single "thing" to rule them all.

Ideally we should have a single object that represents it all and the recovery should consist of a single "local-data based recovery phase" and an "asynchronous phase". Something like this:

Adding a shard should be a w:all kind of operation that is not considered successful until all nodes of a replica set have acknowledged that the node belongs to a sharded cluster. This is not an availability problem, because it is a one-time event in the lifetime of a replica set: the moment it is added to a sharded cluster. Furthermore, we already emulate this kind of behaviour, because rollback of the shard identity document will crash the node.
The shard identity (the role of the node, the config server identity and the shard identity) should boot immediately after local recovery has completed (that is, once it is known whether the node belongs to a sharded cluster).
The rest of the recovery should run asynchronously, with the respective services, such as the DDL coordinators, pending on whatever asynchronous recovery needs to run (if any; for example, if the config server needs to be contacted).
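
To make the two-phase idea concrete, here is a minimal single-threaded sketch, not a proposal for the actual class. All names (ClusterRole, ShardingRecoveryState, onAsyncRecoveryDone and so on) are hypothetical and do not exist in the server codebase. It assumes the role becomes readable as soon as the local phase finishes, and that dependent services register a callback that fires once the asynchronous phase is done. A real implementation would need proper synchronization and would likely use the server's own future types.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical role a node learns from its local data.
enum class ClusterRole { None, ShardServer, ConfigServer };

// Hypothetical single object that tracks both recovery phases.
// Not thread-safe: this sketch ignores synchronization on purpose.
class ShardingRecoveryState {
public:
    // Phase 1: local-data based recovery. The role is known afterwards.
    void completeLocalRecovery(ClusterRole role) { _role = role; }

    // Empty until the local phase has completed.
    std::optional<ClusterRole> role() const { return _role; }

    // Services (for example a DDL coordinator) register work that must wait
    // for the asynchronous phase. Runs immediately if it already finished.
    void onAsyncRecoveryDone(std::function<void()> fn) {
        if (_asyncDone) {
            fn();
            return;
        }
        _waiters.push_back(std::move(fn));
    }

    // Phase 2: called once when the asynchronous recovery has finished.
    void completeAsyncRecovery() {
        assert(_role.has_value());  // the local phase must come first
        _asyncDone = true;
        for (auto& fn : _waiters) {
            fn();
        }
        _waiters.clear();
    }

    bool asyncRecoveryDone() const { return _asyncDone; }

private:
    std::optional<ClusterRole> _role;
    bool _asyncDone = false;
    std::vector<std::function<void()>> _waiters;
};
```

With this shape, a caller can read role() right after completeLocalRecovery(), while anything registered through onAsyncRecoveryDone() stays pending until completeAsyncRecovery() is called.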

In a simplified flow it would look something like this:

– Invoke some method runShardingStateRecovery

– Look for some documents on disk and the queryable backup mode parameter

– Set-up the sharding services on the Grid

– Set-up the sharding services that are config/shard specific

– Call SS::recoveryCompleted()
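
The flow above could be sketched roughly as follows. This is an illustration under stated assumptions, not the intended implementation: LocalData, RecoveryPhase and ShardingStateSketch are hypothetical stand-ins, the setup steps are only recorded as strings, and the queryable backup mode parameter is merely read and stored, because the ticket does not say how it should influence recovery.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical view of what is found on disk plus the server parameter.
struct LocalData {
    std::optional<std::string> shardIdentityDoc;  // absent if never added to a cluster
    bool isConfigServer = false;
    bool queryableBackupMode = false;
};

enum class RecoveryPhase { NotStarted, Completed };

// Hypothetical stand-in for the "SS" object in the flow; not the real class.
struct ShardingStateSketch {
    RecoveryPhase phase = RecoveryPhase::NotStarted;
    bool queryableBackupMode = false;
    std::vector<std::string> setupSteps;  // records what was set up, for illustration

    void recoveryCompleted() { phase = RecoveryPhase::Completed; }
};

void runShardingStateRecovery(const LocalData& local, ShardingStateSketch& ss) {
    // Look for documents on disk and the queryable backup mode parameter.
    ss.queryableBackupMode = local.queryableBackupMode;  // placeholder handling
    const bool sharded = local.shardIdentityDoc.has_value() || local.isConfigServer;

    if (sharded) {
        // Set up the sharding services on the Grid.
        ss.setupSteps.push_back("grid services");

        // Set up the services that are config/shard specific.
        ss.setupSteps.push_back(local.isConfigServer ? "config-specific services"
                                                     : "shard-specific services");
    }

    // Single point at which recovery is declared complete.
    ss.recoveryCompleted();
}
```

The point of the sketch is that recoveryCompleted() is called in exactly one place, so there is one state to inspect instead of several independent "initialized" flags.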

is duplicated by

SERVER-83604 Unify sharding initialization for config and shard role

Closed

SERVER-83753 Complete TODO listed in SERVER-83326

Closed

SERVER-84270 Get rid of all Grid::*initialized* methods

Closed

SERVER-84407 Complete TODO listed in SERVER-84334

Closed

is related to

SERVER-83326 Investigate the ability to enable the sharding status of a node at runtime

Closed

Assignee: Unassigned
Reporter: Kaloian Manassiev
Participants: Githook User, Kaloian Manassiev
Votes: 0
Watchers: 6

Created: Apr 15 2024 01:58:54 PM UTC
Updated: May 07 2024 02:24:00 PM UTC