Mitigate the impact of paging-in the filtering table after a shard node restart


    • Type: Improvement
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.6.23, 4.2.14, 4.4.6, 4.0.25, 5.0.0-rc4
    • Component/s: None
    • Sharding EMEA

      On shard nodes, the routing/filtering table is persisted on disk in the config.system.cache.chunks.* collections, and secondary nodes obtain updates to these tables by contacting the primary and waiting for it to write the most recent information.

      Whenever a shard node receives its first request after a restart, the following steps occur:

      1. Secondary asks the primary to refresh from the config server and write the most-recent routing info
      2. Secondary waits for the write performed by the primary to show up locally
      3. Secondary reads and materialises these collections into a std::map used by the OrphanFilter
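      Step (3) can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual server code: the ChunkEntry type, field names, and materialiseFilteringTable function are invented for the example; only the idea of loading persisted chunk documents into a std::map consulted by the orphan filter comes from the ticket.

      ```cpp
      #include <cassert>
      #include <map>
      #include <string>
      #include <vector>

      // Hypothetical shape of one persisted chunk document from the
      // config.system.cache.chunks.* cache collections (names invented).
      struct ChunkEntry {
          std::string minKey;   // lower bound of the chunk's range
          std::string shardId;  // shard that owns the chunk
      };

      // Step 3: materialise the persisted chunk documents into the
      // in-memory map consulted when filtering orphaned documents.
      // With millions of chunks, this read-and-build pass is what pages
      // the filtering table back in after a restart.
      std::map<std::string, std::string> materialiseFilteringTable(
              const std::vector<ChunkEntry>& persistedChunks) {
          std::map<std::string, std::string> table;
          for (const auto& chunk : persistedChunks) {
              table.emplace(chunk.minKey, chunk.shardId);
          }
          return table;
      }

      int main() {
          // Steps (1) and (2), the refresh and the replication wait, are
          // elided; assume the primary's latest write is now visible here.
          std::vector<ChunkEntry> persisted = {
              {"a", "shard0"}, {"m", "shard1"}, {"t", "shard0"}};

          auto table = materialiseFilteringTable(persisted);
          assert(table.size() == 3);
          assert(table.at("m") == "shard1");
          return 0;
      }
      ```

      The sketch deliberately omits steps (1) and (2); those are the replication-dependent steps whose latency this ticket is about, whereas step (3) is a local read once the data is on disk.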

      Because a node is cold after a restart and potentially has some replication lag, steps (1) and (2) above can take a very long time for customers with large routing tables (millions of chunks) and a high rate of writes, which in turn leads to minutes of downtime for operations against that node.

      This is normally only a problem after a node crashes and re-joins as a secondary, and it impacts all secondary reads which hit that node.

      This ticket is a placeholder for finding a solution to, or mitigation of, this problem until we implement the changes to minimise the routing table size.

            Assignee:
            [DO NOT USE] Backlog - Sharding EMEA
            Reporter:
            Kaloian Manassiev
            Votes:
            2
            Watchers:
            11
