[SERVER-58092] Mitigate the impact of paging-in the filtering table after a shard node restart

| Created: | 25/Jun/21 | Updated: | 09/Jan/23 | Resolved: | 03/Mar/22 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.6.23, 4.2.14, 4.4.6, 4.0.25, 5.0.0-rc4 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | [DO NOT USE] Backlog - Sharding EMEA |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Assigned Teams: | Sharding EMEA |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
On shard nodes, the routing/filtering table is kept on disk in the config.system.cache.chunks.* collections, and secondary nodes request updates to these tables by contacting the primary and waiting for it to write the most-recent information. Whenever a shard receives a request for the first time after a restart, the following steps occur:

1. The node refreshes its routing information: on a secondary, this means asking the primary to fetch and persist the most-recent routing table and then waiting for those writes to replicate locally.
2. The node reads the entire persisted table from the config.system.cache.chunks.* collections on disk in order to page the filtering table back into memory.
After a restart, the node's caches are cold and it may have some replication lag, so for customers with large routing tables (millions of chunks) and a high rate of writes, steps (1) and (2) above can take a very long time, which in turn leads to minutes of downtime for operations against that node. This is normally only a problem after a node crashes and re-joins as a secondary, where it impacts all secondary reads that hit that node. This ticket is a placeholder for finding a solution to, or mitigation of, this problem until we implement the changes that minimise the routing table size.
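As an illustration of why the paging-in step is expensive, the sketch below (not part of the original ticket) estimates how much persisted filtering-table data a single shard member would have to read from disk after a restart. It assumes pymongo is available, that SHARD_URI is a hypothetical connection string pointing directly at the shard member, and that the caller can read the config database; it simply sums collStats over the persisted cache.chunks collections.

```python
# Hedged sketch, not from the ticket: sum the on-disk size of the persisted
# filtering/routing table on one shard member. A node restarting with a cold
# cache has to page roughly this much data back in before it can serve
# versioned operations.
from pymongo import MongoClient

# Assumption: direct connection string to one shard member (not mongos).
SHARD_URI = "mongodb://shard0-node0.example.net:27018"

client = MongoClient(SHARD_URI, directConnection=True)
config_db = client["config"]

total_docs = 0
total_storage = 0
for name in config_db.list_collection_names():
    # The persisted routing/filtering table lives in the cache.chunks.* collections.
    if "cache.chunks" not in name:
        continue
    stats = config_db.command("collStats", name)
    total_docs += stats.get("count", 0)
    total_storage += stats.get("storageSize", 0)
    print(f"{name}: {stats.get('count', 0)} chunk documents, "
          f"{stats.get('storageSize', 0) / (1024 * 1024):.1f} MiB on disk")

print(f"TOTAL: {total_docs} chunk documents, "
      f"{total_storage / (1024 * 1024):.1f} MiB to page in from a cold cache")
client.close()
```

With millions of chunk documents this can amount to hundreds of megabytes or more of cold data, which is why the paging-in step alone can dominate the latency of the first operations against a freshly restarted node.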