[DOCS-13404] Investigate changes in SERVER-45981: Prevent duplicating action upon receiving notice that a given shard is stale
Created: 11/Feb/20  Updated: 13/Nov/23

Status: Closed
Project: Documentation
Component/s: manual
Affects Version/s: None
Fix Version/s: 4.3.4, Server_Docs_20231030, Server_Docs_20231106, Server_Docs_20231105, Server_Docs_20231113

Type: Task
Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team
Assignee: Unassigned
Resolution: Won't Do
Votes: 0
Labels: docs-sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
documents SERVER-45981 Prevent duplicating action upon recei... Closed
Participants:
Days since reply: 1 year, 14 weeks, 2 days ago
Epic Link: DOCSP-12974

 Description   

Downstream Change Summary

The mongos can now throw a new error, 'ShardInvalidatedForTargeting'. It is a TransientTransactionError, so existing error handling for that category of errors can also cover it.
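
As an illustration of folding the new error into handling for transient transaction errors, here is a minimal Python sketch of a retry wrapper. It does not use a real driver: `LabeledError`, `run_transaction_with_retry`, and their parameters are hypothetical names introduced for this example, and the sketch assumes the error surfaces to the application with a "TransientTransactionError" label attached.

```python
import time

TRANSIENT_LABEL = "TransientTransactionError"


class LabeledError(Exception):
    """Hypothetical stand-in for a driver exception that carries error labels."""

    def __init__(self, code_name, labels=()):
        super().__init__(code_name)
        self.code_name = code_name
        self.labels = frozenset(labels)

    def has_error_label(self, label):
        return label in self.labels


def run_transaction_with_retry(txn_body, max_attempts=3, backoff_seconds=0.0):
    """Call txn_body, retrying only when the failure is labeled transient."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_body()
        except LabeledError as exc:
            transient = exc.has_error_label(TRANSIENT_LABEL)
            # Re-raise non-transient errors, and transient ones once the
            # retry budget is used up.
            if not transient or attempt == max_attempts:
                raise
            time.sleep(backoff_seconds)
```

Because the wrapper keys on the label rather than on the error name, a ShardInvalidatedForTargeting error would be retried by the same path as any other transient transaction error, with no special case.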

Description of Linked Ticket

Background

When the router receives a StaleShardVersion (SSV) error from a node, it marks either the shard as stale or the entire collection as needing refresh, depending on which criteria the error meets. If the router then attempts to target a stale shard, another SSV error is thrown. This second SSV error is thrown so that the router's retry loop unwinds and blocks behind a refresh.

Problem Statement

The two processes described above, (1) marking a shard as stale and (2) blocking behind a refresh when attempting to access a stale shard, interact in undesired ways. The SSV error thrown when attempting to access a stale shard carries dummy information: the received and wanted shard versions are both empty objects. As a result, when the router catches this dummy SSV, it processes it as if it were an SSV received from another node.

Because the dummy SSV is empty, it causes the router to invalidate the entire collection. An empty SSV usually indicates that a collection has been dropped or newly sharded, so the router's response is working as designed. However, this means that processing a stale shard immediately marks the entire collection as stale, rendering PM-1633 redundant.
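
The interaction described above can be modeled with a small Python sketch. This is a toy model and not the server's C++ code: `RouterCache`, `on_stale_shard_version`, and `target` are hypothetical names, and the rule "empty versions on both sides means invalidate the collection" is taken from the description above.

```python
class StaleShardVersion(Exception):
    """Toy SSV error carrying received and wanted shard versions."""

    def __init__(self, shard_id, received=None, wanted=None):
        super().__init__(shard_id)
        self.shard_id = shard_id
        self.received = received or {}
        self.wanted = wanted or {}


class RouterCache:
    """Hypothetical model of the router's staleness bookkeeping."""

    def __init__(self):
        self.stale_shards = set()
        self.collection_needs_refresh = False

    def on_stale_shard_version(self, err):
        # Empty versions on both sides look like a dropped or newly
        # sharded collection, so the whole collection is invalidated.
        if not err.received and not err.wanted:
            self.collection_needs_refresh = True
        else:
            self.stale_shards.add(err.shard_id)

    def target(self, shard_id):
        # Pre-fix behavior: targeting a stale shard raises a dummy SSV
        # with empty received and wanted versions.
        if shard_id in self.stale_shards:
            raise StaleShardVersion(shard_id)
        return shard_id
```

In this model, a real SSV with populated versions marks only the shard as stale. The next call to `target` for that shard raises the dummy SSV, and passing that error back to `on_stale_shard_version` flips `collection_needs_refresh`, which is the double processing the ticket describes.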

Proposed Solution

Using the same exception type, StaleShardVersion, for the "shard has been marked stale" exception causes the system to process these errors twice. Instead, we should create a new exception type: ShardInvalidatedForTargeting.

We will catch this new exception type at the top of the router and in all other relevant command loops that need to process a StaleShardVersion. Upon receiving this new exception, the router marks the operation to stall on a refresh via setOperationShouldBlockBehindCatalogCacheRefresh(). However, processing this new exception skips all processing specific to StaleShardVersion errors, namely marking a shard as stale or marking a collection as needing refresh.
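
A rough Python sketch of the proposed handling follows. It is a simplified model under stated assumptions, not the actual router implementation: the `Router` class, its `events` log, and `run` are hypothetical, and `set_operation_should_block_behind_catalog_cache_refresh` is only a stand-in for the setOperationShouldBlockBehindCatalogCacheRefresh() call named above.

```python
class StaleShardVersion(Exception):
    def __init__(self, shard_id, received=None, wanted=None):
        super().__init__(shard_id)
        self.shard_id = shard_id
        self.received = received or {}
        self.wanted = wanted or {}


class ShardInvalidatedForTargeting(Exception):
    def __init__(self, shard_id):
        super().__init__(shard_id)
        self.shard_id = shard_id


class Router:
    """Hypothetical model of a router retry loop with the new exception."""

    def __init__(self):
        self.stale_shards = set()
        self.collection_needs_refresh = False
        self.block_behind_refresh = False
        self.refreshes = 0
        self.events = []

    def set_operation_should_block_behind_catalog_cache_refresh(self):
        self.block_behind_refresh = True

    def process_ssv(self, err):
        # SSV-specific processing: mark the shard or the whole collection.
        if not err.received and not err.wanted:
            self.collection_needs_refresh = True
            self.events.append("collection_invalidated")
        else:
            self.stale_shards.add(err.shard_id)
            self.events.append("shard_marked_stale")

    def refresh(self):
        self.stale_shards.clear()
        self.collection_needs_refresh = False
        self.block_behind_refresh = False
        self.refreshes += 1

    def target(self, shard_id):
        # Post-fix behavior: a stale shard raises the dedicated exception.
        if shard_id in self.stale_shards:
            raise ShardInvalidatedForTargeting(shard_id)
        return shard_id

    def run(self, op, max_attempts=3):
        for _ in range(max_attempts):
            if self.block_behind_refresh:
                self.refresh()
            try:
                return op(self)
            except StaleShardVersion as err:
                self.process_ssv(err)
                self.set_operation_should_block_behind_catalog_cache_refresh()
            except ShardInvalidatedForTargeting:
                # Only stall on a refresh; skip all SSV processing.
                self.events.append("blocked_behind_refresh")
                self.set_operation_should_block_behind_catalog_cache_refresh()
        raise RuntimeError("retries exhausted")
```

In this model, targeting a shard that was already marked stale leads to one refresh and a successful retry, and the collection is never invalidated as a side effect.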

Scope of changes

Impact to Other Docs

MVP (Work and Date)

Resources (Scope or Design Docs, Invision, etc.)



 Comments   
Comment by Education Bot [ 31/Oct/22 ]

Hello! This ticket has been closed due to inactivity. If you believe this ticket is still important, please reopen it and leave a comment to explain why. Thank you!
