[DOCS-13404] Investigate changes in SERVER-45981: Prevent duplicating action upon receiving notice that a given shard is stale
Created: 11/Feb/20 Updated: 13/Nov/23
| Status: | Closed |
| Project: | Documentation |
| Component/s: | manual |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.4, Server_Docs_20231030, Server_Docs_20231106, Server_Docs_20231105, Server_Docs_20231113 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Backlog - Core Eng Program Management Team | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | docs-sharding |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Participants: | |
| Days since reply: | 1 year, 14 weeks, 2 days ago |
| Epic Link: | DOCSP-12974 |
| Description |
Downstream Change Summary
A new error, 'ShardInvalidatedForTargeting', can be thrown from mongos. It is a TransientTransactionError, so it can be folded into existing error handling for that category of errors.

Description of Linked Ticket

Background
When the router receives a StaleShardVersion (SSV) error from a node, it marks either the shard as stale or the entire collection as needing refresh, depending on the requisite criteria. If the router then attempts to target a stale shard, another SSV error is thrown. This second SSV error is thrown so that the router's retry loop unwinds and blocks behind a refresh.

Problem Statement
The two initial processes (1. marking a shard as stale, and 2. blocking behind an attempt to access a stale shard) interact in undesired ways. The SSV error thrown when attempting to access a stale shard carries dummy information: it has empty objects for both the received and the wanted shard versions. As a result, when the router catches this dummy SSV, it processes it as if it were an SSV received from another node. Because the dummy SSV is empty, it causes the router to invalidate the entire collection. An empty SSV usually indicates that a collection has been dropped or that it is newly sharded, so the router's response is working as designed. However, this means that processing a stale shard immediately marks the entire collection as stale, rendering PM-1633 redundant.

Proposed Solution
Having the "shard has been marked stale" exception be the same exception type as a StaleShardVersion causes the system to process these errors twice. Instead, we should create a new exception type: ShardInvalidatedForTargeting. We will catch this new exception type at the top of the router and in all other relevant command loops that would need to process a StaleShardVersion.
Upon receiving this new exception, the router would mark the operation to stall on a refresh via setOperationShouldBlockBehindCatalogCacheRefresh(). However, processing this new exception would skip any processing relevant to StaleShardVersion errors, specifically marking a shard as stale or marking a collection as needing refresh.

Scope of changes
Impact to Other Docs
MVP (Work and Date)
Resources (Scope or Design Docs, Invision, etc.) |
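The proposed split between the two exception types can be illustrated with a toy model. This is a minimal sketch in Python, not the server's C++ implementation: only the names StaleShardVersion and ShardInvalidatedForTargeting come from the ticket, while ToyRouter, its attributes, and its methods are hypothetical and exist purely for illustration. The block_behind_refresh flag stands in for what the ticket describes as setOperationShouldBlockBehindCatalogCacheRefresh().

```python
class StaleShardVersion(Exception):
    """Toy stand-in for an SSV error; empty versions model the 'dummy' SSV."""

    def __init__(self, received=None, wanted=None):
        super().__init__("StaleShardVersion")
        self.received = received or {}
        self.wanted = wanted or {}


class ShardInvalidatedForTargeting(Exception):
    """Toy stand-in for the new error raised when targeting a stale shard."""


class ToyRouter:
    """Hypothetical router model; all attribute and method names are assumptions."""

    def __init__(self):
        self.stale_shards = set()
        self.collection_invalidations = 0  # counts whole-collection invalidations
        self.block_behind_refresh = False  # models blocking behind a cache refresh
        self.refreshes = 0

    def target(self, shard):
        # Targeting a shard already marked stale raises the new exception
        # type instead of a dummy (empty) StaleShardVersion.
        if shard in self.stale_shards:
            raise ShardInvalidatedForTargeting(shard)
        return shard

    def on_stale_shard_version(self, shard, err):
        # An empty SSV is treated as "collection dropped or newly sharded",
        # so the whole collection is invalidated; otherwise only the shard.
        if not err.received and not err.wanted:
            self.collection_invalidations += 1
        else:
            self.stale_shards.add(shard)

    def refresh(self):
        self.stale_shards.clear()
        self.block_behind_refresh = False
        self.refreshes += 1

    def run(self, shard, send, max_attempts=3):
        for _ in range(max_attempts):
            if self.block_behind_refresh:
                self.refresh()
            try:
                return send(self.target(shard))
            except ShardInvalidatedForTargeting:
                # Only stall on a refresh; skip all SSV processing.
                self.block_behind_refresh = True
            except StaleShardVersion as err:
                self.on_stale_shard_version(shard, err)
                self.block_behind_refresh = True
        raise RuntimeError("retries exhausted")


if __name__ == "__main__":
    router = ToyRouter()
    router.stale_shards.add("shardA")
    print(router.run("shardA", lambda s: "ok from " + s))
    print(router.collection_invalidations)
```

In this model, an operation that targets an already-stale shard waits for one refresh and then retries, and the whole-collection invalidation counter stays at zero. Passing an empty StaleShardVersion to on_stale_shard_version, which models the old dummy error, increments that counter instead.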
| Comments |
| Comment by Education Bot [ 31/Oct/22 ] |
Hello! This ticket has been closed due to inactivity. If you believe this ticket is still important, please reopen it and leave a comment to explain why. Thank you! |