[SERVER-45981] Prevent duplicating action upon receiving notice that a given shard is stale Created: 05/Feb/20  Updated: 29/Oct/23  Resolved: 11/Feb/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.3.4

Type: Bug Priority: Major - P3
Reporter: Blake Oler Assignee: Blake Oler
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Documented
is documented by DOCS-13404 Investigate changes in SERVER-45981: ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2020-02-10, Sharding 2020-02-24
Participants:

 Description   

Background

When the router receives a StaleShardVersion (SSV) error from a node, it marks either the shard as stale or the entire collection as needing a refresh, depending on which criteria the error meets. If the router then attempts to target a shard that is marked stale, another SSV error is thrown. This second SSV error is thrown so that the router's retry loop unwinds and blocks behind a refresh.

Problem Statement

The two initial processes – 1. marking a shard as stale, and 2. blocking when an operation attempts to access a stale shard – interact in undesired ways. The SSV error thrown on an attempt to access a stale shard carries dummy information: empty objects for both the received and wanted shard versions. As a result, when the router catches this dummy SSV, it processes it as if it were an SSV received from another node.

Because the dummy SSV is empty, it causes the router to invalidate the entire collection. An empty SSV usually indicates that a collection has been dropped or is newly sharded, so the router's response is working as designed. However, this means that processing a stale shard immediately marks the entire collection as stale, which renders PM-1633 redundant.
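The interaction described above can be illustrated with a small standalone model. This is only a sketch and not the server's code: StaleInfo, RoutingCache, and reproduceDoubleProcessing are hypothetical names invented for illustration, and the shard versions are reduced to plain integers.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>

// Hypothetical stand-in for the payload of a StaleShardVersion error.
struct StaleInfo {
    std::string shardId;
    std::optional<int> received;  // empty in the dummy error
    std::optional<int> wanted;    // empty in the dummy error
};

// Hypothetical stand-in for the router's routing-table cache.
struct RoutingCache {
    std::set<std::string> staleShards;
    bool collectionNeedsRefresh = false;
    int collectionInvalidations = 0;

    // Models how the router processes any StaleShardVersion error.
    void onStaleShardVersion(const StaleInfo& info) {
        if (!info.received || !info.wanted) {
            // Empty versions look like a dropped or newly sharded collection,
            // so the whole collection is invalidated.
            collectionNeedsRefresh = true;
            ++collectionInvalidations;
            return;
        }
        staleShards.insert(info.shardId);
    }

    // Models targeting: a stale shard yields a dummy error with empty versions.
    std::optional<StaleInfo> target(const std::string& shardId) const {
        if (staleShards.count(shardId)) {
            return StaleInfo{shardId, std::nullopt, std::nullopt};
        }
        return std::nullopt;
    }
};

// Replays the sequence from the problem statement and returns the final state.
RoutingCache reproduceDoubleProcessing() {
    RoutingCache cache;
    cache.onStaleShardVersion({"shard0", 5, 6});  // real error from a node
    if (auto dummy = cache.target("shard0")) {    // retry hits the stale shard
        cache.onStaleShardVersion(*dummy);        // dummy is processed as if real
    }
    return cache;
}
```

In this model, a single stale shard ends with both the shard marked stale and the whole collection invalidated, which is the double processing the ticket describes.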

Proposed Solution

Because the "shard has been marked stale" exception has the same type as StaleShardVersion, the system processes these errors twice. Instead, we should create a new exception type: ShardInvalidatedForTargeting.

We will catch this new exception type at the top of the router and in all other relevant command loops that need to process a StaleShardVersion. Upon catching this new exception, the router marks the operation to stall on a refresh via setOperationShouldBlockBehindCatalogCacheRefresh(). Processing this new exception skips the processing specific to StaleShardVersion errors, namely marking a shard as stale or marking a collection as needing refresh.
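A minimal sketch of the proposed control flow, under stated assumptions: the exception structs, RouterState, attempt, and runWithRetries below are hypothetical models written for illustration, not the server's classes. The real setOperationShouldBlockBehindCatalogCacheRefresh() is represented only by a boolean flag, and the refresh itself is simulated inline.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical model of a StaleShardVersion error received from a node.
struct StaleShardVersionError : std::runtime_error {
    std::string shardId;
    bool emptyVersions;
    StaleShardVersionError(std::string id, bool empty)
        : std::runtime_error("stale shard version"),
          shardId(std::move(id)), emptyVersions(empty) {}
};

// Hypothetical model of the new exception type proposed in this ticket.
struct ShardInvalidatedForTargetingError : std::runtime_error {
    ShardInvalidatedForTargetingError()
        : std::runtime_error("shard invalidated for targeting") {}
};

struct RouterState {
    std::set<std::string> staleShards;
    int collectionInvalidations = 0;
    int refreshes = 0;
    bool blockBehindRefresh = false;  // models setOperationShouldBlockBehindCatalogCacheRefresh()
    bool nodeReportsStale = true;     // assumption: the node is stale until a refresh
    bool succeeded = false;
};

// Models one targeting attempt against a single shard.
void attempt(RouterState& st, const std::string& shardId) {
    if (st.staleShards.count(shardId)) {
        throw ShardInvalidatedForTargetingError();  // no dummy versions needed
    }
    if (st.nodeReportsStale) {
        throw StaleShardVersionError(shardId, /*empty=*/false);
    }
}

// Models the retry loop with a separate handler for each exception type.
RouterState runWithRetries(int maxAttempts) {
    RouterState st;
    for (int i = 0; i < maxAttempts && !st.succeeded; ++i) {
        try {
            attempt(st, "shard0");
            st.succeeded = true;
        } catch (const StaleShardVersionError& e) {
            // Normal StaleShardVersion processing.
            if (e.emptyVersions) {
                ++st.collectionInvalidations;
            } else {
                st.staleShards.insert(e.shardId);
            }
        } catch (const ShardInvalidatedForTargetingError&) {
            // Only block behind a refresh; skip all StaleShardVersion processing.
            st.blockBehindRefresh = true;
            st.staleShards.clear();       // simulated refresh completes
            st.nodeReportsStale = false;
            ++st.refreshes;
        }
    }
    return st;
}
```

With three attempts, the model marks the shard stale on the first, blocks behind one refresh on the second, and succeeds on the third without ever invalidating the whole collection.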



 Comments   
Comment by Githook User [ 11/Feb/20 ]

Author: Blake Oler (BlakeIsBlake) <blake.oler@mongodb.com>

Message: SERVER-45981 Prevent duplicating action upon receiving notice that a given shard is stale
Branch: master
https://github.com/mongodb/mongo/commit/8efa8a3dbe512d8f192248dbd9ecbd984d18bce2
