DOCS-13404

Investigate changes in SERVER-45981: Prevent duplicating action upon receiving notice that a given shard is stale

      Description

      Downstream Change Summary

      mongos can now throw a new error, 'ShardInvalidatedForTargeting'. It is a TransientTransactionError, so existing error handling for that category of errors can cover it.
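
      As a rough illustration of folding the new error into existing handling, the sketch below retries a transaction body whenever the raised exception carries the TransientTransactionError label. It assumes a driver whose exceptions expose a has_error_label() method (PyMongo's PyMongoError does); run_with_retry, is_transient, max_attempts, and backoff_seconds are hypothetical names chosen for this example, not part of any driver API.

```python
import time

TRANSIENT_LABEL = "TransientTransactionError"


def is_transient(exc):
    """Return True if exc exposes has_error_label() and carries the transient label."""
    has_label = getattr(exc, "has_error_label", None)
    return callable(has_label) and bool(has_label(TRANSIENT_LABEL))


def run_with_retry(txn_body, max_attempts=3, backoff_seconds=0.0):
    """Call txn_body(), re-running the whole body on transient transaction errors.

    Hypothetical helper: errors without the label, or a transient error on the
    final attempt, are re-raised to the caller unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_body()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts:
                raise
            time.sleep(backoff_seconds)
```

      Because the check is on the label rather than on a specific error name, a handler of this shape should not need a dedicated branch for ShardInvalidatedForTargeting.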

      Description of Linked Ticket

      Background

      When the router receives a StaleShardVersion (SSV) error from a node, it marks either the shard as stale or the entire collection as needing a refresh, depending on which criteria are met. If the router then attempts to target a stale shard, another SSV error is thrown. This second SSV error is thrown so that the router's retry loop unwinds and blocks behind a refresh.

      Problem Statement

      The two processes described above, (1) marking a shard as stale and (2) blocking behind a refresh when an operation attempts to access a stale shard, interact in undesired ways. The SSV error thrown on an attempt to access a stale shard carries dummy information: it has empty objects for both the received and wanted shard versions. As a result, when the router catches this dummy SSV, it processes it as if it were an SSV received from another node.

      Since the dummy SSV is empty, it causes the router to invalidate the entire collection. An empty SSV usually indicates that a collection has been dropped or is newly sharded, so the router's response is working as designed. However, the effect is that processing a stale shard immediately marks the entire collection as stale, which renders PM-1633 redundant.
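
      The interaction can be reproduced in a toy model. The sketch below is not the server's code: ToyRouter, its fields, and the rule "non-empty versions mark only the shard" are assumptions made for illustration, since the real criteria are not spelled out in this ticket.

```python
class StaleShardVersion(Exception):
    """Toy stand-in for an SSV error; empty dicts model the dummy payload."""

    def __init__(self, shard_id, received=None, wanted=None):
        super().__init__(shard_id)
        self.shard_id = shard_id
        self.received = received or {}
        self.wanted = wanted or {}


class ToyRouter:
    """Hypothetical router that tracks stale shards and collection staleness."""

    def __init__(self):
        self.stale_shards = set()
        self.collection_needs_refresh = False

    def target(self, shard_id):
        # Targeting a shard already marked stale throws a dummy SSV (empty versions).
        if shard_id in self.stale_shards:
            raise StaleShardVersion(shard_id)
        return f"sent to {shard_id}"

    def on_stale_shard_version(self, err):
        # Empty versions are read as "collection dropped or newly sharded".
        if not err.received and not err.wanted:
            self.collection_needs_refresh = True
        else:
            # Assumption: any non-empty SSV marks only the shard as stale.
            self.stale_shards.add(err.shard_id)


router = ToyRouter()
# A real SSV from a node marks only the shard.
router.on_stale_shard_version(StaleShardVersion("shardA", {"v": 1}, {"v": 2}))
marked_after_real_ssv = router.collection_needs_refresh
# The next targeting attempt throws the dummy SSV, which is processed like a real one.
try:
    router.target("shardA")
except StaleShardVersion as err:
    router.on_stale_shard_version(err)
marked_after_dummy_ssv = router.collection_needs_refresh
```

      In this model the collection is untouched after the first, real SSV and flagged for refresh after the dummy one, which is the double processing described above.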

      Proposed Solution

      Because the "shard has been marked stale" exception has the same exception type as a StaleShardVersion, the system processes these errors twice. Instead, we should create a new exception type: ShardInvalidatedForTargeting.

      We will catch this new exception type at the top of the router and in all other relevant command loops that would need to process a StaleShardVersion. Upon receiving this new exception, the router marks the operation to stall on a refresh via setOperationShouldBlockBehindCatalogCacheRefresh(). However, processing this new exception skips all processing specific to StaleShardVersion errors, namely marking a shard as stale or marking a collection as needing a refresh.
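
      A minimal sketch of the proposed handling, again as a toy model rather than the server implementation: the class ToyRouter and the flag block_behind_refresh are hypothetical, with the flag standing in for the effect of setOperationShouldBlockBehindCatalogCacheRefresh().

```python
class StaleShardVersion(Exception):
    """Toy stand-in for an SSV error received from a node."""

    def __init__(self, shard_id):
        super().__init__(shard_id)
        self.shard_id = shard_id


class ShardInvalidatedForTargeting(Exception):
    """Toy stand-in for the new 'shard has been marked stale' exception."""


class ToyRouter:
    """Hypothetical router with separate handling for the two exception types."""

    def __init__(self):
        self.stale_shards = set()
        self.collection_needs_refresh = False
        self.block_behind_refresh = False

    def target(self, shard_id):
        # A shard already marked stale now raises the dedicated exception type.
        if shard_id in self.stale_shards:
            raise ShardInvalidatedForTargeting(shard_id)
        return f"sent to {shard_id}"

    def run(self, shard_id):
        """Return the targeting result, or None if an error was handled."""
        try:
            return self.target(shard_id)
        except ShardInvalidatedForTargeting:
            # Only stall behind a refresh; no shard or collection marking happens.
            self.block_behind_refresh = True
        except StaleShardVersion as err:
            # SSV-specific processing stays on its own path.
            self.stale_shards.add(err.shard_id)
        return None
```

      With separate except branches, the stale-shard case can no longer fall into the path that invalidates the whole collection.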

      Scope of changes

      Impact to Other Docs

      MVP (Work and Date)

      Resources (Scope or Design Docs, Invision, etc.)

            Assignee:
            Unassigned
            Reporter:
            backlog-server-pm Backlog - Core Eng Program Management Team
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              1 year, 24 weeks, 3 days ago