[v8.2] Timeseries $out routed by a stale router can repeatedly fail to create timeseries view and fail with StaleConfig

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 7.0.0, 8.0.0, 8.2.0
    • Component/s: None
    • Catalog and Routing
    • ALL
    • 🟩 Routing and Topology

      In v8.2 and below, a timeseries $out targeting a different database can fail to converge and end up bubbling StaleConfig to the user in the following situation:

      1. Database "targetDb" exists on Shard1.
      2. However we have a stale router that believes "targetDb" is on Shard0.
      3. A timeseries $out from "sourceDb.sourceColl" to "targetDb.myTs" on the stale router heuristically decides to execute the $out on Shard0, since that's where it believes targetDb is. On its own, this is just a missed optimization.
      4. The $out runs almost to completion until it has to create the timeseries view on targetDb.myTs.
      5. The creation on Shard1 ends up bubbling a StaleConfig error because:
        1. Creating a legacy timeseries collection does two ShardVersion checks (one for the buckets NSS and one for the view NSS) and both can throw StaleConfig.
        2. The Shard Role loop on the ServiceEntryPoint will only refresh and retry once.
        3. The routing on cluster::createCollection is a DBPrimaryRouter so it won't retry StaleConfig.
      6. This error tears down the entire $out and causes it to be retried. The stale router still has not learned the placement of targetDb, and Shard1 still has not discovered the filtering metadata for targetDb.myTs, so all retries fail the same way until the max retries are exhausted and StaleConfig is bubbled up to the user.

      Notes:

      • This precise stale router setup is required so that Shard1 never learns the filtering metadata for targetDb.myTs and keeps failing twice on (5.1).
      • On v8.3+ this is fixed by SERVER-77402 which allows multiple retries on (5.2).
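
The interaction between (5.1), (5.2) and (6) can be sketched with a small toy model. This is not server code and not the attached reproducer: `StaleConfig`, `createTimeseries`, `runWithRetries` and `runOut` are hypothetical names made up for illustration, and the model assumes that every $out retry starts with Shard1 knowing nothing about either namespace, as described in step 6.

```javascript
// Toy model only: none of these names are MongoDB server APIs.
class StaleConfig extends Error {
  constructor(nss) {
    super(`StaleConfig for ${nss}`);
    this.nss = nss;
  }
}

// A legacy timeseries create checks the buckets NSS and then the view NSS.
const NAMESPACES = ["targetDb.system.buckets.myTs", "targetDb.myTs"];

// One create attempt: the version check throws for the first namespace
// whose filtering metadata the shard does not know yet.
function createTimeseries(knownNss) {
  for (const nss of NAMESPACES) {
    if (!knownNss.has(nss)) throw new StaleConfig(nss);
  }
  return "created";
}

// Refresh-and-retry loop with a fixed budget (maxRetries = 1 models 5.2).
function runWithRetries(maxRetries) {
  const knownNss = new Set(); // assumption: no metadata known at the start
  for (let attempt = 0; ; attempt++) {
    try {
      return createTimeseries(knownNss);
    } catch (e) {
      if (!(e instanceof StaleConfig) || attempt >= maxRetries) throw e;
      knownNss.add(e.nss); // "refresh" only the namespace that failed
    }
  }
}

// Outer $out retry: each attempt starts from scratch (assumption from step 6).
function runOut(maxOutAttempts, maxRetries) {
  let lastError;
  for (let i = 0; i < maxOutAttempts; i++) {
    try {
      return runWithRetries(maxRetries);
    } catch (e) {
      lastError = e;
    }
  }
  throw lastError;
}
```

In this model, `runOut(10, 1)` throws `StaleConfig` for the view namespace on every attempt, because the single retry is spent on the buckets namespace. `runOut(10, 2)` succeeds on the first attempt, which matches the idea of allowing multiple retries on (5.2).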

      A reproducer is attached.

      The proper fix is likely to do a very targeted backport of SERVER-77402 to v8.2 and lower.

        1. repro-SERVER-123635.js
          2 kB
          Joan Bruguera Micó

            Assignee:
            Unassigned
            Reporter:
            Joan Bruguera Micó