[SERVER-74197] StaleConfig exceptions should not escape from the RouterRole loop Created: 20/Feb/23  Updated: 16/Jan/24  Resolved: 16/Jan/24

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Jordi Olivares Provencio
Resolution: Gone away Votes: 0
Labels: oldshardingemea
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-74380 Router role retry loop should allow n... Closed
Assigned Teams:
Catalog and Routing
Operating System: ALL
Sprint: CAR Team 2023-12-25, CAR Team 2024-01-08, CAR Team 2024-01-22
Participants:
Linked BF Score: 107

 Description   

StaleConfig exceptions are currently a form of WriteConflictException: they are the ShardRole's way of telling the upstream RouterRole that it did not route to the correct shard. By this definition, they should never escape the RouterRole loop, because an upstream RouterRole loop could misinterpret the exception to mean that it routed to the wrong place for one collection, when in fact the upstream router did not route to a ShardRole at all.

An example of this would be a recursive tree of $lookups operating on views.

In order to catch such exceptions wrongly propagated up the tree, the existing router role loop(s) have invariants here and here. Without this invariant, it is possible that the client of the router role loop (i.e., the lambda inside) makes a mistake and uses the routing info provided for namespace 1, but attaches it to namespace 2.
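
The scenario above can be modeled in a few lines of Python. This is only a sketch under stated assumptions, not the server's C++ code: `StaleConfig`, `router_role_loop`, `MAX_ATTEMPTS`, and the namespaces `db.outer` and `db.inner` are hypothetical stand-ins, and the limit of 10 attempts is taken from the figure quoted in this ticket's comments. The inner loop exhausts its attempts, its StaleConfig escapes, and the outer loop's namespace check (standing in for the invariant) trips.

```python
class StaleConfig(Exception):
    """Hypothetical stand-in for a shard rejecting stale routing info for `nss`."""

    def __init__(self, nss):
        super().__init__(f"stale routing info for {nss}")
        self.nss = nss


MAX_ATTEMPTS = 10  # assumed retry limit


def router_role_loop(nss, body, refresh):
    """Run body(nss); on StaleConfig for nss, refresh routing info and retry."""
    last = None
    for _ in range(MAX_ATTEMPTS):
        try:
            return body(nss)
        except StaleConfig as ex:
            # Models the invariant: the exception must concern the namespace
            # this loop routed for, otherwise it was wrongly propagated.
            if ex.nss != nss:
                raise AssertionError(
                    f"StaleConfig for {ex.nss} reached the loop for {nss}"
                )
            refresh(nss)
            last = ex
    # Attempts exhausted: the last StaleConfig escapes this loop.
    raise last


def always_stale(nss):
    raise StaleConfig(nss)


def outer_body(nss):
    # The outer operation runs a nested router loop for another namespace,
    # as a $lookup on a view might.
    return router_role_loop("db.inner", always_stale, refresh=lambda n: None)


escaped = None
try:
    router_role_loop("db.outer", outer_body, refresh=lambda n: None)
except AssertionError as err:
    escaped = str(err)
```

Without the namespace check, the outer loop would refresh routing info for `db.outer` in response to a failure that concerned `db.inner`.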

Currently, it is possible that a certain combination of $lookups operating on views triggers this invariant.

This ticket is to introduce a different kind of StaleConfig exception which just indicates to the upstream router that it can retry and there is no action to be taken, such as refreshing.
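
One way to picture the proposed exception kind is the Python sketch below. Everything in it is an assumption made for illustration: the name `RetryableRoutingError`, the choice to convert an exhausted StaleConfig into it at the loop boundary, and the namespaces are hypothetical and do not describe an actual server design. The point shown is that the upstream loop retries its body without refreshing anything for its own namespace.

```python
class StaleConfig(Exception):
    """Hypothetical stand-in for a shard rejecting stale routing info for `nss`."""

    def __init__(self, nss):
        super().__init__(f"stale routing info for {nss}")
        self.nss = nss


class RetryableRoutingError(Exception):
    """Hypothetical 'retry only' kind: names no namespace to refresh."""


MAX_ATTEMPTS = 10  # assumed retry limit


def router_role_loop(nss, body, refresh):
    for _ in range(MAX_ATTEMPTS):
        try:
            return body(nss)
        except StaleConfig as ex:
            if ex.nss != nss:
                raise AssertionError(
                    f"StaleConfig for {ex.nss} reached the loop for {nss}"
                )
            refresh(nss)
        except RetryableRoutingError:
            # Nothing to refresh at this level; just run the body again.
            continue
    # Never let a namespace-specific StaleConfig leave the loop.
    raise RetryableRoutingError(f"routing for {nss} did not settle")


refreshed = []
inner_calls = {"n": 0}


def inner_body(nss):
    inner_calls["n"] += 1
    if inner_calls["n"] <= MAX_ATTEMPTS:
        raise StaleConfig(nss)  # stale throughout the first outer attempt
    return "ok"


def outer_body(nss):
    return router_role_loop("db.inner", inner_body, refreshed.append)


result = router_role_loop("db.outer", outer_body, refreshed.append)
```

In this run the inner loop refreshes `db.inner` ten times and gives up, the outer loop retries once without refreshing `db.outer`, and the second outer attempt succeeds.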



 Comments   
Comment by Jordi Olivares Provencio [ 16/Jan/24 ]

Closing this as Gone Away since the issue doesn't exist on master anymore and could be at most a theoretical liveness issue that would fail after 10 attempts of an operation that is already failing anyway.

Currently the only way we could hit this tassert in a multi-document transaction is by calling the shardVersionRetry helper functions. These functions are mostly used internally by DDL commands and so cannot be used by a multi-document transaction. The only place I can imagine that is accessible by the user is the MongosProcessInterface::lookupSingleDocument method, which is only used internally for change stream manipulation and resharding. These two cannot be used in multi-document transactions either.

Comment by Jordi Olivares Provencio [ 21/Dec/23 ]

After a discussion with kaloian.manassiev@mongodb.com we discovered this was an issue at the time of writing for views in particular. However, this has been fixed by SERVER-81233 on master.

It could still be a problem in the following situations:

  • We are in a distributed transaction and some downstream router can't retry.
  • Some downstream router exhausted the 10 retry attempts (this is extremely rare and could be due to migrations happening too quickly)

Multi-document transactions would fall under the first case since the routers would pass upstream the failure as it can't be retried. The upstream router would get confused since the exception would be for a different namespace altogether.
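
The first case can be sketched in Python as follows. The `can_retry` flag, the function names, and the namespaces are hypothetical and only model the behavior described in this comment: a downstream loop that is not allowed to retry passes the StaleConfig upstream after a single attempt, and the upstream caller receives an exception about a namespace it never routed for.

```python
class StaleConfig(Exception):
    """Hypothetical stand-in for a shard rejecting stale routing info for `nss`."""

    def __init__(self, nss):
        super().__init__(f"stale routing info for {nss}")
        self.nss = nss


attempts = []


def stale_shard(nss):
    attempts.append(nss)
    raise StaleConfig(nss)


def router_role_loop(nss, body, can_retry, max_attempts=10):
    last = None
    for _ in range(max_attempts):
        try:
            return body(nss)
        except StaleConfig as ex:
            if not can_retry:
                # For example inside a multi-document transaction:
                # the failure is passed upstream as-is.
                raise
            last = ex
    raise last


def upstream_body(nss):
    # The upstream operation targets db.local but runs a downstream
    # router loop for db.foreign that may not retry.
    return router_role_loop("db.foreign", stale_shard, can_retry=False)


seen_nss = None
try:
    upstream_body("db.local")
except StaleConfig as ex:
    seen_nss = ex.nss
```

Here `seen_nss` ends up as `db.foreign` after one downstream attempt, which is the mismatch that would confuse an upstream router loop working on `db.local`.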

Generated at Thu Feb 08 06:26:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.