[SERVER-74197] StaleConfig exceptions should not escape from the RouterRole loop Created: 20/Feb/23 Updated: 16/Jan/24 Resolved: 16/Jan/24 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Jordi Olivares Provencio |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | oldshardingemea | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Catalog and Routing
|
||||||||
| Operating System: | ALL | ||||||||
| Sprint: | CAR Team 2023-12-25, CAR Team 2024-01-08, CAR Team 2024-01-22 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 107 | ||||||||
| Description |
|
The current StaleConfig exceptions are a form of WriteConflictException: they are the way the ShardRole tells the upstream RouterRole that the request was not routed to the correct shard. By this definition, they should never escape the RouterRole loop. If they do, an upstream RouterRole loop can misinterpret the exception to mean that it routed to the wrong place for one collection, when in fact the upstream router did not even route to a ShardRole. An example of this would be a recursive tree of {{$lookup}}s operating on views. To catch exceptions wrongly propagated up the tree, the existing router role loop(s) contain invariants. Without these invariants, the client of the router role loop (i.e., the lambda inside) could make a mistake and take the routing info provided for namespace 1 but attach it to namespace 2. Currently, a certain combination of {{$lookup}}s operating on views can trigger this invariant. This ticket is to introduce a different kind of StaleConfig exception which only tells the upstream router that it can retry, with no further action such as refreshing to be taken. |
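The behaviour described above can be sketched roughly as follows. This is an illustrative Python model only, not the server's C++ implementation: the names {{router_loop}}, {{RetryNoRefresh}}, {{refresh}} and the namespaces are hypothetical, and the default of 10 attempts is an assumption borrowed from the retry limit mentioned in the comments. It shows a loop that refreshes and retries on a StaleConfig for its own namespace, asserts if one arrives for a different namespace, and raises a retry-only exception instead of letting a StaleConfig escape.

```python
class StaleConfig(Exception):
    """Models the ShardRole reporting that `nss` was routed with stale info."""

    def __init__(self, nss):
        super().__init__(f"stale routing info for {nss}")
        self.nss = nss


class RetryNoRefresh(Exception):
    """Hypothetical exception kind: upstream may retry, nothing to refresh."""


def router_loop(nss, body, refresh, max_attempts=10):
    """Hypothetical router role loop for the single namespace `nss`."""
    for _ in range(max_attempts):
        try:
            return body()
        except StaleConfig as ex:
            if ex.nss != nss:
                # Stand-in for the invariant: a StaleConfig for another
                # namespace has escaped a nested loop.
                raise AssertionError(
                    f"StaleConfig for {ex.nss} reached the loop for {nss}"
                ) from ex
            refresh(nss)  # our own routing info was stale: refresh, retry
        except RetryNoRefresh:
            pass  # a nested loop gave up: retry without refreshing
    # Never let StaleConfig escape; only signal that a retry is possible.
    raise RetryNoRefresh(f"gave up routing {nss} after {max_attempts} attempts")


# Nested loops, loosely modelling a $lookup on a view inside another operation.
refreshed = []
failures = {"count": 2}  # the inner target is stale for the first two attempts


def inner_body():
    if failures["count"] > 0:
        failures["count"] -= 1
        raise StaleConfig("db.view")
    return "ok"


def outer_body():
    return router_loop("db.view", inner_body, refreshed.append)


result = router_loop("db.outer", outer_body, refreshed.append)
```

In this model the inner loop absorbs both stale attempts, so the outer loop never sees a StaleConfig and only {{db.view}} is refreshed.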
| Comments |
| Comment by Jordi Olivares Provencio [ 16/Jan/24 ] |
|
Closing this as Gone Away since the issue no longer exists on master and would be, at most, a theoretical liveness issue that fails after 10 attempts of an operation that is already failing anyway. Currently, the only way we could hit this tassert in a multi-document transaction is by calling the shardVersionRetry helper functions. These functions are mostly used internally by DDL commands, which cannot run in a multi-document transaction. The only place I can imagine that is accessible to the user is the MongosProcessInterface::lookupSingleDocument method, which is only used internally for change stream manipulation and resharding. Neither of these can be used in multi-document transactions either. |
| Comment by Jordi Olivares Provencio [ 21/Dec/23 ] |
|
After a discussion with kaloian.manassiev@mongodb.com we discovered that, at the time of writing, this was an issue for views in particular. However, this has since been fixed. It could still be a problem in the following situations:
Multi-document transactions would fall under the first case, since the routers would pass the failure upstream because it can't be retried. The upstream router would then get confused, since the exception would be for a different namespace altogether. |