- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Catalog and Routing
- ALL
- 200
- 🟩 Routing and Topology
SERVER-122427 revealed a problem in how we handle router role loops nested within a shard role loop: an exception raised in the router role can end up being handled by the shard role. That is, an exception received by the router role bubbles up to the shard role, so the shard role can end up handling a stale error that originated on a completely different shard.
Consider the following scenario:
- There are two collections: one sharded, and one unsharded and owned by the dbPrimary, shard1.
- mongos starts a multi-document transaction and sends shard0 an aggregation on the sharded collection with a lookup on the unsharded collection.
- shard0 executes the aggregation and sends the lookup to shard1, since shard1 is the dbPrimary for the unsharded collection.
- If there is a critical section on the database, the request sent to shard1 fails and shard1 responds to shard0 with a StaleDB error.
- The router role on shard0 processes the StaleDB error and decides to bubble it up because it is within a multi-document transaction. Alternatively, the loop can run out of retries and bubble up the error anyway.
- The shard role layer at shard0 then catches this error and proceeds to handle the dbVersion mismatch thrown by shard1. <- This is the bug
At this point we have broken the invariant that all StaleConfig errors seen by the shard role originate from that same shard. As a result, the shard role is processing a stale error from a different shard altogether.
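The scenario above can be modeled with a small toy sketch. This is not the server's code: `StaleError`, `Shard`, `router_role_loop`, and `shard_role_loop` are hypothetical names invented for illustration, and the "refresh" is reduced to recording which shard the handled error came from. It only shows how an error rethrown by a nested inner retry loop lands in the outer loop's handler without any check on its origin.

```python
from dataclasses import dataclass, field


class StaleError(Exception):
    """Toy stand-in for a stale-metadata error; records which shard raised it."""

    def __init__(self, origin_shard: str, code: str):
        super().__init__(f"{code} from {origin_shard}")
        self.origin_shard = origin_shard
        self.code = code


@dataclass
class Shard:
    name: str
    in_critical_section: bool = False
    # Origins of the stale errors this shard's shard role acted on.
    refreshes: list = field(default_factory=list)

    def run_lookup(self):
        # A critical section on the database makes the request fail as stale.
        if self.in_critical_section:
            raise StaleError(self.name, "StaleDbVersion")
        return ["doc"]


def router_role_loop(target: Shard, in_transaction: bool, max_retries: int = 3):
    """Inner loop (hypothetical): runs on shard0 but targets another shard."""
    for attempt in range(max_retries):
        try:
            return target.run_lookup()
        except StaleError:
            # Inside a transaction, or on the last attempt, rethrow to the caller.
            if in_transaction or attempt == max_retries - 1:
                raise


def shard_role_loop(local: Shard, remote: Shard, in_transaction: bool):
    """Outer loop (hypothetical) on shard0; assumes every stale error is its own."""
    try:
        return router_role_loop(remote, in_transaction)
    except StaleError as err:
        # The bug: nothing verifies that err.origin_shard == local.name.
        local.refreshes.append(err.origin_shard)
        return None


shard0 = Shard("shard0")
shard1 = Shard("shard1", in_critical_section=True)
shard_role_loop(shard0, shard1, in_transaction=True)
print(shard0.refreshes)  # ['shard1']
```

In this model, both the in-transaction path and the out-of-retries path leave shard0 acting on an error whose origin is shard1, which is the broken invariant described above.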
Attached is a reproducer for this issue that artificially forces a shard to return a StaleConfig error when processing a request. The same problem applies to a lookup on an unsharded collection.
- is depended on by
  - SERVER-122427 Inefficient resolution of convoy of database metadata routing refreshes (Blocked)
- is related to
  - SERVER-122427 Inefficient resolution of convoy of database metadata routing refreshes (Blocked)