Two-phase write operations can fail on a stale router

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 8.0.0, 8.2.0
    • Component/s: None
    • Query Execution
    • ALL
    • Run the attached reproducible in the no_passthrough suite at commit r8.3.0-alpha0-3321-g01c498aae4a

      Write operations that use the two-phase protocol can fail with a NamespaceNotSharded error when executed against an unsharded collection through a stale router.

      The two-phase protocol is used for write operations (updates and deletes) that cannot be directly targeted to a single shard.

      The problem occurs when the router serving the write is stale: it believes the collection is sharded and therefore decides to use the two-phase write protocol. During execution, ClusterQueryWithoutShardKey receives a StaleInfo error from the shard, refreshes its cache, retries the two-phase protocol, and finally fails with "NamespaceNotSharded".

      The root cause is that the ClusterQueryWithoutShardKey command implements a router loop that swallows (intercepts and retries) the StaleInfo error. Instead, the error should bubble up to the write executor, so that after refreshing the cache and restarting the operation, the executor chooses the correct write protocol (single-phase vs. two-phase) according to the refreshed metadata.

      After the first failure, executing the write operation again succeeds because the cache has already been updated.
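
      The control flow described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the server code: the Router class, the shard_check method, the swallow_stale flag, and both functions are hypothetical names invented for illustration. The swallow_stale flag switches between the current behavior (the inner router loop retries on StaleInfo) and the proposed one (StaleInfo bubbles up to the write executor).

```python
class StaleInfo(Exception):
    """The shard rejected the request: the router's metadata is out of date."""


class NamespaceNotSharded(Exception):
    """The two-phase protocol was attempted on an unsharded collection."""


class Router:
    """Hypothetical router holding a possibly stale view of the collection."""

    def __init__(self, cached_sharded, actual_sharded):
        self.cached_sharded = cached_sharded  # what the router believes
        self.actual_sharded = actual_sharded  # authoritative state

    def refresh(self):
        self.cached_sharded = self.actual_sharded

    def shard_check(self):
        # The shard rejects requests built from outdated metadata.
        if self.cached_sharded != self.actual_sharded:
            raise StaleInfo()


def query_without_shard_key(router, swallow_stale):
    """Toy stand-in for the first phase of the two-phase protocol."""
    while True:
        if not router.cached_sharded:
            # The two-phase protocol only makes sense for sharded collections.
            raise NamespaceNotSharded()
        try:
            router.shard_check()
            return "two-phase"
        except StaleInfo:
            if not swallow_stale:
                raise  # proposed behavior: let the write executor decide
            router.refresh()  # current behavior: retry the same protocol


def execute_write(router, swallow_stale):
    """Toy write executor: picks the protocol from the cached metadata."""
    while True:
        try:
            if router.cached_sharded:
                return query_without_shard_key(router, swallow_stale)
            router.shard_check()
            return "single-phase"
        except StaleInfo:
            # Refresh, then restart so the protocol is chosen again.
            router.refresh()
```

      In this model, a router created with cached_sharded=True and actual_sharded=False raises NamespaceNotSharded on the first execute_write call when swallow_stale=True, and the second call succeeds because the cache was refreshed during the first one. With swallow_stale=False, the first call already succeeds using the single-phase path.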

            Assignee:
            Mihai Andrei
            Reporter:
            Tommaso Tocci
            Votes:
            0
            Watchers:
            7
