Optimize roundtrips on routers after an operation changed the collection metadata


    • Type: Task
    • Resolution: Won't Do
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Sharding
    • Sharding EMEA 2021-11-15, Sharding EMEA 2021-11-29, Sharding EMEA 2021-12-13, Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10, Sharding EMEA 2022-01-24, Sharding EMEA 2022-02-07, Sharding EMEA 2022-02-21

      When a DDL operation changes the full collection metadata (for example, a rename of the collection), the current behavior is to advance the collection version and mark the database's primary shard as stale. This is an optimization originally designed to improve the performance of migrations, where only the operations targeting the involved shards need to block while other operations can still be fulfilled. It has the following unintended consequence:

      Suppose a DDL operation on namespace nss has changed the collection metadata. Then:

      1. The router advances the shard version and marks the database's primary shard as stale
      2. Every operation that does not target the database's primary shard performs a roundtrip and then waits for the refresh of the cache entry
      3. Once all shards have been marked as stale, every operation on namespace nss waits for the new version
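      The roundtrip cost of the steps above can be illustrated with a minimal Python sketch. This is a toy model, not MongoDB's actual routing code: the class names, the single integer version, and the StaleConfig-and-retry flow are all simplifications introduced here for illustration.

```python
class Shard:
    def __init__(self, name, version):
        self.name = name
        self.version = version

    def handle(self, shard_version):
        # A shard rejects requests carrying an outdated version.
        return "OK" if shard_version >= self.version else "StaleConfig"


class Router:
    def __init__(self, shards, cached_version):
        self.shards = {s.name: s for s in shards}
        self.cached_version = cached_version
        self.roundtrips = 0

    def refresh(self):
        # Refresh the routing cache (modeled as learning the newest version).
        self.roundtrips += 1
        self.cached_version = max(s.version for s in self.shards.values())

    def run(self, shard_name):
        # Send the operation with the cached version; on StaleConfig,
        # refresh and retry, paying extra roundtrips.
        self.roundtrips += 1
        if self.shards[shard_name].handle(self.cached_version) == "StaleConfig":
            self.refresh()
            self.roundtrips += 1  # retry with the refreshed version


# After a DDL bumped the collection metadata, all shards are at version 2,
# but this router still caches version 1.
shards = [Shard("shard0", 2), Shard("shard1", 2), Shard("shard2", 2)]
router = Router(shards, cached_version=1)
for name in ("shard0", "shard1", "shard2"):
    router.run(name)
print(router.roundtrips)  # 5: three operations plus two extra for the stale cache
```

      In this model only the first stale operation pays the refresh penalty per router, but every router in the cluster pays it independently, which is the multiplication of roundtrips the ticket is concerned with.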

      This behavior is correct; however, a more optimal approach would be to wait directly for the new shard version. We could perform a linearizable invalidation on the router that issued the DDL command, but that would only improve performance on that one router. We should come up with a way to reduce the number of roundtrips once we have determined that an operation changed the collection metadata.
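      The limitation of invalidating only on the issuer can be sketched as follows. Again this is a hypothetical toy model, not MongoDB internals: the router that drives the DDL already knows the resulting version and can install it in its own cache, while every other router still discovers the change lazily at the cost of a refresh roundtrip.

```python
class Router:
    """Toy model: only the router that issued the DDL learns the new
    version eagerly; every other router pays a refresh roundtrip on its
    first contact with the new metadata. Names are illustrative."""

    def __init__(self):
        self.cached_version = 1
        self.refresh_roundtrips = 0

    def issue_ddl(self, authoritative):
        # The issuer drives the metadata change, so it can install the
        # new version linearizably in its own cache.
        authoritative["version"] += 1
        self.cached_version = authoritative["version"]

    def route(self, authoritative):
        if self.cached_version < authoritative["version"]:
            # StaleConfig-style lazy discovery: one extra roundtrip.
            self.refresh_roundtrips += 1
            self.cached_version = authoritative["version"]


authoritative = {"version": 1}
issuer, other = Router(), Router()
issuer.issue_ddl(authoritative)
issuer.route(authoritative)  # no extra roundtrip: cache already current
other.route(authoritative)   # one extra roundtrip: discovered lazily
print(issuer.refresh_roundtrips, other.refresh_roundtrips)  # 0 1
```

      A general solution would need to cut the lazy-discovery roundtrips for the other routers as well, which is the open question the ticket raises.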

            Assignee:
            Marcos José Grillo Ramirez
            Reporter:
            Marcos José Grillo Ramirez
            Votes:
            0
            Watchers:
            2
