Client sessions are not causally consistent for sharded metadata


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Sharding
    • Catalog and Routing
    • Sharding EMEA 2021-11-29, Sharding EMEA 2021-12-13, Sharding EMEA 2021-12-27

      Original description:
      Gossiping the cluster time for causally consistent client sessions ensures causality for queries but it's not enough to ensure causality for sharded metadata operations.

      Scenario:
      * One client communicating with 2 routers A and B
      * Router A performs shardCollection on db.coll
      * Router B receives a query for db.coll, refreshes from the CSRS with nearest read preference and read concern with minimum last known config time
      * The CSRS secondary serving the refresh still didn't replicate the config.collections entry for db.coll
      * Router B treats db.coll as unsharded collection

      At some point, router B will learn that db.coll is sharded (e.g. after targeting the primary). However, gossiping the configOpTime and the topologyTime to the client would have allowed the second router to be causally consistent when refreshing with afterClusterTime >= last known configOpTime.
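      The scenario above can be sketched with a toy model (not actual server code; the class and logical opTimes are illustrative assumptions): a config "secondary" that lags the "primary" answers a nearest refresh with stale metadata, while a read constrained by a gossiped configTime as afterClusterTime cannot return the stale view.

      ```python
      # Toy model: logical opTimes are integers; the config "secondary" lags
      # the "primary" by not yet having applied the newest entries.

      class ConfigNode:
          def __init__(self):
              self.entries = {}     # collection name -> metadata
              self.op_time = 0      # last applied logical opTime

          def apply(self, op_time, name, meta):
              self.entries[name] = meta
              self.op_time = op_time

          def read(self, name, after_cluster_time=0):
              # A read with afterClusterTime must wait until the node has
              # applied at least that opTime; here we report staleness instead.
              if self.op_time < after_cluster_time:
                  raise RuntimeError("node too stale for afterClusterTime")
              return self.entries.get(name)

      primary = ConfigNode()
      secondary = ConfigNode()

      # Router A runs shardCollection: the primary applies the new
      # config.collections entry at opTime 5; the secondary hasn't replicated it.
      primary.apply(5, "db.coll", {"sharded": True})
      gossiped_config_time = 5   # what router A would gossip back to the client

      # Router B refreshes with readPreference=nearest and hits the lagging
      # secondary: db.coll looks unsharded. This is the bug.
      assert secondary.read("db.coll") is None

      # Had the client gossiped configTime back to router B, the refresh could
      # demand afterClusterTime >= 5, and the stale secondary could not answer.
      try:
          secondary.read("db.coll", after_cluster_time=gossiped_config_time)
      except RuntimeError:
          print("stale secondary rejected; wait or retry elsewhere")

      # Once replication catches up, the causally consistent read succeeds.
      secondary.apply(5, "db.coll", {"sharded": True})
      assert secondary.read("db.coll", after_cluster_time=5) == {"sharded": True}
      ```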


      Refined description:

      Gossiping the cluster time for causally consistent client sessions ensures causality for queries and writes, because these operations always terminate on a shard. However, there are some sharding metadata operations that don’t terminate on a shard and instead terminate on a router. From the point of view of an application using causal consistency, these operations might not appear to be causally consistent.

      These are commands that read sharding metadata from the config server (through the catalog cache or the CatalogClient) and then make a decision purely based on that metadata, without ever talking to a shard. Some of these code paths refresh or read with a non‑primary read preference (for example, nearest), which means a mongos can “refresh on every request” and still see a stale view of the metadata if it happens to talk to a lagging config secondary replica.

      Some examples of such commands are:

      • mergeChunks, split, clearJumbo: refresh from the catalog cache with readPreference=nearest and then validate shard key bounds or jumbo flags.
      • moveChunk: refreshes with readPreference=nearest and may then reject the request if it believes the collection is unsharded.
      • listShards: refreshes with readPreference=nearest and returns the shards it found. It can miss a recent topology change (e.g. a newly added shard).
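      The listShards case can be sketched with a toy model (illustrative only; the topologyTime values and helper are assumptions, not server code): a nearest read from a lagging config secondary misses a newly added shard, whereas a read that demanded afterClusterTime >= the gossiped topologyTime could not.

      ```python
      # Toy model: each config node's view carries the topologyTime it has
      # applied; addShard just advanced the primary's topologyTime to 7.

      gossiped_topology_time = 7   # what the addShard response would gossip

      primary = {"shards": ["shard0", "shard1"], "topology_time": 7}
      secondary = {"shards": ["shard0"], "topology_time": 6}   # lagging

      def list_shards(node, after_cluster_time=0):
          # A causally constrained read cannot be served by a node that has
          # not yet applied the requested topologyTime.
          if node["topology_time"] < after_cluster_time:
              raise RuntimeError("stale: cannot satisfy afterClusterTime")
          return node["shards"]

      # Plain nearest read from the lagging secondary misses shard1: the bug.
      assert list_shards(secondary) == ["shard0"]

      # With the gossiped topologyTime, the stale node is rejected instead.
      try:
          list_shards(secondary, gossiped_topology_time)
      except RuntimeError:
          pass

      assert list_shards(primary, gossiped_topology_time) == ["shard0", "shard1"]
      ```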

      Some ideas for fixing this are:
      a) Gossiping configTime and topologyTime between routers and clients, and having the client gossip them back to routers, so that the refresh can use the gossiped configTime/topologyTime as afterClusterTime. This is fairly invasive and would probably require changes to the client-router protocols, so it is likely overkill given the commands that are affected.
      b) Make these metadata operations always refresh using readPreference=primaryPreferred or primaryOnly. This way, they will always see the latest data and appear causally consistent. This is much simpler. Note that for it to be completely correct we likely need the “leader leases” work, to ensure that reads are directed to the true primary even in split-brain scenarios.
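      Option (b) can be sketched with a toy server-selection model (illustrative assumptions only, not driver or server code): with nearest the refresh may land on a lagging secondary, while primaryPreferred targets the primary whenever one is reachable and therefore sees the latest metadata.

      ```python
      # Toy model of read-preference-based node selection for a metadata refresh.

      class Node:
          def __init__(self, is_primary):
              self.is_primary = is_primary
              self.entries = {}     # collection name -> metadata

      def select_node(nodes, read_preference):
          # primaryPreferred: use the primary when one is reachable,
          # otherwise fall back to any other node.
          if read_preference == "primaryPreferred":
              for n in nodes:
                  if n.is_primary:
                      return n
          return nodes[0]   # "nearest": whichever node is closest (here, first)

      primary, secondary = Node(True), Node(False)
      primary.entries["db.coll"] = {"sharded": True}   # secondary hasn't replicated

      nodes = [secondary, primary]                     # the secondary is "nearest"

      # nearest can return the stale view; primaryPreferred cannot.
      assert select_node(nodes, "nearest").entries.get("db.coll") is None
      assert select_node(nodes, "primaryPreferred").entries.get("db.coll") == {"sharded": True}
      ```

      As the refined description notes, this is only fully correct with the "leader leases" work, since in a split-brain scenario a stale node may still believe it is the primary.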

            Assignee:
            Unassigned
            Reporter:
            Pierlauro Sciarelli
            Votes:
            0
            Watchers:
            9
