Core Server / SERVER-64863

Interoperability issues between the VectorClock and the CatalogCache on secondary nodes of the CSRS

    • Sharding EMEA
    • Fully Compatible
    • ALL
    • Sharding EMEA 2023-07-10, Sharding EMEA 2023-07-24, Sharding EMEA 2023-08-07, Sharding EMEA 2023-08-21, Sharding EMEA 2023-09-04, Sharding EMEA 2023-09-18, Sharding EMEA 2023-10-02, Sharding EMEA 2023-10-16, CAR Team 2023-11-13, CAR Team 2023-11-27, CAR Team 2023-12-11, CAR Team 2024-02-19
    • 105
    • 2

      We found this problem in the CatalogCache, but I think it is a generic problem that affects any component relying on metadata stored on the config server.

      Note that the problem explained below only affects secondary nodes of the CSRS; everything is fine on the primary node.

      Let's assume that, for whatever reason, a secondary node of the CSRS receives a StaleShardVersion error from a shard, stating that the current shard version is X. That shard knew a configTime of T1, which is also part of the response, but it is not going to be gossiped in on the CSRS. As part of the StaleShardVersion error handling, we mark the routing info (i.e. the catalog cache) for that namespace as stale, expecting that the next time we need it we will perform a refresh. Note that it is the responsibility of the CSRS nodes to advance the configTime. Thus, the secondary node may still be at a configTime of T0 when it performs the refresh, and so it doesn't find version X, because on config server nodes the loader doesn't attach an $afterClusterTime.
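
      The scenario above can be illustrated with a small toy model in Python. This is only a sketch of the ordering problem, not server code: ToyConfigSecondary, read_version, replicate, and the numeric values chosen for T0, T1, and the versions are all hypothetical names and values picked for illustration. A read that would block on $afterClusterTime is modelled here as raising an error instead of waiting.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyConfigSecondary:
    """Hypothetical stand-in for a CSRS secondary (not real server code)."""
    applied_time: int  # latest config time this node has replicated
    shard_versions: Dict[str, int] = field(default_factory=dict)

    def read_version(self, ns: str,
                     after_cluster_time: Optional[int] = None) -> Optional[int]:
        # A time-bounded read would wait for replication to catch up;
        # the toy model raises instead of blocking.
        if after_cluster_time is not None and self.applied_time < after_cluster_time:
            raise TimeoutError("secondary has not reached the requested time")
        return self.shard_versions.get(ns)

    def replicate(self, time: int, ns: str, version: int) -> None:
        # Apply a metadata change that happened on the primary at `time`.
        self.applied_time = max(self.applied_time, time)
        self.shard_versions[ns] = version


# Assumed values: T0 < T1, and X is newer than the locally visible version.
T0, T1 = 10, 20
OLD_VERSION, X = 1, 2
NS = "db.coll"

secondary = ToyConfigSecondary(applied_time=T0, shard_versions={NS: OLD_VERSION})

# The shard reported version X at configTime T1, but T1 was not gossiped in.
# A refresh with no time bound reads whatever the secondary has at T0.
unbounded = secondary.read_version(NS)
print("unbounded refresh found X:", unbounded == X)

# A read bounded by T1 cannot return the stale data.
try:
    secondary.read_version(NS, after_cluster_time=T1)
    bounded_blocked = False
except TimeoutError:
    bounded_blocked = True
print("bounded read had to wait:", bounded_blocked)

# Once replication reaches T1, the bounded read sees version X.
secondary.replicate(T1, NS, X)
caught_up = secondary.read_version(NS, after_cluster_time=T1)
print("after catching up, found X:", caught_up == X)
```

      The point of the sketch is only that the unbounded read returns the old version even though the shard has already reported X. It does not claim anything about how the issue should be fixed.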

            jordi.olivares-provencio@mongodb.com Jordi Olivares Provencio
            sergi.mateo-bellido@mongodb.com Sergi Mateo Bellido