Core Server / SERVER-62700

The rename DDL violates some ShardServerCatalogCacheLoader constraints when the cached metadata collections are UUID-based

    • Type: Task
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Sharding
    • Sharding EMEA 2022-01-24

      When we enable long-name support, all config.cache.chunks.* collections are UUID-based. This is problematic for the rename DDL operation; let me describe the problem:

      • We have a collection "A" with UUID=42 that is going to be renamed to "B" (let's assume that "B" doesn't exist yet).
      • We acquire the critical section blocking reads/writes over both namespaces.
      • After locally renaming the collection, we commit those changes in the CSRS.
      • Once the operation is committed, we have two namespaces that, from the point of view of the ShardServerCatalogCacheLoader, refer to the same cached chunks collection: config.cache.chunks.42. This is problematic because, even though we are holding the critical section, we could still trigger a CatalogCache refresh. Then:
        • We could have a thread refreshing the namespace "A". In this case it will see that the collection was removed (i.e. no longer present in config.collections) and will spawn an async task to drop config.cache.chunks.42.
        • Let's assume that at the same time another thread is refreshing the namespace "B". This one is going to find it on the CSRS, check whether config.cache.chunks.42 exists, and then append the new metadata (if any) to that collection. This is problematic because we don't synchronize these reads with the drop task spawned before. We didn't have this problem before because each namespace had its own unique cached chunks collection, so two different namespaces could never share the same cached chunks collection.
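      The interleaving above can be sketched as a minimal check-then-act race. This is a hypothetical Python model, not the server's actual C++ code: names like `cache`, `drop_cached_chunks`, and the chunk strings are made up for illustration; only the shape of the race (two namespaces resolving to the same UUID-keyed cached collection, an unsynchronized drop racing an existence check plus append) mirrors the ticket.

```python
# Hypothetical model of the race described above; not the real
# ShardServerCatalogCacheLoader API.

# UUID-keyed cached-chunks store: one entry per collection UUID,
# so both "A" and "B" resolve to the same key after the rename.
cache = {"config.cache.chunks.42": ["chunk-0"]}

def drop_cached_chunks(coll):
    """Async task spawned by the refresh of "A": the namespace is gone
    from config.collections, so drop its cached chunks collection."""
    cache.pop(coll, None)

# Refresh of "B" resolves to the *same* UUID-based collection.
target = "config.cache.chunks.42"
exists = target in cache          # check happens first ...
drop_cached_chunks(target)        # ... the async drop runs in between ...
if exists:
    # ... and the append acts on a stale existence check, writing into
    # a collection the drop task believes it has just removed.
    cache.setdefault(target, []).append("chunk-1")

print(cache)  # {'config.cache.chunks.42': ['chunk-1']}
```

      With the old namespace-based naming, the two refreshes would have touched disjoint collections (one per namespace), so this check-then-act interleaving could not arise.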

      I spotted this problem some time ago (SERVER-58465) and thought that, because we were holding the critical section, it shouldn't be a problem. I was wrong: if the shard executes some "routing code" (i.e. gets the CatalogCache and performs refreshes), we can definitely hit it.

      rui.liu found this problem in this execution.

            Assignee: antonio.fuschetto@mongodb.com (Antonio Fuschetto)
            Reporter: sergi.mateo-bellido@mongodb.com (Sergi Mateo Bellido)
            Votes: 0
            Watchers: 4
