Type: Task
Resolution: Won't Fix
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Sharding
Labels:
- SSCCL-BUG

Sprint:
Sharding EMEA 2022-01-24
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

When long names support is enabled, all config.cache.chunks.* collections are UUID-based. This is problematic for the rename DDL operation. Let me describe the problem:

We have a collection "A" with UUID=42 that is going to be renamed to "B" (let's assume that "B" doesn't exist yet).
We acquire the critical section blocking reads/writes over both namespaces.
After locally renaming the collection, we commit those changes in the CSRS.
Once the operation is committed, two namespaces refer to the same cached chunks collection from the point of view of the ShardServerCatalogCacheLoader: config.cache.chunks.42. This is problematic because, even though we are holding the critical section, a CatalogCache refresh can still be triggered. Then:
- A thread could be refreshing the namespace "A". In this case it will see that the collection was removed (i.e. it is no longer present in config.collections) and will spawn an async task to drop config.cache.chunks.42.
- Let's assume that at the same time another thread is refreshing the namespace "B". This one finds the collection on the CSRS, checks whether config.cache.chunks.42 exists, and then appends the new metadata (if any) to that collection. This is problematic because the reads we perform on this collection are not synchronized with the drop task spawned earlier. We didn't have this problem before because each namespace had its own unique cached chunks collection, so two different namespaces could never share one.
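
To make the interleaving concrete, here is a minimal toy model of the scenario described above. It is only a sketch: ToyShard, refresh, run_pending_drops and simulate_rename_race are hypothetical names invented for illustration, not server code, and the model assumes one particular ordering (the async drop queued by the refresh of "A" runs after the refresh of "B" has appended its metadata).

```python
# Toy model of the race; every name here is hypothetical, not MongoDB server code.

def cache_collection_name(uuid):
    """UUID-based name of the cached chunks collection (long names support)."""
    return f"config.cache.chunks.{uuid}"


class ToyShard:
    def __init__(self):
        self.csrs_collections = {}  # namespace -> UUID, stand-in for config.collections
        self.local_cache = {}       # cached chunks collection name -> chunk entries
        self.pending_drops = []     # async drop tasks queued by refreshes

    def refresh(self, nss, known_uuid, new_chunks):
        """Refresh one namespace; known_uuid is what was cached for it before."""
        uuid = self.csrs_collections.get(nss)
        if uuid is None:
            # Namespace is gone from the CSRS: queue an async drop of its cache.
            if known_uuid is not None:
                self.pending_drops.append(cache_collection_name(known_uuid))
            return
        # Namespace exists: create the cache collection if needed, then append.
        self.local_cache.setdefault(cache_collection_name(uuid), []).extend(new_chunks)

    def run_pending_drops(self):
        """Run the queued drops, with no synchronization against refreshes."""
        for name in self.pending_drops:
            self.local_cache.pop(name, None)
        self.pending_drops.clear()


def simulate_rename_race():
    shard = ToyShard()
    shard.csrs_collections["db.A"] = 42
    shard.local_cache[cache_collection_name(42)] = ["chunk0"]

    # Rename committed on the CSRS: "A" disappears, "B" keeps UUID 42.
    del shard.csrs_collections["db.A"]
    shard.csrs_collections["db.B"] = 42

    shard.refresh("db.A", known_uuid=42, new_chunks=[])            # queues the drop
    shard.refresh("db.B", known_uuid=None, new_chunks=["chunk1"])  # appends metadata
    shard.run_pending_drops()                                      # drop lands last

    return shard.local_cache.get(cache_collection_name(42))
```

Under this assumed ordering, simulate_rename_race() returns None: the metadata just persisted for "B" is wiped by the drop that was scheduled on behalf of "A". With namespace-based cache collection names the two refreshes would touch different collections and the drop would be harmless.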

I spotted this problem some time ago (~~SERVER-58465~~) and thought that, because we were holding the critical section, it shouldn't be a problem. I was wrong: if the shard executes some "routing code" (i.e. we get the CatalogCache and do some refreshes), we can definitely hit it.

rui.liu found this problem in this execution.

Assignee: Antonio Fuschetto
Reporter: Sergi Mateo Bellido
Participants: Antonio Fuschetto, Pierlauro Sciarelli, Sergi Mateo Bellido
Votes: 0
Watchers: 4

Created: Jan 18 2022 01:25:53 PM UTC
Updated: Jun 17 2022 03:59:25 PM UTC
Resolved: Jan 24 2022 08:21:23 AM UTC
