[SERVER-78997] Recreating a sharded timeseries collection as a sharded collection might lead to a read failure
Created: 14/Jul/23  Updated: 23/Jan/24

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0, 6.0.0, 6.3.2, 7.0.0-rc7
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Marcos José Grillo Ramirez
Assignee: Backlog - Catalog and Routing
Resolution: Unresolved
Votes: 0
Labels: not-7.0-blocker, shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File BF-29225_repro.patch    
Issue Links:
Depends
Problem/Incident
is caused by SERVER-60144 Handle stale routing info on mongos f... Closed
Related
is related to SERVER-80719 Review timeseries name disambiguation... Backlog
is related to SERVER-80715 New refine collection shard key might... Closed
Assigned Teams:
Catalog and Routing
Operating System: ALL
Steps To Reproduce:
  1. Apply the attached patch on top of revision 107f24c
  2. Compile the server
  3. Run the test using the sharding suite:

     python ./buildscripts/resmoke.py run --suites=sharding ./jstests/sharding/test_find_classical_with_concurrent_drop.js --log=file
    

Sprint: Sharding EMEA 2023-09-18, Sharding EMEA 2023-10-02, Sharding EMEA 2023-10-16, Sharding EMEA 2023-10-30, CAR Team 2023-12-25, CAR Team 2024-01-08, CAR Team 2024-01-22
Participants:
Linked BF Score: 10
Story Points: 2

 Description   

SERVER-60144 added functionality to check the shard version of a timeseries collection by assigning the bucket NamespaceString to the operation sharding state. However, in the read path the following interleaving might happen:

  1. A sharded timeseries collection with namespace db.coll is created successfully
  2. A read comes into the shard for the timeseries collection and the shard role is set in the OperationShardingState. Note that, because it is a timeseries collection, the namespace is transformed from db.coll to db.system.buckets.coll
  3. A drop collection and a shard collection are executed, successfully dropping the timeseries collection and creating a new sharded collection with the same namespace db.coll
  4. The read path continues, selecting a filtering phase
  5. When trying to obtain the ownership filter, we find that there is no shard version set for the collection db.coll in the operation context (it was set for db.system.buckets.coll in step 2), so we fail the command with a tassert

This causes the read to fail. Usually on the find path, once the collection is detected to be a view (timeseries collections are actually views), the read is transformed into an aggregation. However, because a new sharded collection was created in step 3, the read continues as a normal find.
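
To make the interleaving easier to follow, here is a minimal toy model of it in Python. It is only a sketch: `Catalog`, `OperationShardingState`, `run_read`, and `MissingShardVersion` are hypothetical stand-ins invented for illustration, not the server's actual classes or code paths, and the real tassert is modelled as a plain exception.

```python
# Toy model of the race. None of these names are real MongoDB server APIs.
BUCKETS_PREFIX = "system.buckets."


class MissingShardVersion(Exception):
    """Stands in for the tassert hit when no shard version is attached."""


def buckets_namespace(ns):
    """Translate db.coll into db.system.buckets.coll."""
    db, coll = ns.split(".", 1)
    return f"{db}.{BUCKETS_PREFIX}{coll}"


class Catalog:
    """Maps a user-facing namespace to the kind of collection behind it."""

    def __init__(self):
        self.kinds = {}

    def shard_collection(self, ns, timeseries=False):
        self.kinds[ns] = "timeseries" if timeseries else "regular"

    def drop(self, ns):
        self.kinds.pop(ns, None)


class OperationShardingState:
    """Per-operation record of which namespaces carry a shard version."""

    def __init__(self):
        self.versions = {}

    def set_shard_role(self, ns, version):
        self.versions[ns] = version

    def ownership_filter(self, ns):
        if ns not in self.versions:
            raise MissingShardVersion(f"no shard version set for {ns}")
        return f"filter({ns}@{self.versions[ns]})"


def resolve(catalog, ns):
    """Pick the namespace the read actually targets right now."""
    if catalog.kinds.get(ns) == "timeseries":
        return buckets_namespace(ns)
    return ns


def run_read(catalog, ns, concurrent_ddl=None):
    oss = OperationShardingState()
    # Step 2: the shard role is set on the namespace resolved at this moment.
    oss.set_shard_role(resolve(catalog, ns), version=1)
    # Step 3: a drop plus shardCollection may run in between.
    if concurrent_ddl is not None:
        concurrent_ddl()
    # Steps 4-5: the read re-resolves the namespace and asks for its filter.
    return oss.ownership_filter(resolve(catalog, ns))


def recreate_as_regular(catalog, ns):
    catalog.drop(ns)
    catalog.shard_collection(ns, timeseries=False)
```

Without the concurrent DDL, `run_read` finds the version it attached to db.system.buckets.coll. With `recreate_as_regular` running in the middle, the second resolution yields db.coll, for which no version was ever set, and the model raises `MissingShardVersion`.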

We should find a way to check the shard version in the read path, so that we can determine whether the collection still exists before acquiring the lock-free RAII type.
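
One possible shape for such a check, sketched in Python purely as an illustration: remember which incarnation of the collection the operation was versioned against, and re-validate it before continuing. The incarnation token, `StaleCollection`, `checked_read`, and `read_with_retry` are all assumptions made up for this sketch. It uses a collection identity token rather than the shard version itself, and it is not a proposal for the actual server change.

```python
import itertools


class StaleCollection(Exception):
    """Hypothetical retryable error: the collection changed under the read."""


class Catalog:
    """Toy catalog that gives every created collection a fresh incarnation."""

    def __init__(self):
        self._next = itertools.count(1)
        self._incarnations = {}

    def create(self, ns):
        self._incarnations[ns] = next(self._next)

    def drop(self, ns):
        self._incarnations.pop(ns, None)

    def incarnation_of(self, ns):
        return self._incarnations.get(ns)


def checked_read(catalog, ns, concurrent_ddl=None):
    # Captured at the point where the shard role would be set.
    seen = catalog.incarnation_of(ns)
    if concurrent_ddl is not None:
        concurrent_ddl()
    # Re-validate before going any further down the read path.
    if seen is None or catalog.incarnation_of(ns) != seen:
        raise StaleCollection(ns)
    return f"read {ns} (incarnation {seen})"


def read_with_retry(catalog, ns, concurrent_ddl=None, attempts=2):
    """Retry the read from the start when the collection was recreated."""
    for attempt in range(attempts):
        try:
            # Only the first attempt races with the DDL in this toy model.
            ddl = concurrent_ddl if attempt == 0 else None
            return checked_read(catalog, ns, ddl)
        except StaleCollection:
            continue
    raise StaleCollection(ns)
```

In this model a recreated collection surfaces as a retryable error instead of an invariant failure, and the retry sees the new collection from the start.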



 Comments   
Comment by Sergi Mateo Bellido [ 08/Nov/23 ]

I believe that if we do SERVER-80719 first, we might have a more general approach to tackle this problem.
