[SERVER-78071] A targeted query with a concurrent yield and chunk migration can miss results Created: 14/Jun/23  Updated: 26/Oct/23

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ben Shteinfeld Assignee: Backlog - Catalog and Routing
Resolution: Unresolved Votes: 0
Labels: oldshardingemea
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File shard_filter_repro.js    
Issue Links:
Depends
depends on SERVER-78724 Integrate acquisitions into aggregations Open
depends on SERVER-77507 Integrate acquisitions into Find Closed
Problem/Incident
is caused by SERVER-39191 Performance regression for counts pos... Closed
Related
related to SERVER-64128 Investigate behavior when query again... Open
Assigned Teams:
Catalog and Routing
Operating System: ALL
Participants:

 Description   

The classic engine performs an optimization which avoids introducing a shard filter stage if the query contains an equality predicate on the shard key. This has the potential to lead to incorrect query results in the following case:

1. A query with an equality on the shard key causes the planner to omit the shard filter stage.
2. The plan begins to run. A yield occurs.
3. During the yield, a chunk migration commits and the range deleter removes the now-orphaned chunk. This is possible because the shard filter stage is the object that owns the RangePreserver; since the stage was omitted, there is no RangePreserver.
4. The scan is restored, which reacquires the collection.
5. When the scan continues, the documents in the deleted chunk are gone, so the query misses results.

See attached file for more specific repro.
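
Separately from the attached repro, the sequence above can be approximated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyShard`, `runQuery`, `simulate`, and the `preservers` counter are hypothetical names and not server APIs, `a` is assumed to be the shard key, and the "yield" is modelled as a callback that runs after the first batch.

```javascript
// Toy model only: none of these names are MongoDB server APIs.
class ToyShard {
  constructor(docs) {
    this.docs = [...docs];     // documents physically stored on this shard
    this.preservers = 0;       // live holders standing in for a RangePreserver
    this.pendingDeletes = [];  // orphan predicates queued by migrations
  }

  // A migration commits: matching documents become orphans on this shard.
  migrateAway(isOrphan) {
    this.pendingDeletes.push(isOrphan);
    this.runRangeDeleter();
  }

  // The range deleter proceeds only when nothing is preserving the ranges.
  runRangeDeleter() {
    if (this.preservers > 0) return;
    for (const isOrphan of this.pendingDeletes) {
      this.docs = this.docs.filter((d) => !isOrphan(d));
    }
    this.pendingDeletes = [];
  }
}

// Scans for {a: value} in batches; onYield runs once, after the first batch.
function runQuery(shard, value, { holdPreserver, batchSize, onYield }) {
  if (holdPreserver) shard.preservers++;
  const results = [];
  let lastId = -Infinity;
  let yielded = false;
  for (;;) {
    // "Restore" the scan: re-read the collection from the saved position.
    const batch = shard.docs
      .filter((d) => d.a === value && d._id > lastId)
      .slice(0, batchSize);
    if (batch.length === 0) break;
    results.push(...batch);
    lastId = batch[batch.length - 1]._id;
    if (!yielded) {
      yielded = true;
      onYield();
    }
  }
  if (holdPreserver) {
    shard.preservers--;
    shard.runRangeDeleter();
  }
  return results;
}

// Six documents with a == 5; the chunk holding them migrates during the yield.
function simulate(holdPreserver) {
  const docs = [];
  for (let i = 0; i < 6; i++) docs.push({ _id: i, a: 5 });
  const shard = new ToyShard(docs);
  return runQuery(shard, 5, {
    holdPreserver,
    batchSize: 2,
    onYield: () => shard.migrateAway((d) => d.a === 5),
  }).length;
}

console.log(simulate(false)); // 2: the remaining documents were deleted mid-query
console.log(simulate(true));  // 6: deletion waited until the query finished
```

In the model, the only difference between the two runs is whether something holds the preserver across the yield; the predicate and the plan shape are identical.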

The problem is that the assumptions made about a shard's orphans during optimization might not hold across yield/restore cycles. The optimization which avoids unnecessary shard filtering is good in itself, but it is currently incorrect because we conflate shard filtering with range preservation.



 Comments   
Comment by Jordi Olivares Provencio [ 27/Jul/23 ]

This issue is also present in aggregations. Linking to SERVER-78724, as this issue depends on it.

To reproduce this, replace the reproducer's

find({a: 5}).batchSize(2)

with

aggregate([{$match: {a: 5}}], {cursor: {batchSize: 2}})

With acquisitions integrated into find, the behaviour is correct: the cursor pins the data, which prevents range deletion while the cursor is open.

Comment by David Storch [ 22/Jun/23 ]

joseph.kanaan@mongodb.com / query product team: I just had a long chat with kaloian.manassiev@mongodb.com and ben.shteinfeld@mongodb.com about this. The conclusion was that we believe it will be fixed by the work that Kal's team is doing for the "Shard Role API" project. Reassigning to Sharding EMEA so that once their project is done they can confirm and add an integration test.

The reason we think it will be fixed is that the new "acquire collection" API should be responsible for keeping the RangePreserver alive for the lifetime of the query regardless of whether the plan is doing orphan filtering.
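
The separation described here can be illustrated with a small toy model in which the acquisition, not the plan, owns range preservation. This is a sketch only: `ToyCollection`, `acquire`, `runPlan`, and `scheduleRangeDeletion` are hypothetical names and do not reflect the actual Shard Role API, and the yield is modelled as a callback invoked while the acquisition is held.

```javascript
// Toy model only: hypothetical names, not the Shard Role API.
class ToyCollection {
  constructor(docs) {
    this.docs = [...docs];
    this.holders = 0;          // live acquisitions pinning orphaned ranges
    this.pendingDeletes = [];  // orphan predicates queued by migrations
  }

  scheduleRangeDeletion(isOrphan) {
    this.pendingDeletes.push(isOrphan);
    this.drain();
  }

  drain() {
    if (this.holders > 0) return; // deletion waits for all acquisitions
    for (const isOrphan of this.pendingDeletes) {
      this.docs = this.docs.filter((d) => !isOrphan(d));
    }
    this.pendingDeletes = [];
  }
}

// Hypothetical stand-in for "acquire collection": pins ranges until release().
function acquire(coll) {
  coll.holders++;
  let released = false;
  return {
    release() {
      if (released) return;
      released = true;
      coll.holders--;
      coll.drain();
    },
  };
}

// The plan decides only whether to filter; it never owns range preservation.
function runPlan(coll, pred, { useShardFilter, ownsDoc, duringYield }) {
  const acquisition = acquire(coll);
  try {
    duringYield(); // the yield happens while the acquisition is still held
    return coll.docs.filter(
      (d) => pred(d) && (!useShardFilter || ownsDoc(d))
    );
  } finally {
    acquisition.release();
  }
}

const coll = new ToyCollection([
  { _id: 0, a: 5 },
  { _id: 1, a: 5 },
  { _id: 2, a: 7 },
]);
const found = runPlan(coll, (d) => d.a === 5, {
  useShardFilter: false, // the equality-on-shard-key optimization
  ownsDoc: () => true,
  duringYield: () => coll.scheduleRangeDeletion((d) => d.a === 5),
});
console.log(found.length);     // 2: both documents survive the yield
console.log(coll.docs.length); // 1: orphans are deleted after release
```

Because the pin is tied to the acquisition's lifetime, omitting the shard filter changes only what the plan filters, not whether orphaned ranges can be deleted underneath it.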
