[SERVER-57860] $sampleFromRandomCursor should fall back to using a collection scan when the random cursor returns too many duplicates Created: 21/Jun/21 Updated: 06/Dec/22 Resolved: 28/Jun/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Arun Banala | Assignee: | Backlog - Query Execution |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Query Execution |
| Operating System: | ALL |
| Sprint: | Query Execution 2021-07-12 |
| Participants: | |
| Linked BF Score: | 19 |
| Description |
|
The current logic of $sampleFromRandomCursor is to throw a uassert when the random cursor returns too many (100) duplicate records. We should instead fall back to using a collection scan in this case. |
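For context, a simplified, self-contained model of the current behavior described above; the names here (DocId, sampleFromRandomCursor, the thrown exception) are stand-ins for this sketch, not the actual code in the server:

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <unordered_set>
#include <vector>

using DocId = long long;

// Duplicate-draw limit named in the ticket description.
constexpr int kMaxAttempts = 100;

// next() models the storage engine's random cursor: it returns the _id of
// the next randomly chosen document, or std::nullopt when exhausted.
template <typename NextFn>
std::vector<DocId> sampleFromRandomCursor(NextFn next, std::size_t sampleSize) {
    std::unordered_set<DocId> seenDocs;  // plays the role of _seenDocs
    std::vector<DocId> sample;
    int consecutiveDuplicates = 0;
    while (sample.size() < sampleSize) {
        if (consecutiveDuplicates >= kMaxAttempts) {
            // This is the uassert the ticket proposes replacing with a
            // fallback to a collection scan.
            throw std::runtime_error("random cursor returned too many duplicates");
        }
        std::optional<DocId> doc = next();
        if (!doc) {
            break;  // cursor exhausted before reaching the sample size
        }
        if (seenDocs.insert(*doc).second) {
            sample.push_back(*doc);
            consecutiveDuplicates = 0;
        } else {
            ++consecutiveDuplicates;
        }
    }
    return sample;
}
```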
| Comments |
| Comment by Kyle Suarez [ 28/Jun/21 ] |
|
Closing this ticket as Won't Do, as we would rather not implement the fallbacks mentioned here. They probably won't be helpful for resharding. |
| Comment by Max Hirschhorn [ 28/Jun/21 ] |
|
arun.banala, I feel like that would mean most documents in the collection are arbitrarily assigned a probability of zero, by virtue of not appearing at the beginning of the collection scan. That seems like more bias than $sample generally aims to allow. I had misunderstood and thought you were proposing falling back to the non-storage-engine-optimized case of scanning through the whole collection. |
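To make the bias concrete with illustrative numbers (not from the ticket): suppose N = 1,000, the random cursor produced K = 950 unique documents, and the collection holds 1,000,000 documents. Filling the remaining 50 slots from the front of a collection scan means only the first ~50 unseen documents in scan order can ever be chosen, so the other ~999,950 documents get probability 0 for those slots, instead of the roughly 50/1,000,000 = 0.005% chance a uniform sample would give each of them.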
| Comment by Arun Banala [ 28/Jun/21 ] |
|
kyle.suarez max.hirschhorn, I was thinking more along the lines of returning the first (N-K) documents that we find during a collection scan. Given that this scenario is unlikely to happen, I thought we could do something like this instead of failing. But we can definitely wait for the storage team to investigate what was happening in the failure case. |
| Comment by Kyle Suarez [ 24/Jun/21 ] |
|
ethan.zhang and arun.banala, based on this feedback from max.hirschhorn, I think we should close this ticket as Won't Fix and instead ask Storage Execution to prioritize the investigation into the linked failure. |
| Comment by Max Hirschhorn [ 24/Jun/21 ] |
|
ethan.zhang, with regard to resharding, I'm not sure this ticket is that desirable. I don't think it is practical to have reshardCollection scan through the entire collection to sample the data to choose new split points, and then scan through the entire collection a second time to clone the actual data. The motivation for choosing the new split points via $sample was the storage engine optimization. My preference would be for whichever team this ends up with to get to the bottom of why the storage engine's random cursor fails to find unique documents. |
| Comment by Ethan Zhang (Inactive) [ 24/Jun/21 ] |
|
max.hirschhorn, can you shed some light on the priority of this? |
| Comment by Ethan Zhang (Inactive) [ 22/Jun/21 ] |
|
We may need a flag to opt out of the fallback so that we don't make resharding performance even worse. This can be left as an implementation detail to be decided later. |
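One shape such an opt-out could take, sketched with an invented option name since the ticket deliberately leaves the mechanism open:

```cpp
// Hypothetical sketch only: the ticket defers the opt-out mechanism, so the
// option name below is invented for illustration and is not a real server
// knob or parameter.
struct SampleFallbackOptions {
    bool allowCollScanFallback = true;  // hypothetical opt-out flag
};

// e.g. the resharding code path could keep today's fail-fast behavior:
//   SampleFallbackOptions opts;
//   opts.allowCollScanFallback = false;
```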
| Comment by Arun Banala [ 21/Jun/21 ] |
|
If we have already returned K documents when we hit this code block, we could just open a new cursor and return the remaining (N-K) documents using a normal collection scan. We should also use _seenDocs to ensure that all of the (N-K) are new documents. |
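A minimal sketch of this idea, continuing the simplified model from the description above; collScanNext is an assumed stand-in for the newly opened collection-scan cursor, not a real API:

```cpp
#include <cstddef>
#include <optional>
#include <unordered_set>
#include <vector>

using DocId = long long;

// collScanNext() returns the _id of the next document from a freshly opened
// collection-scan cursor, or std::nullopt at the end of the collection.
template <typename NextFn>
void fillRemainderFromCollScan(NextFn collScanNext,
                               std::unordered_set<DocId>& seenDocs,  // _seenDocs
                               std::vector<DocId>& sample,           // the K docs so far
                               std::size_t targetSize) {             // N
    // Take documents in scan order, skipping anything the random cursor
    // already produced, until the sample holds N documents or the scan ends.
    while (sample.size() < targetSize) {
        std::optional<DocId> doc = collScanNext();
        if (!doc) {
            break;  // fewer than N distinct documents in the collection
        }
        if (seenDocs.insert(*doc).second) {
            sample.push_back(*doc);
        }
    }
}
```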