[SERVER-29446] $sample stage could not find a non-duplicate document while using a random cursor Created: 05/Jun/17 Updated: 28/Nov/23 Resolved: 01/Aug/23
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.4.4, 3.5.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Charlie Swanson | Assignee: | Backlog - Storage Engines Team |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | 3.7BackgroundTask, nonnyc, query-44-grooming, refinement, stability, storage-engines, todo_in_code |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Storage Engines |
| Operating System: | ALL |
| Backport Requested: | v4.2 |
| Steps To Reproduce: | Taken from Jesse's comment on SERVER-20385. With MongoDB 3.4.4 on Mac OS X, I can reproduce this. First run "python -m pip install pymongo pytz", then run a script along the lines of the sketch below.

As often as not, the sample fails ten times in a row with error code 28799 and the message: "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again." |
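The original reproduction script is not preserved in this export. The following is a minimal sketch of the shape of that workload, based on Keith Bostic's analysis later in the comments (one batch of roughly 10K documents inserted and then deleted, followed by a second batch of roughly 10K, before running $sample). It assumes a local mongod on the default port and the pymongo driver; the database and collection names are made up, and because the original procedure involved two separate script runs (so the first batch reaches disk before being deleted), this single-run version may not reproduce the failure as reliably.

```python
# Hedged reproduction sketch (not the original script). Assumes a local mongod
# on localhost:27017 and pymongo; names such as "test.sample_repro" are made up.
from pymongo import MongoClient
from pymongo.errors import OperationFailure

client = MongoClient()
coll = client.test.sample_repro
coll.drop()

# First batch: ~10K documents. In the original report this batch was written by
# a first run of the script, so it ended up in disk-backed pages.
coll.insert_many({"i": i} for i in range(10000))

# Second run: delete everything from the first batch, then insert a new batch
# with monotonically increasing RecordIds (all living in the insert list).
coll.delete_many({})
coll.insert_many({"i": i} for i in range(10000, 20000))

# Try $sample a number of times; error code 28799 is the failure described in
# this ticket.
for attempt in range(10):
    try:
        docs = list(coll.aggregate([{"$sample": {"size": 5}}]))
        print("attempt", attempt, "returned", len(docs), "documents")
    except OperationFailure as exc:
        print("attempt", attempt, "failed:", exc.code, exc)
```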
| Sprint: | Storage 2017-07-31, Storage Engines 2019-08-12, Storage Engines 2019-08-26, StorEng - 2023-08-08 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 19 |
| Description |
The error originally reported in
| Comments |
| Comment by Manan Shah [ 28/Nov/23 ] |

Thank you. I think at this point we have moved forward, but if there's any further issue we will let you know.
| Comment by Jie Chen [ 23/Nov/23 ] |

Hello from the storage engines team! We have recently made a fix for this through
| Comment by Garaudy Etienne [ 15/Nov/22 ] |

steve.kuhn@mongodb.com this is being hit by resharding testing, btw.
| Comment by Brian Lane [ 16/Aug/19 ] |

Hi manan@indeed.com,

We have not had much success in reproducing this issue on later versions of MongoDB. I am closing this specific issue, but if you are still encountering an error, feel free to create another issue and we will investigate.

-Brian
| Comment by Githook User [ 15/Aug/19 ] |

Author: {'name': 'samontea', 'email': 'merciers.merciers@gmail.com', 'username': 'samontea'}
Message: (cherry picked from commit a9c7a1be727e1d4e181fa66385174e87f09d9512)
| Comment by Githook User [ 01/Aug/19 ] |

Author: {'name': 'samontea', 'username': 'samontea', 'email': 'merciers.merciers@gmail.com'}
Message:
| Comment by Brian Lane [ 01/Aug/19 ] |

Hey samuel.mercier, keith.bostic has taken this back.

-Brian
| Comment by Keith Bostic (Inactive) [ 11/Jun/19 ] |

manan@indeed.com, I've been looking into this ticket, and I wanted to ask you for any details you can recall. When you noticed you were having "this exact same issue" on 3.4/3.6, can you confirm there were a large number of deleted records in the underlying collection? Or were you simply seeing the same error message listed in the ticket, so that we don't know what the data in the underlying collection looked like?

Thank you!
| Comment by Manan Shah [ 01/May/19 ] |

So we upgraded our MongoDB version in production to 3.6.9 and encountered this error again. However, the improvement now was that the mongos was able to return some 100K+ documents for the query before tripping on that error. The query asked for a sample size of 200K. In 3.4.14 the query would fail without returning a single document.

Worth noting that the parameter
| Comment by Manan Shah [ 30/Apr/19 ] |

Hi, we are having this exact same issue on version 3.4.14. Is this being fixed in 3.6 and higher versions? What do you recommend, @Team MongoDB?
| Comment by Keith Bostic (Inactive) [ 03/Aug/17 ] |

The original reason for the two techniques was that older versions of WiredTiger could create relatively unbalanced trees, which would cause "random" records to be selected entirely from a small part of the whole tree. Stepping a cursor through the tree fixed that problem. Current versions of WiredTiger are far less likely to create unbalanced trees; we haven't seen that problem in a long time.

The stepping algorithm is a little more clever in that it skips entire pages of records: that can often be done without actually reading the page from disk, so we can usually avoid bringing all of the tree's leaf pages into memory. Regardless, it's a slower technique. The problem in this ticket is a different one; I think we should fix that specific case by changing how WiredTiger handles leaf pages with large numbers of deleted items.
| Comment by Charlie Swanson [ 03/Aug/17 ] |

While investigating, keith.bostic informed me that there are actually two ways that we use WiredTiger cursors to get random samples of data - the $sample stage configures a cursor passing the "next_random" config option, but elsewhere we pass the "next_random_sample_size=<size>" config option when building "oplog stones". It was unclear why we used two different techniques, but I talked to max.hirschhorn and got some insight. I believe the difference between the two techniques can be roughly summarized as:

I don't know if this was a very intentional decision, but it does seem to make more sense to use "next_random" for the $sample stage. If the "next_random_sample_size=<size>" version does behave roughly how I described, then we could expect to scan roughly all the records in the average case, even if we were only asking for a very small number of documents. However, from the perspective of the oplog stones, it is much more important to get a good distribution than it is to be quick - and it's looking for enough samples that it probably makes sense to look at everything to generate a more representative sample. keith.bostic, hopefully this makes sense to you? I realize that it's not really justified right now, since in many cases the small sample size is basically ignored at this level and we will request 4MB of documents at a minimum. However, should we fix SERVER-23408, I would expect pretty good performance for a sample of size 10.
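As a rough illustration of the two configurations contrasted above, the sketch below opens one WiredTiger cursor with "next_random" alone and one with "next_random_sample_size" as well. This is a hypothetical standalone example assuming the WiredTiger Python bindings are available; the table name, contents, home directory, and sample size are invented, and it is not how the server wires these options up internally.

```python
# Hypothetical sketch using the WiredTiger Python bindings; the table name,
# data, home directory, and sample size are invented for illustration.
import os
import wiredtiger

os.makedirs("WT_HOME", exist_ok=True)
conn = wiredtiger.wiredtiger_open("WT_HOME", "create")
session = conn.open_session()
session.create("table:example", "key_format=S,value_format=S")

c = session.open_cursor("table:example")
for i in range(10000):
    c.set_key("key%05d" % i)
    c.set_value("value%05d" % i)
    c.insert()
c.close()

# $sample-style cursor: each next() call repositions the cursor at a random
# record, so repeated calls are independent and may repeat records.
rand = session.open_cursor("table:example", None, "next_random=true")
rand.next()
print("next_random picked:", rand.get_key())
rand.close()

# Oplog-stone-style cursor: with next_random_sample_size the tree is treated as
# roughly that many sections and the cursor walks one record per section, which
# gives a more even spread at the cost of touching more of the tree.
sampled = session.open_cursor(
    "table:example", None, "next_random=true,next_random_sample_size=20")
sampled.next()
print("next_random_sample_size picked:", sampled.get_key())
sampled.close()

conn.close()
```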
| Comment by Alexander Gorrod [ 02/Aug/17 ] |

Thanks for the explanation keith.bostic, it describes a case where we should do better with random cursor positioning.
| Comment by Keith Bostic (Inactive) [ 31/Jul/17 ] |

alexander.gorrod, I was part-way through figuring this one out, and charlie.swanson and I finished that investigation today.

In short, the first run of the script loads 10K records into a collection, then the second script run deletes all the records from the first run and loads a different set of 10K records (that is, the RecordId is monotonically incrementing across the two inserts). What we end up with is two disk-backed pages in memory, each with some number of disk-backed deleted records, and one of which has an insert list with 10K entries, none of which are deleted.

The random-cursor implementation always prefers disk-backed records to records in an insert list (probably because most tables are likely to have far more records in the disk-backed section than in the insert list); in other words, if a page has any disk-backed records, we select from them. In this case, all of the disk-backed records have been deleted, so we randomly select a deleted record, realize it's deleted and we can't return it, then do a cursor-next to the next visible record. That consistently returns the first record from the insert list, since it's the next record in the tree that's visible to us, regardless of which disk-backed record we started with.

Absent detecting large sets of deleted records in the WiredTiger btree random-cursor code, maybe we could configure all MongoDB sampling to use next_random_sample_size, which should at least make this better? Although I'd have to review that code to be sure; it might fail as well in some cases.
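To make the selection bias described above concrete, here is a small, purely illustrative Python model (not WiredTiger code, and not its actual data structures) of the rule Keith describes: pick a random disk-backed slot, and if the chosen record is deleted, step forward to the next visible record. With every disk-backed record deleted, every draw lands on the first insert-list entry, which is why $sample keeps seeing the same document.

```python
# Toy model of the page state Keith describes: all disk-backed records deleted,
# one large insert list that is entirely visible. Illustration only.
import random

disk_backed = [{"id": i, "deleted": True} for i in range(10000)]
insert_list = [{"id": i, "deleted": False} for i in range(10000, 20000)]

def random_visible_record():
    # The random-cursor rule: prefer disk-backed records whenever any exist.
    rec = random.choice(disk_backed)
    if not rec["deleted"]:
        return rec
    # Deleted: cursor-next to the next visible record. In this layout that is
    # always the first entry of the insert list, no matter where we started.
    return insert_list[0]

seen = {random_visible_record()["id"] for _ in range(100)}
print("distinct ids returned in 100 draws:", seen)   # prints {10000}
```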
| Comment by A. Jesse Jiryu Davis [ 31/Jul/17 ] |

I don't remember anymore why I hit this. I may have been playing with $sample in its early days for the sake of answering a community question, and I was surprised that in my simple case it failed so frequently.
| Comment by Charlie Swanson [ 11/Jul/17 ] |

keith.bostic, I finally took an initial pass, and I am going to send this over to you to look at. My initial investigation suggests we're hitting some sort of edge case within a WiredTiger btree where lots of entries are deleted, causing us to seek to the same record over and over again. My evidence for this so far is:

Let me know if there's anything I can do to aid the investigation. I got frustrated by SERVER-23408 during this, so I found it useful to do the following:
| Comment by Charlie Swanson [ 05/Jun/17 ] |

I can reproduce this locally on Linux as well, and against my development version of the server, a bit ahead of 3.5.7.