[SERVER-29446] $sample stage could not find a non-duplicate document while using a random cursor Created: 05/Jun/17  Updated: 28/Nov/23  Resolved: 01/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.4, 3.5.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Charlie Swanson Assignee: Backlog - Storage Engines Team
Resolution: Duplicate Votes: 2
Labels: 3.7BackgroundTask, nonnyc, query-44-grooming, refinement, stability, storage-engines, todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
duplicates WT-8003 Fix frequent duplicate keys returned ... Closed
Related
related to SERVER-57860 $sampleFromRandomCursor should fallba... Closed
related to WT-8003 Fix frequent duplicate keys returned ... Closed
related to SERVER-43525 Complete TODO listed in SERVER-29446 Closed
is related to SERVER-20385 $sample stage could not find a non-du... Closed
is related to SERVER-42461 Allow errors in aggregates_during_bal... Closed
Assigned Teams:
Storage Engines
Operating System: ALL
Backport Requested:
v4.2
Steps To Reproduce:

Taken from Jesse's comment on SERVER-20385.

With MongoDB 3.4.4 on Mac OS X, I can reproduce this. First do "python -m pip install pymongo pytz", then:

from datetime import datetime, timedelta
 
import pytz
from bson import ObjectId
from pymongo import MongoClient
from pymongo.errors import OperationFailure
 
CHUNKS = 20
 
collection = MongoClient().db.test
collection.delete_many({})
 
start = datetime(2000, 1, 1, tzinfo=pytz.UTC)
# Insert 10K documents with monotonically increasing ObjectId timestamps.
for hour in range(10000):
    collection.insert_one(
        {'_id': ObjectId.from_datetime(start + timedelta(hours=hour)), 'x': 1})
 
for _ in range(10):
    try:
        docs = list(collection.aggregate([{
            "$sample": {"size": CHUNKS}
        }, {
            "$sort": {"_id": 1}
        }]))
    except OperationFailure as exc:
        if exc.code == 28799:
            # Work around https://jira.mongodb.org/browse/SERVER-20385
            print("retry")
            continue
 
        raise
 
    for d in docs:
        print(d['_id'].generation_time)
 
    break
else:
    raise OperationFailure("$sample failed")

As often as not, the sample fails ten times in a row with error code 28799 and the message: "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again."

Sprint: Storage 2017-07-31, Storage Engines 2019-08-12, Storage Engines 2019-08-26, StorEng - 2023-08-08
Participants:
Case:
Linked BF Score: 19

 Description   

The error originally reported in SERVER-20385 is marked as fixed in 3.1.9, but we are still able to reproduce it against version 3.4.4.



 Comments   
Comment by Manan Shah [ 28/Nov/23 ]

Thank you. I think at this point we have moved forward, but if there's any further issue we will let you know.

Comment by Jie Chen [ 23/Nov/23 ]

Hello from the storage engines team! We have recently made a fix for this through WT-11532. Please re-run the tests and let us know if the problem persists. Thank you for your patience with this issue.

Comment by Garaudy Etienne [ 15/Nov/22 ]

steve.kuhn@mongodb.com this is being hit by resharding testing btw.

Comment by Brian Lane [ 16/Aug/19 ]

Hi manan@indeed.com,

We have not had much success in reproducing this issue on later versions of MongoDB. I am closing this specific issue, but if you are still encountering an error, feel free to create another issue, and we will investigate.

-Brian

Comment by Githook User [ 15/Aug/19 ]

Author:

{'name': 'samontea', 'email': 'merciers.merciers@gmail.com', 'username': 'samontea'}

Message: SERVER-42461 Allow errors in aggregates_during_balancing.js pending work in SERVER-29446

(cherry picked from commit a9c7a1be727e1d4e181fa66385174e87f09d9512)
Branch: v4.2
https://github.com/mongodb/mongo/commit/3d9d70be9f5c824997e08a6feddd2792064aa621

Comment by Githook User [ 01/Aug/19 ]

Author:

{'name': 'samontea', 'username': 'samontea', 'email': 'merciers.merciers@gmail.com'}

Message: SERVER-42461 Allow errors in aggregates_during_balancing.js pending work in SERVER-29446
Branch: master
https://github.com/mongodb/mongo/commit/a9c7a1be727e1d4e181fa66385174e87f09d9512

Comment by Brian Lane [ 01/Aug/19 ]

Hey samuel.mercier,

keith.bostic has taken back SERVER-29446 and it's in our current sprint. We'll see how that progresses.

-Brian

Comment by Keith Bostic (Inactive) [ 11/Jun/19 ]

manan@indeed.com, I've been looking into this ticket, and I wanted to ask you for any details you can recall.

When you noticed you were having "this exact same issue" on 3.4/3.6, can you confirm there were a large number of deleted records in the underlying collection?

Or was it just that you were seeing the same error message as listed in the ticket, and we don't know what the data in the underlying collection looked like?

Thank you!

Comment by Manan Shah [ 01/May/19 ]

So we upgraded our MongoDB version in production to 3.6.9 and encountered this error there as well. However, the improvement now is that mongos was able to return some 100K+ documents for the query before tripping on that error. The query asked for a sample size of 200K. In 3.4.14 the query would fail without returning a single document.

2019-05-01T10:01:19.556-0500 E QUERY    [thread1] Error: getMore command failed: {
        "ok" : 0,
        "errmsg" : "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again.",
        "code" : 28799,
        "codeName" : "Location28799"
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
DBCommandCursor.prototype._runGetMoreCommand@src/mongo/shell/query.js:763:1
DBCommandCursor.prototype._hasNextUsingCommands@src/mongo/shell/query.js:791:9
DBCommandCursor.prototype.hasNext@src/mongo/shell/query.js:799:1
DBQuery.prototype.forEach@src/mongo/shell/query.js:500:12
@(shell):1:1
 

Worth noting that the {allowDiskUse: true} option does not have any effect on the query; whether I specify it or not, the results don't vary.

Comment by Manan Shah [ 30/Apr/19 ]

Hi

We are having this exact same issue on version 3.4.14.

Is this being fixed in 3.6 and higher versions? What do you recommend @Team MongoDB?

Comment by Keith Bostic (Inactive) [ 03/Aug/17 ]

charlie.swanson

> It was unclear why we used two different techniques.

The original reason for the two techniques was that older versions of WiredTiger could create relatively unbalanced trees, which would cause "random" records to be selected entirely from a small part of the whole tree. Stepping a cursor through the tree fixed that problem. Current versions of WiredTiger are far less likely to create unbalanced trees; we haven't seen that problem in a long time.

> I presume it does something like "include if rand() < sampleSize/totalNumRecords", but I really don't know.

The algorithm is a little more clever in that it skips entire pages of records: that can often be done without actually reading the page from disk, so we can usually avoid bringing all of the tree's leaf pages into memory. Regardless, it's a slower technique.

The problem in SERVER-29446 is that it's a tiny tree with a large number of deleted records.

I think we should fix that specific case by changing how WiredTiger handles leaf pages with large numbers of deleted items.

Comment by Charlie Swanson [ 03/Aug/17 ]

While investigating, keith.bostic informed me that there are actually two ways that we use WiredTiger cursors to get random samples of data - the $sample stage configures a cursor passing the "next_random" config option, but elsewhere we pass the "next_random_sample_size=<size>" config option when building "oplog stones". It was unclear why we used two different techniques, but I talked to max.hirschhorn and got some insight.

I believe the difference between the two techniques can be roughly summarized as:

  • "next_random" will repeatedly select a random document, independent of the previous cursor position
  • "next_random_sample_size=<size>" will use a technique that scans the entire collection or index structure, deciding whether to include a record as it reaches it. I presume it does something like "include if rand() < sampleSize/totalNumRecords", but I really don't know.

I don't know if this was a very intentional decision, but it does seem to make more sense to use "next_random" for the $sample stage. If the "next_random_sample_size=<size>" version does behave roughly how I described, then we could expect to scan roughly all the records in the average case, even if we were only asking for a very small number of documents. However, from the perspective of the oplog stones, it is much more important to get a good distribution than it is to be quick - and it's looking for enough samples that it probably makes sense to look at everything to generate a more representative sample.
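
For illustration only, here is a rough Python simulation of the two modes as summarized above (this is not WiredTiger's actual implementation, and the function names are invented):

import random

def sample_independent(records, size):
    # Roughly the "next_random" behaviour: each pick is independent of the
    # previous cursor position, so duplicates are possible.
    return [random.choice(records) for _ in range(size)]

def sample_single_pass(records, size):
    # Roughly the presumed "next_random_sample_size=<size>" behaviour: scan
    # every record once and include each with probability size/total.
    probability = size / len(records)
    return [r for r in records if random.random() < probability]

records = list(range(10000))
print(len(set(sample_independent(records, 20))))  # may be < 20 because of duplicates
print(len(sample_single_pass(records, 20)))       # roughly 20, but every record was visited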

keith.bostic hopefully this makes sense to you? I realize that it's not really justified right now, since in many cases the small sample size is basically ignored at this level, and we will request 4MB of documents at a minimum. However, should we fix SERVER-23408, I would expect pretty good performance for a sample of size 10.

Comment by Alexander Gorrod [ 02/Aug/17 ]

Thanks for the explanation keith.bostic; it describes a case where we should do better with random cursor positioning.

Comment by Keith Bostic (Inactive) [ 31/Jul/17 ]

alexander.gorrod, I was part-way through figuring this one out, and charlie.swanson and I finished that investigation today.

In short, the first run of the script loads 10K records into a collection; the second run then deletes all the records from the first run and loads a different set of 10K records (that is, the RecordId is monotonically incrementing across the two inserts). What we end up with is two disk-backed pages in memory, each with some number of disk-backed deleted records, and one of which has an insert list with 10K entries, no records of which are deleted.

The random-cursor implementation always prefers disk-backed records to records in an insert list (probably because most tables are more likely to have a larger number of records in the disk-backed section than in the insert list); in other words, if a page has any disk-backed records, we select from them.

In this case, all of the disk-backed records have been deleted, and so we randomly select a deleted record, realize it's deleted and we can't return it, then do a cursor-next to the next visible record. That consistently returns the first record from the insert list, since it's the next record in the tree that's visible to us, regardless of which disk-backed record we started with.
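
A toy Python model of that situation (purely illustrative, not the WiredTiger code; the names are invented) shows why the cursor keeps handing back the same document until $sample gives up with error 28799:

import random

# All disk-backed records on the page are deleted; the live records all sit
# in the insert list -- the layout described above.
disk_backed = [{'key': i, 'deleted': True} for i in range(500)]
insert_list = [{'key': 10000 + i, 'deleted': False} for i in range(10000)]

def toy_next_random():
    # Prefer the disk-backed section; every record there is deleted, so step
    # forward to the next visible record, which is the head of the insert list.
    pick = random.choice(disk_backed)
    if pick['deleted']:
        return insert_list[0]
    return pick

samples = [toy_next_random() for _ in range(100)]
print(len({s['key'] for s in samples}))  # 1 -- all 100 picks are the same document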

Absent detecting large sets of deleted records in the WiredTiger btree random-cursor code, maybe we could configure all MongoDB sampling to use next_random_sample_size, which should at least make this better? Although I'd have to review that code to be sure; it might fail as well in some cases.

Comment by A. Jesse Jiryu Davis [ 31/Jul/17 ]

I don't remember anymore why I hit this. I may have been playing with $sample in its early days for the sake of answering a community question, and I was surprised that in my simple case it failed so frequently.

Comment by Charlie Swanson [ 11/Jul/17 ]

keith.bostic I finally took an initial pass, and I am going to send this over to you to look at. My initial investigation suggests we're hitting some sort of edge case within a WiredTiger btree where lots of entries are deleted, causing us to seek to the same record over and over again. My evidence for this so far is:

  • If you change the reproduction script on line 11 to drop the collection instead of doing a delete_many({}), it no longer reproduces (see the snippet after this list)
  • After having some fun following the cursor in gdb, I got down into here inside btr_curnext.c, and saw a cbt->page_deleted_count of 395. I didn't catch it at the beginning to see where we started within that page, but that seemed high to me.
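
For reference, the change mentioned in the first bullet amounts to this one-line tweak to the repro script (illustrative; dropping the collection discards the old btree instead of leaving deleted records behind):

collection.drop()  # instead of: collection.delete_many({})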

Let me know if there's anything I can do to aid the investigation. I got frustrated by SERVER-23408 during this, so I found it useful to do the following:

// Disable batching within DocumentSourceCursor:
db.adminCommand({setParameter: 1, internalDocumentSourceCursorBatchSizeBytes: 1});
// Log when we see a duplicate document in the sample stage:
db.setLogLevel(1);

Comment by Charlie Swanson [ 05/Jun/17 ]

I can reproduce this locally on Linux as well, and against my development version of the server, a bit ahead of 3.5.7.

Generated at Thu Feb 08 04:20:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.