[SERVER-20385] $sample stage could not find a non-duplicate document while using a random cursor Created: 11/Sep/15 Updated: 05/Jun/17 Resolved: 03/Jun/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework, WiredTiger |
| Affects Version/s: | 3.1.7 |
| Fix Version/s: | 3.1.9 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matt Kangas | Assignee: | Keith Bostic (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | Repro'd on Ubuntu 15.04 with a local build of mongod from source. Extract the tarball to tmp3, then try the .aggregate query a few times (n < 5). |
| Sprint: | Quint 9 09/18/15 |
| Participants: | |
| Description |
Running 3.1.8-pre (d03334dfa87386feef4b8331f0e183d80495808c)
It is indeed sporadic in my testing. Should the client ever see this message? I am able to reproduce this with --storageEngine=wiredTiger on a somewhat old set of files:
However, when I export/import that database into a new --dbpath, I am unable to repro:
| Comments |
| Comment by Charlie Swanson [ 05/Jun/17 ] |
Opened
| Comment by A. Jesse Jiryu Davis [ 30/May/17 ] |
The error message is still missing a word after "100"; as Charlie said, it should be "100 attempts".
| Comment by A. Jesse Jiryu Davis [ 30/May/17 ] |
With MongoDB 3.4.4 on Mac OS X, I can reproduce this. First do "python -m pip install pymongo pytz", then:
As often as not, the sample fails ten times in a row with error code 28799 and the message: "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again."
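The actual repro script referenced above did not survive this export. Below is a hypothetical sketch of what such a PyMongo repro might look like; the collection name, document shape, and helper names are invented for illustration, while the `$sample` stage and error code 28799 come from the comments in this ticket.

```python
def make_sample_pipeline(size):
    """Build the aggregation pipeline for the repro: a lone $sample stage."""
    return [{"$sample": {"size": size}}]

def run_repro(coll, size=100, tries=10):
    """Run $sample repeatedly against `coll`, counting failures whose
    server error code is 28799 (the non-duplicate-document error)."""
    from pymongo.errors import OperationFailure
    failures = 0
    for _ in range(tries):
        try:
            list(coll.aggregate(make_sample_pipeline(size)))
        except OperationFailure as exc:
            if exc.code == 28799:
                failures += 1
            else:
                raise
    return failures
```

Against a live mongod one would insert some documents and call, e.g., `run_repro(MongoClient()["test"]["repro"])`; per the comment above, most of the ten attempts would fail with error 28799 on an affected build.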
| Comment by Mor [X] [ 05/Apr/17 ] |
I'm able to reproduce this issue on 3.2.12: the collection contains 1.1B documents, and trying to get a $sample of 1M keeps returning this error message (3/3 tries). The sample size is less than 1% of the collection size, so statistically speaking it should not be hard to get 1M unique documents. The sample works fine for a size of 1000.
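A quick back-of-the-envelope calculation (mine, not from the ticket) supports the statistical point above: if the random cursor sampled uniformly with replacement from n = 1.1 billion documents, drawing k = 1 million keys should yield only a few hundred duplicate draws, each trivially skipped, so a persistent failure suggests the cursor's distribution is far from uniform.

```python
import math

def expected_duplicates(n, k):
    """Expected number of duplicate draws when sampling k times uniformly
    (with replacement) from n items: k - n * (1 - (1 - 1/n)**k).
    Uses expm1/log1p to stay accurate for tiny 1/n."""
    expected_distinct = -n * math.expm1(k * math.log1p(-1.0 / n))
    return k - expected_distinct

# n = 1.1 billion documents, k = 1 million sampled keys, as in the comment above
print(round(expected_duplicates(1.1e9, 1e6)))  # roughly 450 out of 1,000,000 draws
```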
| Comment by Matt Kangas [ 18/Sep/15 ] |
Confirmed fixed per the repro above. Thanks!
| Comment by Githook User [ 16/Sep/15 ] |
Author: Michael Cahill <michael.cahill@mongodb.com> (michaelcahill)
Message: Merge pull request #2194 from wiredtiger/server-20385
| Comment by Githook User [ 16/Sep/15 ] |
Author: Keith Bostic <keith@wiredtiger.com> (keithbostic)
Message: In MongoDB: first, $sample de-duplicates the keys WiredTiger returns. Remove the code that returns the first key of the page, always return
| Comment by Keith Bostic (Inactive) [ 15/Sep/15 ] |
This is a WiredTiger problem; I've pushed a branch for review & merge. Apologies all around!
| Comment by Geert Bosch [ 15/Sep/15 ] |
I checked the dataset, and it seems the document count etc. is valid.
| Comment by Daniel Pasette (Inactive) [ 14/Sep/15 ] |
charlie.swanson, I believe the issue that you were asking about was
| Comment by Matt Kangas [ 14/Sep/15 ] |
charlie.swanson, there are 10k documents in the collection being sampled (see attached tarball). Zero writes were taking place at that time; the database was otherwise entirely idle.
| Comment by Charlie Swanson [ 14/Sep/15 ] |
So I now realize that log message is missing a word: it should be "after 100 attempts". How many documents are in the collection being sampled? Were there any writes taking place at the time?
This error message indicates that the document returned from WiredTiger's random cursor was identical (in terms of _id) 100 times in a row. There is no graceful way to recover from this, so we decided to just propagate the error up to the user and have them try again.
geert.bosch, I remember you encountered a similar problem when hooking up the random cursor, where WiredTiger always returned the same document. Do you remember what version that was fixed in?
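The behavior described in this comment (de-duplicate on _id, give up after 100 consecutive duplicates from the random cursor) can be sketched in miniature. This is an illustrative Python model, not the actual server code; only the threshold of 100 and the error text come from the ticket.

```python
import itertools
import random

class SampleError(Exception):
    """Stands in for the server's error code 28799."""

def sample_unique(random_cursor, size, max_consecutive_dups=100):
    """Draw `size` documents with distinct _ids from `random_cursor` (an
    iterator of documents), giving up after 100 consecutive duplicates,
    mirroring the behavior described in the comment above."""
    seen, out, dups = set(), [], 0
    for doc in random_cursor:
        if doc["_id"] in seen:
            dups += 1
            if dups >= max_consecutive_dups:
                raise SampleError(
                    "$sample stage could not find a non-duplicate document "
                    "after 100 attempts while using a random cursor")
            continue
        dups = 0
        seen.add(doc["_id"])
        out.append(doc)
        if len(out) == size:
            return out
    raise SampleError("cursor exhausted before reaching sample size")

# A healthy cursor yields varied _ids, and sampling succeeds:
healthy = ({"_id": random.randrange(10000)} for _ in itertools.count())
assert len(sample_unique(healthy, 5)) == 5

# A broken cursor (the bug in this ticket) repeats one document forever
# and trips the error after 100 consecutive duplicates:
broken = itertools.repeat({"_id": 42})
try:
    sample_unique(broken, 5)
except SampleError as exc:
    print(exc)
```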