Core Server / SERVER-20385

$sample stage could not find a non-duplicate document while using a random cursor

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:
      Repro'd on Ubuntu 15.04 with a local build of mongod from source.

      Extract tarball to tmp3

      ./mongod --dbpath tmp3 --port 27009
      ./mongo --port 27009
      > use mongodb
      > db.fanclub.aggregate([{$sample: {size: 120}}])
      

      Run the .aggregate query a few times (fewer than 5 attempts were needed).

    • Sprint:
      Quint 9 09/18/15

      Description

      Running 3.1.8-pre (d03334dfa87386feef4b8331f0e183d80495808c)

      > db.fanclub.aggregate([{$sample: {size: 120}}])
      assert: command failed: {
              "ok" : 0,
              "errmsg" : "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again.",
              "code" : 28799
      } : aggregate failed
      _getErrorWithCode@src/mongo/shell/utils.js:23:13
      doassert@src/mongo/shell/assert.js:13:14
      assert.commandWorked@src/mongo/shell/assert.js:259:5
      DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1211:5
      @(shell):1:1
      

      It is indeed sporadic in my testing. Should the client ever see this message?

      I am able to reproduce this with --storageEngine=wiredTiger on a somewhat old set of files:

      $ less tmp/WiredTiger
      WiredTiger
      WiredTiger 2.5.1: (December 24, 2014)
      

      However, when I export/import that database into a new --dbpath, I am unable to repro:

      $ less tmp2/WiredTiger
      WiredTiger
      WiredTiger 2.6.2: (June  4, 2015)
      

      1. SERVER-20385-tmp3.tgz
        1.21 MB
        Matt Kangas

          Activity

          matt.kangas Matt Kangas (Inactive) added a comment -

          Confirmed fixed per the repro above. Thanks!

          Marmor Mor added a comment -

          I'm able to reproduce this issue on 3.2.12:

          The collection contains 1.1B documents; trying to get a $sample of 1M keeps returning this error message (3/3 tries).

          The sample size is less than 1% of the collection size, so statistically speaking it should not be hard to get 1M unique documents.

          The sample works ok for 1000.
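
The statistical argument above can be checked with a rough calculation. The sketch below assumes an idealized sampler that draws documents uniformly and independently (with replacement), and it assumes the limit of 100 in the error message applies to consecutive attempts for a single output document. Neither assumption is confirmed on this page, and the function names are illustrative only.

```python
import math


def expected_duplicate_draws(sample_size, collection_size):
    """Approximate number of colliding pairs among `sample_size` uniform
    draws with replacement (birthday-problem estimate: k*(k-1) / (2*N))."""
    return sample_size * (sample_size - 1) / (2 * collection_size)


def log10_prob_consecutive_duplicates(sample_size, collection_size, attempts=100):
    """Upper bound, as a base-10 logarithm, on the chance that `attempts`
    draws in a row all return an already-seen document.

    At most `sample_size` documents have been seen at any point, so each
    draw is a duplicate with probability at most sample_size / collection_size.
    """
    return attempts * math.log10(sample_size / collection_size)


# Figures from the comment above: 1M sampled out of 1.1B documents.
N = 1_100_000_000
k = 1_000_000

print("expected duplicate draws:", expected_duplicate_draws(k, N))
print("log10 P(100 duplicates in a row) <=", log10_prob_consecutive_duplicates(k, N))
```

Under these assumptions, a few hundred of the 1M draws would be expected to hit a duplicate and need a retry, but 100 duplicates in a row would be vanishingly unlikely. Repeated failures would therefore suggest that the random cursor's draws are far from uniform, rather than plain bad luck.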

          jesse A. Jesse Jiryu Davis added a comment -

          With MongoDB 3.4.4 on Mac OS X, I can reproduce this. First do "python -m pip install pymongo pytz", then:

          from datetime import datetime, timedelta
           
          import pytz
          from bson import ObjectId
          from pymongo import MongoClient
          from pymongo.errors import OperationFailure
           
          CHUNKS = 20
           
          collection = MongoClient().db.test
          collection.delete_many({})
           
          start = datetime(2000, 1, 1, tzinfo=pytz.UTC)
          for hour in range(10000):
              # insert() is deprecated in PyMongo 3.x and removed in 4.0; use insert_one().
              collection.insert_one(
                  {'_id': ObjectId.from_datetime(start + timedelta(hours=hour)), 'x': 1})
           
          for _ in range(10):
              try:
                  docs = list(collection.aggregate([{
                      "$sample": {"size": CHUNKS}
                  }, {
                      "$sort": {"_id": 1}
                  }]))
              except OperationFailure as exc:
                  if exc.code == 28799:
                      # Work around https://jira.mongodb.org/browse/SERVER-20385
                      print("retry")
                      continue
           
                  raise
           
              for d in docs:
                  print(d['_id'].generation_time)
           
              break
          else:
              raise OperationFailure("$sample failed")
          

          As often as not, the sample fails ten times in a row with error code 28799 and the message: "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again."

          jesse A. Jesse Jiryu Davis added a comment -

          The error message is still missing a word after "100". As Charlie said, it should read "100 attempts".

          charlie.swanson Charlie Swanson added a comment -

          Opened SERVER-29446 to investigate

