  1. Core Server
  2. SERVER-20385

$sample stage could not find a non-duplicate document while using a random cursor

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:

      Repro'd on Ubuntu 15.04 with a local build of mongod from source.

      Extract tarball to tmp3

      ./mongod --dbpath tmp3 --port 27009
      ./mongo --port 27009
      > use mongodb
      > db.fanclub.aggregate([{$sample: {size: 120}}])
      

      Run the .aggregate query a few times; the error showed up in fewer than 5 attempts (n < 5).

    • Sprint:
      Quint 9 09/18/15

      Description

      Running 3.1.8-pre (d03334dfa87386feef4b8331f0e183d80495808c)

      > db.fanclub.aggregate([{$sample: {size: 120}}])
      assert: command failed: {
              "ok" : 0,
              "errmsg" : "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again.",
              "code" : 28799
      } : aggregate failed
      _getErrorWithCode@src/mongo/shell/utils.js:23:13
      doassert@src/mongo/shell/assert.js:13:14
      assert.commandWorked@src/mongo/shell/assert.js:259:5
      DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1211:5
      @(shell):1:1
      

      It is indeed sporadic in my testing. Should the client ever see this message?

      I am able to reproduce this with --storageEngine=wiredTiger on a somewhat old set of files:

      $ less tmp/WiredTiger
      WiredTiger
      WiredTiger 2.5.1: (December 24, 2014)
      

      However, when I export/import that database into a new --dbpath, I am unable to repro:

      $ less tmp2/WiredTiger
      WiredTiger
      WiredTiger 2.6.2: (June  4, 2015)
      

      1. SERVER-20385-tmp3.tgz
        1.21 MB
        Matt Kangas

          Activity

          charlie.swanson Charlie Swanson added a comment - edited

          So I now realize that log message is missing a word. It should be "after 100 attempts".

          How many documents are in the collection being sampled? Were there any writes taking place at the time?

          This error message indicates that the document returned from WiredTiger's random cursor was identical (in terms of _id), 100 times in a row. There is not a graceful way to recover from this, so we decided to just propagate this up to the user and have them try again.
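Since the failure is propagated to the user with the advice to try again, a caller could wrap the aggregate call in a small retry helper. The following is only a minimal sketch: `retryOnSampleDuplicate`, `runSample`, and `maxRetries` are hypothetical names, and it assumes the error thrown by the shell carries the numeric `code` shown in the output above (28799).

```javascript
// Hypothetical helper, not part of the mongo shell.
// Code 28799 is taken from the error output in this ticket.
var SAMPLE_DUPLICATE_CODE = 28799;

// Calls runSample() and retries up to maxRetries more times, but only when
// the failure is the sporadic $sample duplicate error.
function retryOnSampleDuplicate(runSample, maxRetries) {
  var lastError;
  for (var attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return runSample();
    } catch (err) {
      // Assumption: the thrown error exposes the server error code as err.code.
      // Any other failure is rethrown unchanged.
      if (err.code !== SAMPLE_DUPLICATE_CODE) {
        throw err;
      }
      lastError = err;
    }
  }
  // Every attempt hit the duplicate error; surface the last one.
  throw lastError;
}

// Possible use in the shell (collection name from the repro steps):
// retryOnSampleDuplicate(function () {
//   return db.fanclub.aggregate([{$sample: {size: 120}}]).toArray();
// }, 3);
```

A retry only helps when the failure really is sporadic; if the random cursor cannot produce enough distinct documents at all, every attempt fails the same way.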

          Geert Bosch, I remember you encountered a similar problem when hooking up the random cursor, where WiredTiger always returned the same document. Do you remember what version that was fixed in?

          matt.kangas Matt Kangas (Inactive) added a comment -

          Charlie Swanson, there are 10k documents in the collection being sampled (see attached tarball). Zero writes were taking place at that time; the database was otherwise entirely idle.

          pasette Dan Pasette added a comment -

          Charlie Swanson, I believe the issue that you were asking about was WT-2032. Resolved in 3.1.7. I'm wondering if there is something peculiar with how the data is laid out. Would need Keith Bostic to take a look at the data files.

          geert.bosch Geert Bosch added a comment - edited

          I checked the dataset, and it seems the document count etc is valid.

          keith.bostic Keith Bostic added a comment -

          This is a WiredTiger problem, I've pushed a branch for review & merge. Apologies all around!

          xgen-internal-githook Githook User added a comment -

          Author: Keith Bostic (keithbostic) <keith@wiredtiger.com>

          Message: SERVER-20385: the original use case of WT_CURSOR.next(random) was to
          return a point in the tree for splitting the tree, and for that reason,
          once we found a random page, we always returned the first key on that
          page in order to make the split easy.

          In MongoDB: first, $sample de-duplicates the keys WiredTiger returns,
          that is, it ignores keys it's already returned; second, $sample allows
          you to set the sample size. If you specify a sample size greater than
          the number of leaf pages in the table, the de-duplication code catches
          us because we can't return more unique keys than the number of leaf
          pages in the table.

          Remove the code that returns the first key of the page, always return
          as random a key as we can.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/ba9fcca4b317965b590ce4e67442f1a68a218bbe
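
The argument in the commit message can be illustrated with a toy model. This is not WiredTiger or MongoDB code: every function name here is hypothetical, the figure of 50 leaf pages is an arbitrary choice for illustration (the ticket does not state the real page count), and uniform page selection is an assumption. Only the 10k document count, the sample size of 120, and the limit of 100 attempts come from the ticket.

```javascript
// Toy model of the behavior described in the commit message.
var MAX_ATTEMPTS = 100; // limit quoted in the error message

// Build numPages leaf pages, each holding keysPerPage consecutive integer keys.
function buildPages(numPages, keysPerPage) {
  var pages = [];
  for (var p = 0; p < numPages; p++) {
    var page = [];
    for (var k = 0; k < keysPerPage; k++) {
      page.push(p * keysPerPage + k);
    }
    pages.push(page);
  }
  return pages;
}

// Old behavior: pick a random page, always return its first key.
function firstKeyOfRandomPage(pages) {
  var page = pages[Math.floor(Math.random() * pages.length)];
  return page[0];
}

// Behavior after the fix (approximation): random page, then a random key on it.
function randomKeyOfRandomPage(pages) {
  var page = pages[Math.floor(Math.random() * pages.length)];
  return page[Math.floor(Math.random() * page.length)];
}

// De-duplicating sampler: skips keys it has already returned and gives up
// after MAX_ATTEMPTS consecutive duplicates.
function sample(pages, size, nextRandomKey) {
  var seen = {};
  var out = [];
  while (out.length < size) {
    var attempts = 0;
    var key = nextRandomKey(pages);
    while (seen[key]) {
      attempts++;
      if (attempts >= MAX_ATTEMPTS) {
        throw new Error("could not find a non-duplicate document after " +
                        MAX_ATTEMPTS + " attempts");
      }
      key = nextRandomKey(pages);
    }
    seen[key] = true;
    out.push(key);
  }
  return out;
}

var pages = buildPages(50, 200); // 50 pages x 200 keys = 10k documents
```

In this model, `sample(pages, 120, firstKeyOfRandomPage)` always throws, because at most 50 distinct keys can ever be returned, while `sample(pages, 120, randomKeyOfRandomPage)` draws from all 10k keys and is overwhelmingly likely to succeed.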

          xgen-internal-githook Githook User added a comment -

          Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

          Message: Merge pull request #2194 from wiredtiger/server-20385

          SERVER-20385: WT_CURSOR.next(random) more random
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/7505a02a52bc140acd0fcd81985c0e0ad2a78f7d

          matt.kangas Matt Kangas (Inactive) added a comment -

          Confirmed fixed per the repro above. Thanks!

