Uploaded image for project: 'Documentation'
  1. Documentation
  2. DOCS-9120

Explain that using the aggregation pipeline instead of collection.distinct() is a better equivalent to SQL "SELECT DISTINCT x FROM .."

    XMLWordPrintable

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: manual
    • Labels:
      None
    • # Replies:
      3
    • Last comment by Customer:
      true
    • Story Points:
      0.25
    • Sprint:
      KANBAN BUCKET

      Description

      A customer who didn't know about or aggregation pipelines, or knew about it but maybe saw it as being too powerful, was trying to do the equivalent of "SELECT DISTINCT field1 FROM ...".

      The documentation he used was https://docs.mongodb.com/manual/reference/method/db.collection.distinct/. I suspect he may have also read https://docs.mongodb.com/manual/reference/sql-comparison/, which gives the example of "SELECT DISTINCT(status) FROM users" == "db.users.distinct( "status" )"

      He ran into the problem that he had hundreds of thousands of distinct values in the collection he was examining and this lead to "BSONObj size: 25448908 (0x18451CC) is invalid. Size must be between 0 and 16793600(16MB)" when he tried to use the distinct command. That is the distinct values weren't a small set, so the result size exceeded 16MB, so the packing of array result value threw an exception when it reached max BSON object size.

      Neither the db.collection.distinct page or https://docs.mongodb.com/manual/reference/command/distinct/#dbcmd.distinct for the underlying db command highlight that the single array result means a 16mb result size limit. Nor do they explain to avoid that and get a cursor instead (like a find command, which is the equivalent of the SQL SELECT command) you should use an aggregation pipeline per the example at https://docs.mongodb.com/manual/reference/operator/aggregation/group/#retrieve-distinct-values.

      I think it's time we start describing the distinct command / wrapper method as a convenience method with limited result size and point to the aggregation pipeline method ( "db.myCollection.aggregate( [ { $group : { _id : "$myField" } } ] )" ) as the fundamental way of fetching distinct values.

        Attachments

          Activity

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:
                Days since reply:
                1 year, 50 weeks, 4 days ago
                Date of 1st Reply: