[DOCS-9120] Explain that using the aggregation pipeline instead of collection.distinct() is a better equivalent to SQL "SELECT DISTINCT x FROM .." Created: 09/Oct/16  Updated: 30/Oct/23  Resolved: 31/Jan/18

Status: Closed
Project: Documentation
Component/s: manual
Affects Version/s: None
Fix Version/s: Server_Docs_20231030

Type: Task Priority: Major - P3
Reporter: Akira Kurogane Assignee: Allison Reinheimer Moore
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:
Days since reply: 6 years, 2 weeks ago
Story Points: 0.25

 Description   

A customer who didn't know about or aggregation pipelines, or knew about it but maybe saw it as being too powerful, was trying to do the equivalent of "SELECT DISTINCT field1 FROM ...".

The documentation he used was https://docs.mongodb.com/manual/reference/method/db.collection.distinct/. I suspect he may have also read https://docs.mongodb.com/manual/reference/sql-comparison/, which gives the example of "SELECT DISTINCT(status) FROM users" == "db.users.distinct( "status" )"

He ran into the problem that he had hundreds of thousands of distinct values in the collection he was examining and this lead to "BSONObj size: 25448908 (0x18451CC) is invalid. Size must be between 0 and 16793600(16MB)" when he tried to use the distinct command. That is the distinct values weren't a small set, so the result size exceeded 16MB, so the packing of array result value threw an exception when it reached max BSON object size.

Neither the db.collection.distinct page or https://docs.mongodb.com/manual/reference/command/distinct/#dbcmd.distinct for the underlying db command highlight that the single array result means a 16mb result size limit. Nor do they explain to avoid that and get a cursor instead (like a find command, which is the equivalent of the SQL SELECT command) you should use an aggregation pipeline per the example at https://docs.mongodb.com/manual/reference/operator/aggregation/group/#retrieve-distinct-values.

I think it's time we start describing the distinct command / wrapper method as a convenience method with limited result size and point to the aggregation pipeline method ( "db.myCollection.aggregate( [ { $group : { _id : "$myField" } } ] )" ) as the fundamental way of fetching distinct values.



 Comments   
Comment by Githook User [ 01/Feb/18 ]

Author:

{'email': 'allison.moore@10gen.com', 'name': 'Allison Moore', 'username': 'schmalliso'}

Message: DOCS-9120: emphasize BSON limit on distinct, propose agg alternative
Branch: master
https://github.com/mongodb/docs/commit/00a3e5b99ccc47afced56c65685a087e7349b6eb

Comment by Akira Kurogane [ 30/Jan/18 ]

Thanks Allison. Please excuse the nitpicking. Code review LGTM'ed just now.

Comment by Akira Kurogane [ 26/Jan/18 ]

Hi allison.moore. Code review update for https://mongodbcr.appspot.com/177940001/ made.

Generated at Thu Feb 08 07:57:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.