[SERVER-4608] aggregation: allow binary data to pass through pipelines
Created: 03/Jan/12  Updated: 24/Mar/17  Resolved: 11/Dec/12

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 2.3.2

Type: Bug
Priority: Major - P3
Reporter: Daniel Pasette (Inactive)
Assignee: Mathias Stearn
Resolution: Done
Votes: 19
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-5718 Code and CodeWScope should be able to... Closed
is related to SERVER-4638 issue with certain data types? Closed
is related to SERVER-4644 aggregation: optimize memory utilitz... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Support pass-through, $sort, and $group on Binary fields



 Comments   
Comment by auto [ 28/May/13 ]

Author: Asya Kamsky (username: asya999, email: asya999@gmail.com)

Message: 2.4 removed restriction on BINARY

See https://jira.mongodb.org/browse/SERVER-4608
Branch: master
https://github.com/mongodb/docs/commit/b477c883a87000b5ee6069e96de624ab8bff6030

Comment by auto [ 11/Dec/12 ]

Author: Mathias Stearn (email: mathias@10gen.com, date: 2012-11-29T19:54:48Z)

Message: Add at least minimal support for all types to agg

Minimal support means conversion to/from BSON, comparison and hashing.
This means that they can be passed through the pipeline correctly, used
in $sort, and used in _id expressions in $group.

SERVER-4608 - Binary pass through
SERVER-5718 - Code/CodeWScope pass through
SERVER-6470 - Don't convert Regex to String
SERVER-7185 - Symbol support
Branch: master
https://github.com/mongodb/mongo/commit/fefb4334afe40664438668a289c6daed6813b3c3
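
As a rough illustration of why comparison and hashing are enough for pass-through, $sort, and $group keys, the sketch below uses Python `bytes` as a stand-in for binary values. The `docs` list and its field names are made up for illustration. This is not the server's implementation, and Python's byte ordering is not claimed to match BSON's comparison order for binary data.

```python
from collections import Counter

# Hypothetical documents; bytes stands in for a binary (BinData) field.
docs = [
    {"key": b"\x02\x01", "n": 1},
    {"key": b"\x01\xff", "n": 2},
    {"key": b"\x02\x01", "n": 3},
]

# Pass-through analogue: values leave the "pipeline" unchanged.
passed = [dict(d) for d in docs]

# $sort analogue: only requires that the values can be ordered.
sorted_docs = sorted(docs, key=lambda d: d["key"])

# $group analogue with a count: only requires hashing and equality.
counts = Counter(d["key"] for d in docs)
```

Nothing here inspects the bytes beyond ordering, equality, and hashing, which is the sense in which that support is "minimal".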

Comment by Mathias Stearn [ 03/Dec/12 ]

Updating ticket to reflect fix. All BSON types will be supported regardless of type or size.

Comment by auto [ 12/Jul/12 ]

Author: Mathias Stearn (email: mathias@10gen.com, date: 2012-06-29T16:49:56-07:00)

Message: If there is an early simple $project, apply it before converting to Documents SERVER-4644

This is a partial fix for SERVER-4644 in that it only works with an
explicit $project and only if that project is supported by the existing
Projection class used to implement the second argument to find().

This also provides a workaround for objects with types that aren't
supported by the Value class (SERVER-4608, SERVER-5718, and SERVER-4968).
Previous behavior was to assert with no possible workaround.

This commit will need some doc updates, in particular in the "Optimizing
Performance" section.
Branch: master
https://github.com/mongodb/mongo/commit/c62b02c1dbc95d0ed1a57231298aa2d81dd84c39
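
A minimal sketch of the workaround described in this commit message, assuming a collection whose documents carry a binary `blob` field that the pipeline does not need. The field names are hypothetical, and `apply_inclusion` is only a stand-in written for this example to show the effect of a simple inclusion projection; it is not the server's Projection class.

```python
# Pipeline with an early, simple $project so unsupported fields are
# dropped before later stages see the document.
pipeline = [
    {"$project": {"status": 1}},
    {"$group": {"_id": "$status", "count": {"$sum": 1}}},
]


def apply_inclusion(doc, spec):
    """Keep _id plus the top-level fields marked 1 in spec (illustration only)."""
    keep = {"_id"} | {name for name, flag in spec.items() if flag == 1}
    return {name: value for name, value in doc.items() if name in keep}


# Hypothetical input document with a binary field.
doc = {"_id": 1, "status": "ok", "blob": b"\x00\x01"}
projected = apply_inclusion(doc, pipeline[0]["$project"])
```

After the projection, the binary field is gone, so the grouping stage only ever sees `_id` and `status`.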

Comment by Chris Westin [ 01/Jun/12 ]

@Paul van Brouwershaven: Yes, and I'm currently working on SERVER-4644 to do exactly that.

Comment by Paul van Brouwershaven [ 01/Jun/12 ]

The problem is that you can't use aggregation on a collection that contains even a few binary objects. I'm not interested in the binary objects for the aggregation; I just want to use the $group functionality, and I can't simply delete these binary objects from the document collection.

In a simple group aggregation you only use a count and an identifier object, so for this query you are not interested in any other objects. The binary object should only be a problem if it is your identifier (group by) or if you want to do something else with it.

I'm probably thinking too simply, but shouldn't fields that are not used in a query be ignored?

Comment by Chris Westin [ 27/Apr/12 ]

@Victor Kabdebon: Thanks, I'll take a look at what you've got. Right now we're trying to lock down 2.2, so I'm not sure if this will make it in or not, but we should have something soon. We may also rely on a combination of features such as those discussed above.

Comment by Chris Westin [ 27/Apr/12 ]

@Mathias: separate ticket please, marked related. I suspect I'm more likely to rely on SERVER-4644 or the dummy value solution for longs, which would be different than what you're suggesting. So let's keep them separate.

Comment by Mathias Stearn [ 25/Apr/12 ]

There is also an issue with functions (codeWScope to be specific). I ran into it while trying to run a pivot aggregation on a sampling of db.currentOp() runs. It would be nice if you could pass small functions through, or perhaps limit it to just the function name and signature. Do you want me to make a separate ticket for that or would it be handled the same as this one?

Comment by Victor Kabdebon [ 23/Apr/12 ]

@Chris: Hi Chris, using local information such as the subtype, I wrote a temporary fix for this problem and submitted it as a pull request on GitHub (see [1]). The problem is that all the clients I am using (C# and Python) convert any UUID given to them to a binary array. UUID is an identifier standard that is used everywhere, so this prevents the use of pipelines everywhere.
Can't we let the user choose what the pipeline does with BinData: a safe mode where all BinData values are replaced by dummy values, and an unsafe mode where MongoDB tries to find the best possible policy based on local information, with no guarantee of safety?

[1] My attempt to fix this is located here:
https://github.com/mongodb/mongo/pull/212

Best.

Comment by Chris Westin [ 22/Mar/12 ]

@Mathias: in this case, subtype is serving as a rough proxy for size. We can't just pass or not pass documents because of their size, because this would give seemingly random and incorrect results. We have to have some kind of rule to either always do it, or never do it, depending on the locally available information.

Given the schema-less nature of MongoDB, I suppose that the subtypes could vary from document to document anyway, and give the same (random, incorrect) result.

I'm increasingly liking your other suggestion of using a dummy value that causes errors if it is referenced or makes it all the way to the end of the pipeline. That may be the best way out of this, other than SERVER-4644, but it may be messy handling that in a bunch of places. I'll think about that some more.
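
The dummy-value idea can be sketched roughly as follows. All names here (`UnsupportedValue`, `read_field`, `finish`) are hypothetical and exist only for this illustration; they are not taken from the server code.

```python
class UnsupportedValue:
    """Hypothetical placeholder for a field the pipeline cannot represent."""

    def __init__(self, field):
        self.field = field


def read_field(doc, field):
    """Return a field's value, raising if a stage references a placeholder."""
    value = doc[field]
    if isinstance(value, UnsupportedValue):
        raise TypeError(f"field {field!r} has an unsupported type")
    return value


def finish(doc):
    """Last pipeline step: raise if any placeholder reached the output."""
    for field, value in doc.items():
        if isinstance(value, UnsupportedValue):
            raise TypeError(f"field {field!r} has an unsupported type")
    return doc


doc = {"status": "ok", "blob": UnsupportedValue("blob")}

# Stages that never touch "blob" behave as usual.
status = read_field(doc, "status")

# Dropping "blob" before the end lets the document reach the output.
projected = finish({"status": doc["status"]})
```

The cost mentioned above shows up in the sketch too: every place that reads or emits a value needs the placeholder check.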

Comment by Paul Sanchez [ 21/Mar/12 ]

I suppose either allowing Subtypes 3, 4, and 5, or anything that is either up to or exactly 16 bytes, or hell even a combination of both, would work for me.

Comment by Mathias Stearn [ 21/Mar/12 ]

If you are going to do this, it should be based on size, not subtype. There is no guarantee that anything with a UUID subtype must be exactly 16 bytes. Equivalently there is no good reason not to pass a 4-byte binary string through.
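
To make the difference between the two rules concrete, here is a small sketch. The `BinData` class and the 16-byte limit are assumptions made for this example. The subtype numbers follow the BSON specification (3 is the old UUID subtype, 4 is UUID, 5 is MD5).

```python
from dataclasses import dataclass

ALLOWED_SUBTYPES = {3, 4, 5}  # UUID (old), UUID, MD5
MAX_PASSTHROUGH_BYTES = 16    # hypothetical size limit


@dataclass(frozen=True)
class BinData:
    """Minimal stand-in for a BSON binary value (illustration only)."""
    subtype: int
    data: bytes


def allowed_by_subtype(value):
    """Subtype rule: trusts the subtype tag and ignores the actual size."""
    return value.subtype in ALLOWED_SUBTYPES


def allowed_by_size(value):
    """Size rule: looks only at how many bytes are carried."""
    return len(value.data) <= MAX_PASSTHROUGH_BYTES


# Tagged as a UUID but far larger than 16 bytes.
oversized = BinData(subtype=4, data=b"\x00" * 1024)
# Generic binary (subtype 0) that is only 4 bytes long.
small = BinData(subtype=0, data=b"\x01\x02\x03\x04")
```

Here `oversized` passes the subtype rule but fails the size rule, while `small` does the opposite.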

Comment by Chris Westin [ 01/Mar/12 ]

I've seen a few reports on GG of this being a problem for folks trying to pass UUIDs and MD5s through pipelines in order to get their primary keys out at the other end, as per FREE-5540.

I disagree about grouping on any binary type, because the unbounded ones will consume a lot of memory to pass through the pipelines. However, we should at least support the smaller bounded types described above in the near-term.

Comment by auto [ 06/Jan/12 ]

Author: U-tellus\cwestin (login: cwestin, email: cwestin@10gen.com)

Message: prep for SERVER-4608
Branch: master
https://github.com/mongodb/mongo/commit/59c3247ac1da22024edfcb784e9224bf03ac0e5c

Comment by Eliot Horowitz (Inactive) [ 04/Jan/12 ]

There is no reason we shouldn't support group on any bindata type.

Comment by Chris Westin [ 03/Jan/12 ]

Suggested by Scott.
