[SERVER-30359] Add a more generalized $hash expression Created: 26/Jul/17  Updated: 29/Nov/23

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Asya Kamsky Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 16
Labels: BIC, expression, pm1457-nominee
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-69128 Server-side document hashing Closed
is related to SERVER-49214 Add $toHashedIndexKey expression Closed
Assigned Teams:
Query Optimization
Sprint: Query 2019-08-12
Participants:
Case:

 Description   

We added $toHashedIndexKey in SERVER-49214, which solves some similar use cases as well as some of those described below in the original description and comments. This ticket remains open to add a more general hash expression, perhaps for more cryptographic use cases, or for cases where someone wants or needs a particular algorithm.

Original Description

Oracle has http://docs.oracle.com/database/121/SQLRF/functions183.htm#SQLRF55647 which computes one of several standard hash functions on a particular column.

It would be nice if there were an analogous expression in agg:

hashMD5:{$hash:{source:"$expression", function:"MD5"}}

or something like that.
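
To make the requested semantics concrete, here is a minimal client-side sketch in Python of what such an expression might compute. It is not a server implementation: the function name hash_value and its error handling are hypothetical, the "source"/"function" parameters mirror the proposed syntax above, and the algorithm list is the one suggested in the comments below. It only handles strings and raw bytes.

```python
import hashlib

# Algorithm names as proposed in the ticket comments, mapped to hashlib names.
SUPPORTED = {
    "MD5": "md5",
    "SHA1": "sha1",
    "SHA256": "sha256",
    "SHA384": "sha384",
    "SHA512": "sha512",
}


def hash_value(source, function="MD5"):
    """Return the hex digest of `source` using the named algorithm.

    Hypothetical helper: strings are hashed as UTF-8, bytes are hashed as-is.
    """
    try:
        algorithm = SUPPORTED[function.upper()]
    except KeyError:
        raise ValueError(f"unsupported hash function: {function!r}")

    if isinstance(source, str):
        data = source.encode("utf-8")
    elif isinstance(source, (bytes, bytearray)):
        data = bytes(source)
    else:
        raise TypeError("this sketch only hashes strings and bytes")

    return hashlib.new(algorithm, data).hexdigest()


if __name__ == "__main__":
    print(hash_value("abc", "MD5"))
    print(hash_value("abc", "SHA256"))
```

How non-string values (numbers, dates, arrays, subdocuments) would be turned into bytes is exactly the open design question discussed in the comments, so the sketch deliberately rejects them.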



 Comments   
Comment by Craven Huynh [ 29/Nov/23 ]

I am not very familiar with the server internals, but I think the difference between obtaining the hash of a document's raw bson vs a document's field is that the former would hash the bytes before they are marshaled into a C++ bson struct. The latter would hash a specific field of the already-marshaled C++ bson struct at which point we may no longer have access to the raw bson bytes.

I would like a hash of the raw bson bytes. This hash would be used as the checksum for a document. The server already provides a dbHash function that returns the MD5 checksum of each collection within a database; I would like a hash function that returns the MD5 checksum of a document. More specifically, I will use this function to get the individual MD5 of each document within a collection.

Comment by Asya Kamsky [ 29/Nov/23 ]

Is there a difference, since a field can be a subdocument, aka an object?

Comment by Craven Huynh [ 29/Nov/23 ]

Is the ask of this ticket strictly related to obtaining arbitrary $hash of certain fields or does it also encompass the md5 hash of the raw bson of a document?

Comment by Oleksii Petrov [ 15/Dec/21 ]

That'd be a great feature if added. I'm currently also working on a case where I would really appreciate a $hash projection of a compound $group key.

Comment by Christian Kurze (Inactive) [ 30/Aug/19 ]

asya Yes, technically speaking they are different. Semantically, they are the same. Can we use all the expressions of $project/$addFields? Then the different objects can be transformed into the same structure.

Comment by Asya Kamsky [ 30/Aug/19 ]

christian.kurze but {a:1,b:2} and {b:2,a:1} don't compare as equal as sub-objects, why would you expect them to get the same hash?

Comment by Christian Kurze (Inactive) [ 28/Aug/19 ]

This will also help in creating hashes to identify changed data (i.e. hash the values of a (sub)document and identify whether it has changed). This is needed for use cases where we want to store the history and not compare documents in the application (which is expensive, since the data must be transferred to the application, and requires a lot of application code).

We need to be careful in the case of subdocuments so that

{ a: 1, b: 2 }

gets the same hash as

{ b: 2, a: 1 }

The expression provided as the "source" attribute can take care of proper ordering, or of creating a concatenated string, for the use case of keeping historic data and versions.
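
The ordering concern can be illustrated with a small Python sketch. It assumes JSON serialization as a stand-in for whatever byte representation the server would hash (BSON is not used here), and the helper names ordered_hash and canonical_hash are hypothetical. Hashing the fields in their stored order gives different digests for the two subdocuments, while sorting the keys first gives the same digest.

```python
import hashlib
import json


def ordered_hash(doc, algorithm="sha256"):
    """Hash a document with its fields in their existing order."""
    serialized = json.dumps(doc, separators=(",", ":"))
    return hashlib.new(algorithm, serialized.encode("utf-8")).hexdigest()


def canonical_hash(doc, algorithm="sha256"):
    """Hash a document after sorting keys (recursively) into a canonical order."""
    serialized = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.new(algorithm, serialized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = {"a": 1, "b": 2}
    second = {"b": 2, "a": 1}
    # Field order affects the plain hash but not the canonical one.
    print(ordered_hash(first) == ordered_hash(second))
    print(canonical_hash(first) == canonical_hash(second))
```

Under this sketch the normalization lives outside the hash itself, which matches the suggestion that the "source" expression, rather than the hash function, be responsible for ordering.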

Comment by Pierre Bazoge [ 24/Jan/19 ]

This is a feature I really need at the moment. In an aggregation pipeline I use a computed key as a $group _id, and the key can be very long; a hash would fix that.

Thanks
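
As a rough client-side illustration of this use case (not how the server would do it), the Python sketch below groups rows by a fixed-length digest of a compound key instead of by the key itself. The helper names short_key and group_counts and the sample field names are hypothetical. Each key part is length-prefixed before hashing so that ("ab", "c") and ("a", "bc") do not collapse into the same digest.

```python
import hashlib
from collections import defaultdict


def short_key(parts):
    """Return a fixed-length SHA-256 hex digest for a sequence of key parts."""
    digest = hashlib.sha256()
    for part in parts:
        data = str(part).encode("utf-8")
        # Length prefix keeps part boundaries unambiguous.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def group_counts(rows, key_fields):
    """Count rows per compound key, using the digest as the group identifier."""
    counts = defaultdict(int)
    for row in rows:
        counts[short_key(row[field] for field in key_fields)] += 1
    return dict(counts)


if __name__ == "__main__":
    rows = [
        {"region": "ab", "sku": "c"},
        {"region": "a", "sku": "bc"},
        {"region": "ab", "sku": "c"},
    ]
    for key, count in group_counts(rows, ["region", "sku"]).items():
        print(key[:12], count)
```

The digest is always 64 hex characters regardless of how long the computed key is, at the cost of a (cryptographically negligible) collision risk.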

Comment by Asya Kamsky [ 28/Oct/18 ]

I don't see a reason not to allow that. In fact, that was one of the asks by someone who basically wanted to hash the full document content and then store the result in the document (so that later it can be tested whether the document content has changed at all).

 

Comment by Patrick Meredith [ 27/Oct/18 ]

Also, do we expect to be able to hash arrays and subdocuments? I'm going to go ahead and do it that way.

Comment by Patrick Meredith [ 22/Oct/18 ]

Expected algorithms are SHA1, SHA256, SHA384, SHA512, and MD5. Any others?

Generated at Thu Feb 08 04:23:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.