[SERVER-39057] Add distance expressions for image feature comparison Created: 16/Jan/19  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Kelsey Schubert Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 1
Labels: pull-request
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams:
Query Optimization
Participants:

 Description   

This ticket tracks the work contained in Pull Request #1291.

We added these expressions:

'$cossim', '$chi2', '$euclidean', '$squared_euclidean', '$manhattan'

These allow us to compare long vectors (image features) stored as arrays or BSON.
This is useful for finding the most similar images in a dataset. Usage is as follows:

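// `vector` holds the query image's feature vector and must be defined beforehand.
db.test_speed.aggregate([
    {
        '$project': {
            'id': '$id',
            'other_id': '$other_id',
            'distance': {'$cossim': [vector, '$vector']}
        }
    },
    // Cosine similarity: higher means more similar, so sort descending.
    {'$sort': {'distance': -1}},
    {'$limit': 20}
])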

In addition, implementations using AVX2 and AVX-512 are included in this pull request.
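
For reference, the five measures can be sketched in plain JavaScript as below. This is an illustration of the usual textbook definitions, not the code from the pull request: the function names are hypothetical, and the chi-squared convention (sum of (a-b)^2 / (a+b), skipping terms whose denominator is zero) is an assumption, since the ticket does not state which variant the PR uses.

```javascript
// Illustrative definitions only; NOT the pull request's implementation.
// All function names here are hypothetical.

function checkSameLength(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
}

// Sum of squared component differences.
function squaredEuclidean(a, b) {
  checkSameLength(a, b);
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Square root of the squared Euclidean distance.
function euclidean(a, b) {
  return Math.sqrt(squaredEuclidean(a, b));
}

// Sum of absolute component differences.
function manhattan(a, b) {
  checkSameLength(a, b);
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += Math.abs(a[i] - b[i]);
  }
  return sum;
}

// Dot product divided by the product of the norms.
// Yields NaN if either vector is all zeros.
function cossim(a, b) {
  checkSameLength(a, b);
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

// Assumed convention: sum of (a-b)^2 / (a+b), skipping zero denominators.
function chi2(a, b) {
  checkSameLength(a, b);
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const denom = a[i] + b[i];
    if (denom !== 0) {
      const d = a[i] - b[i];
      sum += (d * d) / denom;
    }
  }
  return sum;
}
```

Cosine similarity grows as vectors become more alike, so the most similar items come first in a descending sort; for the four distance measures, smaller values mean more similar, so an ascending sort would be used instead.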



 Comments   
Comment by Asya Kamsky [ 20/Sep/19 ]

Marc,

I'm sorry about the long delay in letting you know that, unfortunately, we will not be able to accept this pull request.

I'd like to outline a few reasons why this can't be merged:

The PR proposes several generic operations for computing the Euclidean, squared Euclidean, cosine similarity, Chi-squared and Manhattan distances between two N-dimensional vectors. Adding these particular vector operations would invariably produce subsequent requests to backfill more basic operations (vector addition, scalar times vector, dot product, etc.). Just considering distance measures, why those five in particular? SciPy provides a couple dozen, and OpenCV provides 4, albeit a different 4.

These types of functions might be a great addition to our analytics capabilities, but we feel they should only be added as part of a broader effort to add more computational operations. This should probably be specced out as part of a full slate of related functions, e.g. numeric vector and matrix operations.

Another related concern is the implementation of the new expressions: it is somewhat non-standard relative to existing expressions. For instance, the vectors themselves are passed in as raw float* with a separate parameter to indicate their length. In fact, it is likely that even for simple vectors we would want to add something that considers the best storage format, possibly a new type of array that contains elements of a single type, which is tracked in SERVER-9380.

Again, I apologize that it took me so long to get back to you on this, and thank you for your interest in contributing to MongoDB!

Asya Kamsky
