Type: Task
Resolution: Unresolved
Priority: Unknown
Fix Version/s: None
Affects Version/s: None
Component/s: AI/ML, LangChain
Labels: None
Quarter: FY27Q3-candidate
Confidence Status: None
Assigned Teams: Python Drivers

Documentation Changes Summary:

1. What would you like to communicate to the user about this feature?
2. Would you like the user to see examples of the syntax and/or executable code and its output?
3. Which versions of the driver/connector does this apply to?


Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Link: None
Goal Name(s): None

Context

We received feedback from a customer that the HybridSearchRetriever does not implement best practices:

Your earlier information helped us analyze the high memory consumption during search. You pointed out that we are using $match after $search. We were not modifying anything on our end, only passing the required parameters to the MongoDBAtlasHybridSearchRetriever class. The query was built directly inside the MongoDB Atlas hybrid search retriever, and we had no control over it. I tried passing a few more parameters in the kwargs and pre-filter, but they did not appear in the actual query, and there was no way to modify the existing behavior of the hybrid search retriever. To address this, I implemented a new custom search retriever that inherits from the MongoDBAtlasHybridSearchRetriever class. We now use compound.filter in $search and have removed $match. With this new implementation, I can see that bytesRead has been significantly reduced. I am sharing the pipeline in the attached text file. I also made changes to the search index and am sharing those with you.
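To make the change described above concrete, the sketch below contrasts the two shapes of the text-search pipeline. The index name, the field names (text, category), and the filter value are illustrative assumptions; the customer's actual pipeline is in the attached file and is not reproduced here.

```python
# Illustrative only: index name, field names, and values are assumptions.
query = "hybrid search"

# Filter applied after $search: $search returns every textual match,
# and a separate $match stage then discards non-matching documents.
pipeline_with_match = [
    {"$search": {"index": "text_index", "text": {"query": query, "path": "text"}}},
    {"$match": {"category": "docs"}},
    {"$limit": 10},
]

# Filter moved inside $search via compound.filter: the filter is evaluated
# against the search index, so no trailing $match stage is needed.
pipeline_with_compound_filter = [
    {
        "$search": {
            "index": "text_index",
            "compound": {
                "must": [{"text": {"query": query, "path": "text"}}],
                "filter": [{"equals": {"path": "category", "value": "docs"}}],
            },
        }
    },
    {"$limit": 10},
]
```

Clauses under compound.filter restrict the result set without contributing to the relevance score, which is why they are a natural replacement for a post-search $match.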

Additional context:

I think the way the library does it is always using $match for the text search pipeline because it's easier for them to write the MQL once:

param post_filter: List[Dict[str, Any]] | None = None
(Optional) Pipeline of MongoDB aggregation stages for postprocessing.
param pre_filter: Dict[str, Any] | None = None
(Optional) Any MQL match expression [emphasis mine] comparing an indexed field

So yes, there would be a difference for the $vectorSearch pipeline, but for $search it's always going to end up with a $match stage

Slack discussion: (may no longer be available.) https://mongodb.slack.com/archives/C050RFSKQF7/p1761562852834629

Further Refinement

The customer has done a good job of describing how a user of HybridSearchRetriever would extend the filter to get more performant results for the text search stage. The current implementation was meant to be completely general. In practice, one would know what they want, and extending the class is what I expected users to do. In this case, I think we could add a kwarg, perhaps called search_filter, that is applied as an Atlas Search compound.filter within the $search stage. To use it, one would also need to index the fields being filtered in the search index.
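As a rough illustration of the index requirement, a search index definition covering both the searched field and one filter field might look like the following. The field names text and category are assumptions, as is the choice of the token type, which is what the equals operator expects for string values.

```python
# Hypothetical field names; adjust to the collection's schema.
search_index_definition = {
    "mappings": {
        "dynamic": False,
        "fields": {
            # Field queried by the text operator.
            "text": {"type": "string"},
            # Field used in compound.filter with the equals operator.
            "category": {"type": "token"},
        },
    }
}

# With PyMongo, a definition like this would typically be passed to
# collection.create_search_index(...) together with an index name.
# That call is omitted here because it needs a live Atlas cluster.
```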

The additional code would look something like this, adding to search_stage["$search"]:


    if search_filter:
        # Use compound with must + filter
        if not isinstance(search_filter, list):
            search_filter = [search_filter]

        search_stage["$search"]["compound"] = {
            "must": [text_clause],
            "filter": search_filter,
        }
    else:
        # Simple text query without compound
        search_stage["$search"].update(text_clause)

This does not introduce a breaking change. I recommend updating the documentation to use the search_filter whenever one is able to.
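For reference, here is a self-contained sketch of the same logic as a standalone function. The function name build_text_search_stage and its parameters are hypothetical and do not reflect the current langchain-mongodb API; only the branching on search_filter mirrors the snippet above.

```python
from typing import Any, Dict, List, Optional, Union

Clause = Dict[str, Any]


def build_text_search_stage(
    query: str,
    index_name: str,
    search_field: str,
    search_filter: Optional[Union[Clause, List[Clause]]] = None,
) -> Dict[str, Any]:
    """Build a $search stage, using compound.filter when a filter is given.

    Hypothetical helper for illustration; not part of langchain-mongodb.
    """
    text_clause: Clause = {"text": {"query": query, "path": search_field}}
    search_stage: Dict[str, Any] = {"$search": {"index": index_name}}

    if search_filter:
        # Accept a single clause or a list of clauses.
        if not isinstance(search_filter, list):
            search_filter = [search_filter]
        search_stage["$search"]["compound"] = {
            "must": [text_clause],
            "filter": search_filter,
        }
    else:
        # No filter: keep the plain text query, as today.
        search_stage["$search"].update(text_clause)
    return search_stage


# Usage: without a filter the stage is unchanged from the simple form.
plain = build_text_search_stage("atlas", "text_index", "text")
filtered = build_text_search_stage(
    "atlas",
    "text_index",
    "text",
    search_filter={"equals": {"path": "category", "value": "docs"}},
)
```

Because search_filter defaults to None and the unfiltered branch produces the same stage as before, existing callers would see no change in the generated pipeline.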

Definition of Done

Add the new compound.filter logic described above to the hybrid and text search retrievers.
Document that filtering within $search is the best practice, and that pre_filter should be avoided for text search because it is applied as a $match stage after $search. Link to our documentation on compound.filter and indexes.

Pitfalls

If possible, it would be great to infer, based on the search_filter kwarg, what indexes are required and automatically create them if they do not exist. In practice, we may have to live with good documentation of the process.
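A first step toward that inference could be to collect the field paths that a search_filter references. The sketch below is only an assumption about how this might be approached: the helper name filter_paths is hypothetical, it handles plain string paths and nested compound clauses only, and it does not determine which index field type each path needs.

```python
from typing import Any, Dict, List, Set, Union

Clause = Dict[str, Any]


def filter_paths(search_filter: Union[Clause, List[Clause]]) -> Set[str]:
    """Collect the string paths referenced by search filter clauses.

    Hypothetical helper: ignores wildcard and multi-analyzer path objects.
    """
    clauses = search_filter if isinstance(search_filter, list) else [search_filter]
    paths: Set[str] = set()
    for clause in clauses:
        for operator, spec in clause.items():
            if operator == "compound":
                # Recurse into must / mustNot / should / filter lists.
                for nested in spec.values():
                    if isinstance(nested, list):
                        paths |= filter_paths(nested)
            elif isinstance(spec, dict) and "path" in spec:
                path = spec["path"]
                candidates = path if isinstance(path, list) else [path]
                paths.update(p for p in candidates if isinstance(p, str))
    return paths


example_filter = [
    {"equals": {"path": "category", "value": "docs"}},
    {"compound": {"should": [{"range": {"path": "year", "gte": 2020}}]}},
]
required = filter_paths(example_filter)
```

The resulting set could be compared against the fields in the existing search index definition to warn about, or create, missing mappings.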

blocks

INTPYTHON-752 Test pymongo-search-utils with langchain-mongodb

Closed

is blocked by

INTPYTHON-681 Rewrite HybridSearch retriever(s) to use new $rankFusion operator

Backlog
