Type: Task
Resolution: Unresolved
Priority: Unknown
Affects Version/s: None
Context
We received feedback from a customer that the HybridSearchRetriever does not implement best practices.
Your earlier information helped us analyze the high memory consumption during search. You pointed out that we are using $match after $search. We were not modifying anything on our end, only passing the required parameters to the MongoDBAtlasHybridSearchRetriever class. The query was built directly inside the MongoDB Atlas hybrid search retriever, and we had no control over it. I tried passing a few more parameters through kwargs and pre_filter, but they did not appear in the actual query, and there was no way to modify the existing behavior of the hybrid search retriever.
To address this, I implemented a new custom search retriever that inherits from the MongoDBAtlasHybridSearchRetriever class. We now use compound.filter in $search and have removed $match. With this new implementation, I can see that bytesRead has been significantly reduced. I am sharing the pipeline in the attached text file. I also made changes to the search index and am sharing those with you as well.
Additional context:
I think the library always uses $match for the text search pipeline because that lets them write the MQL once:
param post_filter: List[Dict[str, Any]] | None = None
(Optional) Pipeline of MongoDB aggregation stages for postprocessing.
param pre_filter: Dict[str, Any] | None = None
(Optional) Any MQL match expression [emphasis mine] comparing an indexed field
So yes, there would be a difference for the $vectorSearch pipeline, but for $search it always ends up as a $match stage.
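To make the difference concrete, here is a minimal sketch of the two text-search pipeline shapes discussed above. It is not the library's actual code: the index name, field names, and helper functions are hypothetical, and it assumes the filtered field is indexed so that the Atlas Search `equals` operator can be used on it.

```python
from typing import Any, Dict, List

INDEX_NAME = "text_index"  # hypothetical Atlas Search index name
TEXT_FIELD = "text"        # hypothetical field holding the searchable text


def match_after_search(query: str, pre_filter: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Shape described above: $search runs first, then a separate $match filters."""
    return [
        {"$search": {"index": INDEX_NAME, "text": {"query": query, "path": TEXT_FIELD}}},
        {"$match": pre_filter},  # plain MQL match expression
    ]


def filter_inside_search(query: str, search_filter: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Shape the customer moved to: the filter is a compound.filter clause in $search."""
    return [
        {
            "$search": {
                "index": INDEX_NAME,
                "compound": {
                    "must": [{"text": {"query": query, "path": TEXT_FIELD}}],
                    "filter": search_filter,  # Atlas Search operator clauses, not MQL
                },
            }
        }
    ]


before = match_after_search("solar panels", {"category": "energy"})
after = filter_inside_search(
    "solar panels", [{"equals": {"path": "category", "value": "energy"}}]
)
```

Note that the two filters are written in different syntaxes: the $match stage takes an MQL expression, while compound.filter takes Atlas Search operator clauses.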
Slack discussion: (may no longer be available.) https://mongodb.slack.com/archives/C050RFSKQF7/p1761562852834629
Further Refinement
The customer has done a good job of describing how a user of HybridSearchRetriever would extend the filter to get more performant results for the text search stage. The current implementation was meant to be completely general. In practice, one would know what they want, and extending the class is what I expected users to do. In this case, I think we could add a kwarg, perhaps called search_filter, that is applied as an Atlas Search compound.filter within the $search stage. The fields being filtered on would also need to be indexed.
The additional code would look something like this, adding to search_stage["$search"]:
    if search_filter:
        # Use compound with must + filter
        if not isinstance(search_filter, list):
            search_filter = [search_filter]
        search_stage["$search"]["compound"] = {
            "must": [text_clause],
            "filter": search_filter,
        }
    else:
        # Simple text query without compound
        search_stage["$search"].update(text_clause)
This does not introduce a breaking change. I recommend updating the documentation to use the search_filter whenever one is able to.
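As a self-contained illustration of the proposal, the following sketch wraps the snippet above in a standalone function and shows both code paths. The function name, its parameters, and the example values are hypothetical and do not reflect the retriever's real signature. It assumes text_clause has the form {"text": {...}}, which is what the update() call in the snippet implies.

```python
from typing import Any, Dict, List, Optional, Union

SearchFilter = Union[Dict[str, Any], List[Dict[str, Any]]]


def build_search_stage(
    query: str,
    search_field: str,
    index_name: str,
    search_filter: Optional[SearchFilter] = None,
) -> Dict[str, Any]:
    """Build a $search stage, optionally wrapping the text query in compound.filter."""
    text_clause = {"text": {"query": query, "path": search_field}}
    search_stage: Dict[str, Any] = {"$search": {"index": index_name}}
    if search_filter:
        # Accept a single clause or a list of clauses
        if not isinstance(search_filter, list):
            search_filter = [search_filter]
        search_stage["$search"]["compound"] = {
            "must": [text_clause],
            "filter": search_filter,
        }
    else:
        # No filter given: keep the existing simple text query
        search_stage["$search"].update(text_clause)
    return search_stage


# Existing behaviour is unchanged when search_filter is omitted.
plain = build_search_stage("solar panels", "text", "text_index")

# A single clause is wrapped in a list before being placed in compound.filter.
filtered = build_search_stage(
    "solar panels",
    "text",
    "text_index",
    search_filter={"equals": {"path": "category", "value": "energy"}},
)
```

Because the stage is unchanged when search_filter is not passed, existing callers would see the same pipeline as before.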
Definition of Done
- Add the new compound.filter logic described above to both the Hybrid and Text search retrievers
- Document that filtering within $search is the best practice, and that the existing filter should be avoided for text search where possible, because it is applied as a $match stage in the query. Link to our documentation on compound.filter and indexes.
Pitfalls
If possible, it would be great to infer, based on the search_filter kwarg, what indexes are required and automatically create them if they do not exist. In practice, we may have to live with good documentation of the process.
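One possible first step toward that inference is simply collecting the field paths that a search_filter refers to, so they can be compared against the index definition or listed in an error message. The sketch below is an assumption about how this could work, not an existing utility: the function name is hypothetical, it only looks at "path" keys, and it does not attempt to decide which index field type each operator needs.

```python
from typing import Any, Dict, List, Union


def filter_paths(search_filter: Union[Dict[str, Any], List[Dict[str, Any]]]) -> List[str]:
    """Return the sorted, de-duplicated field paths referenced by filter clauses."""
    found = set()

    def walk(node: Any) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "path":
                    # "path" may be a single field name or a list of names
                    if isinstance(value, str):
                        found.add(value)
                    elif isinstance(value, list):
                        found.update(p for p in value if isinstance(p, str))
                else:
                    walk(value)  # descend into nested clauses, e.g. compound
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(search_filter)
    return sorted(found)


paths = filter_paths(
    [
        {"equals": {"path": "category", "value": "energy"}},
        {"compound": {"should": [{"range": {"path": "year", "gte": 2020}}]}},
    ]
)
```

Mapping each path to a suitable index field type would still depend on the operator and the value type, which is why good documentation may remain the practical answer.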
- blocks: INTPYTHON-752 Test pymongo-search-utils with langchain-mongodb (Closed)
- is blocked by: INTPYTHON-681 Rewrite HybridSearch retriever(s) to use new $rankFusion operator (Backlog)