Type: Task
Resolution: Unresolved
Priority: Unknown
Affects Version/s: None
Component/s: LangChain
We received feedback from the customer that the HybridSearchRetriever does not implement best practices.
Your earlier information helped us analyze the high memory consumption during search. You pointed out that we are using $match after $search. We were not modifying anything on our end, only passing the required parameters to the MongoDBAtlasHybridSearchRetriever class. The query was built directly inside the MongoDB Atlas hybrid search retriever, and we had no control over that. I tried updating or passing a few more parameters in the kwargs and pre-filter, but they did not appear in the actual query, and there was no way to modify the existing behavior of the hybrid search retriever.

To address this, I implemented a new custom search retriever that inherits from the MongoDBAtlasHybridSearchRetriever class. We now use compound.filter in $search and have removed $match. With this new implementation, I can see that bytesRead has been significantly reduced. I am sharing the pipeline in the attached text file. I also made changes to the search index and am sharing those with you.
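For reference, the difference between the two pipeline shapes described above can be sketched roughly as follows. This is an illustrative sketch only, not the customer's attached pipeline and not the library's actual code: the function names, the index name, and the field names (`text`, `tenant_id`) are hypothetical. Note that clauses under compound.filter must be Atlas Search operators (for example `equals`) rather than MQL match expressions, and the filtered field has to be mapped in the search index, which is presumably why the index needed changes.

```python
from typing import Any, Dict, List


def text_pipeline_with_match(
    query: str,
    index_name: str,
    text_path: str,
    pre_filter: Dict[str, Any],
    limit: int,
) -> List[Dict[str, Any]]:
    """Shape described in the ticket: $search, then an MQL $match stage."""
    return [
        {"$search": {"index": index_name, "text": {"query": query, "path": text_path}}},
        # The filter runs after $search has already produced its results.
        {"$match": pre_filter},
        {"$limit": limit},
    ]


def text_pipeline_with_compound_filter(
    query: str,
    index_name: str,
    text_path: str,
    filter_clauses: List[Dict[str, Any]],
    limit: int,
) -> List[Dict[str, Any]]:
    """Alternative shape: the filter is applied inside $search via compound.filter."""
    return [
        {
            "$search": {
                "index": index_name,
                "compound": {
                    # "must" clauses contribute to the relevance score.
                    "must": [{"text": {"query": query, "path": text_path}}],
                    # "filter" clauses restrict matches without affecting the score.
                    "filter": filter_clauses,
                },
            }
        },
        {"$limit": limit},
    ]


# Hypothetical values, for illustration only.
before = text_pipeline_with_match(
    "refund policy", "search_index", "text", {"tenant_id": "acme"}, 10
)
after = text_pipeline_with_compound_filter(
    "refund policy",
    "search_index",
    "text",
    [{"equals": {"path": "tenant_id", "value": "acme"}}],
    10,
)
```

Either list could be passed to a collection's aggregate() call. The second form has no $match stage, so the filtering happens inside the search stage itself.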
Additional context:
I think the library always uses $match for the text search pipeline because it is easier to write the MQL once:
param post_filter: List[Dict[str, Any]] | None = None
(Optional) Pipeline of MongoDB aggregation stages for postprocessing.
param pre_filter: Dict[str, Any] | None = None
(Optional) Any MQL match expression [emphasis mine] comparing an indexed field
So yes, there would be a difference for the $vectorSearch pipeline, but for $search it will always end up with a $match stage.
Slack discussion: https://mongodb.slack.com/archives/C050RFSKQF7/p1761562852834629
blocks: INTPYTHON-752 Test pymongo-search-utils with langchain-mongodb (In Progress)