Update the LangChain hybrid search retriever with best practices

    • Type: Task
    • Resolution: Unresolved
    • Priority: Unknown
    • Affects Version/s: None
    • Component/s: LangChain
    • Python Drivers
      1. What would you like to communicate to the user about this feature?
      2. Would you like the user to see examples of the syntax and/or executable code and its output?
      3. Which versions of the driver/connector does this apply to?


      We received feedback from a customer that the HybridSearchRetriever does not implement best practices.

      Your earlier information helped us analyze the high memory consumption during search. You pointed out that we are using $match after $search; we were not modifying anything on our end, only passing the required parameters to the MongoDBAtlasHybridSearchRetriever class. The query was built directly inside the MongoDB Atlas hybrid search retriever, and we had no control over it. I tried passing a few more parameters in the kwargs and pre-filter, but they did not appear in the actual query, and there was no way to modify the existing behavior of the hybrid search retriever.

      To address this, I implemented a new custom search retriever inheriting from the MongoDBAtlasHybridSearchRetriever class. Now we use compound.filter in $search and have removed $match. With this new implementation, I can see that bytesRead has been significantly reduced. I am sharing the pipeline in the attached text file. I also made changes to the search index and am sharing those with you.
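      The customer's attached pipeline is not reproduced in this ticket, so the following is only a minimal sketch of the general shape they describe: the filter moves into compound.filter inside $search, and no $match stage follows. The function name, the field names, the index name, and the score field are hypothetical and are not taken from the library or from the customer's file.

```python
from typing import Any, Dict, List


def build_filtered_text_search(
    query: str,
    search_field: str,
    index_name: str,
    filter_clauses: List[Dict[str, Any]],
    limit: int,
) -> List[Dict[str, Any]]:
    """Build a full-text pipeline that filters inside $search.

    Hypothetical helper for illustration only; it is not part of
    langchain-mongodb.
    """
    return [
        {
            "$search": {
                "index": index_name,
                "compound": {
                    # Scoring clause: the actual text query.
                    "must": [{"text": {"query": query, "path": search_field}}],
                    # Non-scoring clauses evaluated by the search index,
                    # taking the place of a later $match stage.
                    "filter": filter_clauses,
                },
            }
        },
        {"$set": {"fulltext_score": {"$meta": "searchScore"}}},
        {"$limit": limit},
    ]


# Example values are assumptions, not the customer's real schema.
pipeline = build_filtered_text_search(
    query="hybrid search",
    search_field="text",
    index_name="search_index",
    filter_clauses=[{"equals": {"path": "tenant_id", "value": 42}}],
    limit=5,
)
```

      Note that compound.filter takes Atlas Search operator clauses (such as equals or range), not MQL match expressions, so the filtered fields must be covered by the search index. That is consistent with the customer's remark that they also changed the index.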

       

      Additional context:
       
      I think the library always uses $match for the text search pipeline because that lets them write the MQL filter once:

      param post_filter: List[Dict[str, Any]] | None = None
      (Optional) Pipeline of MongoDB aggregation stages for postprocessing.
      param pre_filter: Dict[str, Any] | None = None
      (Optional) Any MQL match expression [emphasis mine] comparing an indexed field

      So yes, there would be a difference for the $vectorSearch pipeline, but for $search the pre_filter always ends up in a $match stage.
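      To make the asymmetry concrete, here is a rough sketch of where a single pre_filter lands in each pipeline, based only on the behavior described in this ticket. It is not the library's source code; both helper names and all field and index names are hypothetical.

```python
from typing import Any, Dict, List, Optional


def sketch_vector_pipeline(
    query_vector: List[float],
    pre_filter: Optional[Dict[str, Any]],
    limit: int,
) -> List[Dict[str, Any]]:
    """Vector side: the MQL pre_filter goes inside $vectorSearch."""
    stage: Dict[str, Any] = {
        "index": "vector_index",  # hypothetical index name
        "path": "embedding",  # hypothetical vector field
        "queryVector": query_vector,
        "numCandidates": limit * 10,
        "limit": limit,
    }
    if pre_filter:
        stage["filter"] = pre_filter  # applied during the vector search
    return [{"$vectorSearch": stage}]


def sketch_text_pipeline(
    query: str,
    pre_filter: Optional[Dict[str, Any]],
    limit: int,
) -> List[Dict[str, Any]]:
    """Text side: the same MQL pre_filter becomes a $match after $search."""
    pipeline: List[Dict[str, Any]] = [
        {"$search": {"index": "search_index", "text": {"query": query, "path": "text"}}}
    ]
    if pre_filter:
        # Applied only after $search has already produced its results.
        pipeline.append({"$match": pre_filter})
    pipeline.append({"$limit": limit})
    return pipeline


pre_filter = {"tenant_id": {"$eq": 42}}  # one MQL expression reused on both sides
vector_stages = sketch_vector_pipeline([0.1, 0.2, 0.3], pre_filter, limit=5)
text_stages = sketch_text_pipeline("hybrid search", pre_filter, limit=5)
```

      Reusing one MQL expression is convenient, but on the text side the filter runs outside the search index, which matches the higher bytesRead the customer reported.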
       
      Slack discussion: https://mongodb.slack.com/archives/C050RFSKQF7/p1761562852834629

            Assignee:
            Unassigned
            Reporter:
            Prakul Agarwal
            Votes:
            0
            Watchers:
            2

              Created:
              Updated: