[LangChain] Update the hybrid search retriever with best practices


    • Type: Task
    • Resolution: Unresolved
    • Priority: Unknown
    • None
    • Affects Version/s: None
    • Component/s: AI/ML, LangChain
    • None
    • Python Drivers

      1. What would you like to communicate to the user about this feature?
      2. Would you like the user to see examples of the syntax and/or executable code and its output?
      3. Which versions of the driver/connector does this apply to?


      Context

      We received feedback from a customer that the HybridSearchRetriever does not implement best practices.

      Your earlier information helped us analyze the high memory consumption during search. You pointed out that we are using $match after $search; we were not modifying anything on our end, only passing the required parameters to the MongoDBAtlasHybridSearchRetriever class. The query was built directly in the MongoDB Atlas hybrid search retriever, and we had no control over that. I tried updating or passing a few more parameters in the kwargs and pre-filter, but they did not appear in the actual query, and there was no way to modify the existing behavior of the hybrid search retriever. To address this, I implemented a new custom search retriever inheriting from the MongoDBAtlasHybridSearchRetriever class. Now we use compound.filter in $search and have removed $match. With this new implementation, I can see that bytesRead has been significantly reduced. I am sharing the pipeline in the attached text file. Also, I made changes to the search index and am sharing the same with you.

       

      Additional context:
       
      I think the library always uses $match for the text search pipeline because it is easier to write the MQL once:

      param post_filter: List[Dict[str, Any]] | None = None
      (Optional) Pipeline of MongoDB aggregation stages for postprocessing.
      param pre_filter: Dict[str, Any] | None = None
      (Optional) Any MQL match expression [emphasis mine] comparing an indexed field

      So yes, there would be a difference for the $vectorSearch pipeline, but for $search the filter always ends up in a separate $match stage.
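The difference described above can be sketched as follows. This is an illustration only, not the library's actual code: the function names `vector_pipeline` and `text_pipeline`, the index names, and the field names are assumptions made for the example. It shows a pre_filter being pushed into the $vectorSearch stage, while the same filter becomes a separate $match stage after $search in the text pipeline.

```python
from typing import Any, Dict, List, Optional


def vector_pipeline(
    query_vector: List[float], pre_filter: Optional[Dict[str, Any]] = None
) -> List[Dict[str, Any]]:
    """Hypothetical vector pipeline: the filter is applied inside $vectorSearch."""
    stage: Dict[str, Any] = {
        "index": "vector_index",  # assumed index name
        "path": "embedding",      # assumed embedding field
        "queryVector": query_vector,
        "numCandidates": 100,
        "limit": 10,
    }
    if pre_filter:
        stage["filter"] = pre_filter
    return [{"$vectorSearch": stage}]


def text_pipeline(
    query: str, pre_filter: Optional[Dict[str, Any]] = None
) -> List[Dict[str, Any]]:
    """Hypothetical text pipeline: the filter runs as $match after $search."""
    pipeline: List[Dict[str, Any]] = [
        {"$search": {"index": "search_index", "text": {"query": query, "path": "text"}}}
    ]
    if pre_filter:
        # The filter is evaluated only after $search has produced its results.
        pipeline.append({"$match": pre_filter})
    return pipeline
```

In this sketch the text pipeline cannot narrow the set of documents that $search itself considers, which is consistent with the higher bytesRead the customer reported.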

       
      Slack discussion: (may no longer be available.) https://mongodb.slack.com/archives/C050RFSKQF7/p1761562852834629

      Further Refinement

      The customer has done a good job of describing how a user of HybridSearchRetriever would extend the filter to get more performant results for the text search stage. The current implementation was meant to be completely general. In practice, users know what they want, and extending the class is what I expected them to do. In this case, I think we could add a kwarg, perhaps called search_filter, that is applied as an Atlas compound.filter within the $search stage. To use it, one would also need to index the fields being filtered in the search index.

      The additional code would look something like this, added where search_stage["$search"] is built:

      
          if search_filter:
              # Use compound with must + filter
              if not isinstance(search_filter, list):
                  search_filter = [search_filter]
      
              search_stage["$search"]["compound"] = {
                  "must": [text_clause],
                  "filter": search_filter,
              }
          else:
              # Simple text query without compound
              search_stage["$search"].update(text_clause)
      

      This does not introduce a breaking change. I recommend updating the documentation to advise using search_filter whenever possible.
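For reference, the snippet above can be wrapped into a self-contained function. This is a minimal sketch under stated assumptions: the function name `build_search_stage` and its parameters are hypothetical and not part of the library today, and `search_filter` is the proposed kwarg, not an existing one.

```python
from typing import Any, Dict, List, Optional, Union

Clause = Dict[str, Any]


def build_search_stage(
    query: str,
    search_field: str,
    index_name: str,
    search_filter: Optional[Union[Clause, List[Clause]]] = None,
) -> Dict[str, Any]:
    """Build a $search stage, using compound.filter when a filter is given."""
    text_clause: Clause = {"text": {"query": query, "path": search_field}}
    search_stage: Dict[str, Any] = {"$search": {"index": index_name}}

    if search_filter:
        # Accept a single clause or a list of clauses.
        if not isinstance(search_filter, list):
            search_filter = [search_filter]
        # "must" scores the text match; "filter" restricts results without scoring.
        search_stage["$search"]["compound"] = {
            "must": [text_clause],
            "filter": search_filter,
        }
    else:
        # Simple text query without compound.
        search_stage["$search"].update(text_clause)
    return search_stage


# Example: restrict the text search to documents whose "category" is "news".
stage = build_search_stage(
    "hybrid search",
    "text",
    "search_index",
    search_filter={"equals": {"path": "category", "value": "news"}},
)
```

Because the filter lives inside $search, no trailing $match stage is needed for it; the filtered field ("category" in this example) must be covered by the search index.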

      Definition of Done

      • Add the new compound.filter logic described above to both the hybrid search and text search retrievers
      • Document that filtering within $search is the best practice, and that pre_filter should be avoided for text search because it is applied as a separate $match stage after $search. Link to our documentation on compound.filter and indexes.

      Pitfalls

      If possible, it would be great to infer, based on the search_filter kwarg, what indexes are required and automatically create them if they do not exist. In practice, we may have to live with good documentation of the process.
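As a starting point for that inference, the field paths referenced by a search_filter could be collected by walking its clauses. The helper below is a hypothetical sketch, not an existing library function. It only gathers paths; which index field type each path needs depends on the operator and value type, and that mapping is not attempted here.

```python
from typing import Any, Dict, List, Union


def filter_paths(search_filter: Union[Dict[str, Any], List[Dict[str, Any]]]) -> List[str]:
    """Return the sorted, de-duplicated field paths referenced by filter clauses."""
    paths = set()

    def walk(node: Any) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "path" and isinstance(value, str):
                    paths.add(value)
                elif key == "path" and isinstance(value, list):
                    # Some operators accept a list of paths.
                    paths.update(p for p in value if isinstance(p, str))
                else:
                    # Recurse to handle nested compound clauses.
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(search_filter)
    return sorted(paths)


# Example with two filter clauses on different fields.
required = filter_paths(
    [
        {"equals": {"path": "category", "value": "news"}},
        {"range": {"path": "year", "gte": 2020}},
    ]
)
```

The resulting list could be compared against the fields in the existing search index definition to warn about, or create, missing mappings.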

            Assignee:
            Unassigned
            Reporter:
            Prakul Agarwal