[LangChain] Add Autoembedding to core VectorStore


    • None
    • Python Drivers
    • Needed
      Documentation will need to be done, with details to be discussed.


      Context

      Introduce support for auto-embedding vector indexes in the MongoDB LangChain VectorStore integration, while preserving backward compatibility with existing manual-embedding workflows.

      This proposal outlines API changes and refactors required across the VectorStore and Retriever layers to support auto-embedded indexes backed by MongoDB Atlas Vector Search.

      For more details, see the Google document "Design: Auto-Embedding", which links to the Syntax document.

      The current MongoDBAtlasVectorSearch implementation assumes that embedding vectors are always:

      • Generated client-side
      • Explicitly stored in a document field
      • Required at both insert and query time

      MongoDB Atlas now supports auto-embedded vector indexes, where:

      • Text is embedded server-side
      • No embedding vectors need to be generated or stored by the client

      Supporting this requires careful API evolution to avoid breaking existing users.

      VectorStore API Changes

      Constructor

      Current signature:

      def __init__(
          self,
          collection: Collection[Dict[str, Any]],
          embedding: Embeddings = AutoEmbedding(model_name="voyage-3.5"),
          index_name: str = "vector_index",
          text_key: Union[str, List[str]] = "text",
          embedding_key: str = "embedding",
          relevance_score_fn: str = "cosine",
          dimensions: int = -1,
          auto_create_index: bool | None = None,
          auto_index_timeout: int = 15,
          vector_index_options: dict | None = None,
          **kwargs: Any,
      ):
      

      Identified Issues

      • embedding is effectively required, even when embeddings should be auto-generated server-side
      • Parameters are not keyword-only, preventing reordering or default changes without breaking compatibility
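
      The keyword-only point can be illustrated with a bare * in the signature. This is a minimal sketch of the mechanism only: SketchVectorStore is a hypothetical stand-in with a reduced parameter list, not the real MongoDBAtlasVectorSearch.

```python
from typing import Any, Optional


class SketchVectorStore:
    """Hypothetical stand-in, not the real MongoDBAtlasVectorSearch."""

    def __init__(
        self,
        collection: Any,
        *,  # everything after this marker must be passed by keyword
        embedding: Optional[Any] = None,
        index_name: str = "vector_index",
        text_key: str = "text",
    ) -> None:
        self.collection = collection
        self.embedding = embedding
        self.index_name = index_name
        self.text_key = text_key


# Keyword use works regardless of parameter order or later default changes.
store = SketchVectorStore("my_collection", index_name="idx")

# Positional use of the keyword-only parameters raises TypeError.
try:
    SketchVectorStore("my_collection", None, "idx")
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

      Note that switching the existing parameters to keyword-only would itself break callers that pass them positionally, so this would need its own compatibility decision.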

      Proposed changes:

      New AutoEmbeddings Class

      Introduce a new Embeddings class. This maintains backward compatibility of the API and provides the metadata required to create auto-embedded vector indexes.

      Internal Convenience Flag

      During initialization, define an internal boolean (derived from auto_embed) to:

      • Control whether embedding vectors are generated client-side
      • Enable clean branching in new code paths

      This should minimize code changes. Our vector-based retrievers (Hybrid, Parent Document) can use the same flag to determine whether query text should be embedded locally.
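
      A minimal sketch of how such a flag could be derived at construction time. It assumes the flag is inferred from the type of the embedding object. The classes below are local stand-ins, and the attribute name _is_auto_embed is hypothetical.

```python
from typing import List


class Embeddings:
    """Local stand-in for the LangChain Embeddings base class."""

    def embed_query(self, text: str) -> List[float]:
        raise NotImplementedError


class AutoEmbeddings(Embeddings):
    """Hypothetical marker class: holds index metadata, never embeds locally."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name


class FakeEmbeddings(Embeddings):
    """Toy client-side embedder used only for this sketch."""

    def embed_query(self, text: str) -> List[float]:
        return [float(len(text))]


class SketchVectorStore:
    """Hypothetical stand-in for the VectorStore constructor logic."""

    def __init__(self, embedding: Embeddings) -> None:
        self._embedding = embedding
        # Internal convenience flag, computed once and reused by all code paths.
        self._is_auto_embed = isinstance(embedding, AutoEmbeddings)


auto_store = SketchVectorStore(AutoEmbeddings(model_name="voyage-3.5"))
manual_store = SketchVectorStore(FakeEmbeddings())
```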

      VectorStore insertion APIs

      add_texts / add_documents / bulk_embed_and_insert_texts

      Current behavior:

      • Embedding vectors are always computed client-side
      • Vectors are always stored in the document

      Auto-embedding behavior:

      • Skip client-side embedding entirely
      • Insert raw text only

      Proposed Changes

      • Refactor bulk_embed_and_insert_texts to support both modes
      • Preserve existing logic for manual embeddings
      • Bypass embedding logic when self.auto_embed=True

      The public API remains the same; behavior changes are driven by configuration.
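
      The two insertion modes could be sketched as below. build_insert_docs is a hypothetical helper, not the existing bulk_embed_and_insert_texts; passing embed_documents=None stands in for auto-embed mode here.

```python
from typing import Any, Callable, Dict, List, Optional


def build_insert_docs(
    texts: List[str],
    metadatas: List[Dict[str, Any]],
    *,
    text_key: str = "text",
    embedding_key: str = "embedding",
    embed_documents: Optional[Callable[[List[str]], List[List[float]]]] = None,
) -> List[Dict[str, Any]]:
    """Build the documents to insert; embed_documents=None means auto-embed."""
    if embed_documents is None:
        # Auto-embed mode: store raw text only, the server creates the vectors.
        return [{text_key: t, **m} for t, m in zip(texts, metadatas)]
    # Manual mode: compute vectors client-side and store them next to the text.
    vectors = embed_documents(texts)
    return [
        {text_key: t, embedding_key: v, **m}
        for t, v, m in zip(texts, vectors, metadatas)
    ]


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Toy embedder for the example: one dimension, the text length."""
    return [[float(len(t))] for t in texts]


metas = [{"src": 1}, {"src": 2}]
auto_docs = build_insert_docs(["a", "bb"], metas)
manual_docs = build_insert_docs(["a", "bb"], metas, embed_documents=toy_embed)
```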

      Vector Index Creation

      create_vector_search_index

      Current issues:

      • dimensions is a required argument
      • Not needed for auto-embedded indexes

      Proposed Changes

      • Change signature to:
        • dimensions: Optional[int] = -1
      • Add auto_embed=True support

      This method continues to delegate to pymongo-search-utils.
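
      A rough sketch of how the index definition could branch on auto_embed. build_index_definition is a hypothetical helper, not the pymongo-search-utils API. The manual branch uses the standard vector field layout. The auto-embed field layout is an assumption and should be checked against the Syntax document. The sketch also assumes dimensions has already been resolved to a positive value in manual mode.

```python
from typing import Any, Dict, Optional


def build_index_definition(
    path: str,
    *,
    dimensions: Optional[int] = None,
    similarity: str = "cosine",
    auto_embed: bool = False,
    model: Optional[str] = None,
) -> Dict[str, Any]:
    """Return an index definition dict for either mode (illustrative only)."""
    if auto_embed:
        if model is None:
            raise ValueError("auto_embed=True requires a model name")
        # ASSUMPTION: field layout for auto-embedded indexes; no dimensions needed.
        return {"fields": [{"type": "autoEmbed", "path": path, "model": model}]}
    if dimensions is None or dimensions <= 0:
        raise ValueError("dimensions must be a positive integer in manual mode")
    field = {
        "type": "vector",
        "path": path,
        "numDimensions": dimensions,
        "similarity": similarity,
    }
    return {"fields": [field]}


manual_def = build_index_definition("embedding", dimensions=1024)
auto_def = build_index_definition("text", auto_embed=True, model="voyage-3.5")

# Manual mode without dimensions is rejected.
try:
    build_index_definition("embedding")
    missing_dims_rejected = False
except ValueError:
    missing_dims_rejected = True
```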

      Open question:

      • Can an existing index be updated from text → autoEmbed?

      Similarity Search APIs

      Public Search Methods

      Public APIs accept raw query strings, which works well for both modes:

      • Manual embedding
      • Auto-embedding

      Internal Search Method

      Current internal method:

      • _similarity_search_with_score
        • Accepts query_vector: List[float]
        • Passes directly to pymongo-search-utils

      Proposed Changes

      • Introduce a new internal method:
        • _autoembedded_vector_search_stage
        • Accepts raw query text
        • Constructs the appropriate MongoDB search stage

      This avoids forcing auto-embedded workflows through a vector-based API.
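
      The two stage builders could look roughly like this. The manual stage follows the usual $vectorSearch shape with queryVector. The "query" key used for raw text in the auto-embed stage is an assumption and the actual shape should come from the Syntax document. Function and parameter names are illustrative.

```python
from typing import Any, Dict, List


def manual_vector_search_stage(
    query_vector: List[float],
    *,
    index_name: str,
    path: str,
    k: int,
    num_candidates: int,
) -> Dict[str, Any]:
    """Stage for manual mode: the client supplies the query vector."""
    return {
        "$vectorSearch": {
            "index": index_name,
            "path": path,
            "queryVector": query_vector,
            "numCandidates": num_candidates,
            "limit": k,
        }
    }


def autoembedded_vector_search_stage(
    query_text: str,
    *,
    index_name: str,
    path: str,
    k: int,
    num_candidates: int,
) -> Dict[str, Any]:
    """Stage for auto-embed mode: raw text is sent, the server embeds it."""
    return {
        "$vectorSearch": {
            "index": index_name,
            "path": path,
            "query": query_text,  # ASSUMPTION: key name for raw query text
            "numCandidates": num_candidates,
            "limit": k,
        }
    }


opts = {"index_name": "vector_index", "k": 4, "num_candidates": 40}
manual_stage = manual_vector_search_stage([0.1, 0.2], path="embedding", **opts)
auto_stage = autoembedded_vector_search_stage("what is X?", path="text", **opts)
```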

      get_by_ids API

      Current behavior:

      • Explicitly deletes doc["embedding"]

      Required Changes

      • Refactor to conditionally handle:
        • Documents with embedding vectors
        • Documents without embedding vectors (auto-embedded)
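
      One simple way to handle both document shapes is dict.pop with a default, which does not raise when the key is missing. strip_embedding is a hypothetical helper name, shown only as a sketch.

```python
from typing import Any, Dict


def strip_embedding(
    doc: Dict[str, Any], embedding_key: str = "embedding"
) -> Dict[str, Any]:
    """Return a copy of doc without the embedding field, if it has one."""
    cleaned = dict(doc)
    # pop with a default is a no-op for auto-embedded docs that store no vector,
    # whereas `del cleaned[embedding_key]` would raise KeyError for them.
    cleaned.pop(embedding_key, None)
    return cleaned


manual_doc = {"_id": 1, "text": "hello", "embedding": [0.1, 0.2]}
auto_doc = {"_id": 2, "text": "world"}

cleaned_manual = strip_embedding(manual_doc)
cleaned_auto = strip_embedding(auto_doc)
```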

      Retriever API Impact

      Relevant retrievers implement:

      • _get_relevant_documents

      Current behavior:

      • Always computes an embedding vector for the query text

      Proposed Changes

      • Use the VectorStore auto-embed flag to:
        • Embed queries client-side (manual mode)
        • Pass raw query text directly (auto-embed mode)

      This keeps retriever logic clean and avoids duplicated branching logic.
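
      The retriever-side branching could be sketched as follows. SketchStore and SketchRetriever are hypothetical stand-ins exposing only what the example needs. The real retrievers and VectorStore have different interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchStore:
    """Hypothetical stand-in for the VectorStore as seen by a retriever."""

    is_auto_embed: bool
    embed_query: Optional[Callable[[str], List[float]]]
    search_by_vector: Callable[[List[float]], List[str]]
    search_by_text: Callable[[str], List[str]]


class SketchRetriever:
    def __init__(self, vectorstore: SketchStore) -> None:
        self.vectorstore = vectorstore

    def _get_relevant_documents(self, query: str) -> List[str]:
        store = self.vectorstore
        if store.is_auto_embed:
            # Auto-embed mode: pass the raw query text straight through.
            return store.search_by_text(query)
        if store.embed_query is None:
            raise ValueError("manual mode requires a client-side embedder")
        # Manual mode: embed the query locally, then search by vector.
        return store.search_by_vector(store.embed_query(query))


def by_vector(vector: List[float]) -> List[str]:
    return [f"vector:{vector}"]


def by_text(text: str) -> List[str]:
    return [f"text:{text}"]


manual = SketchStore(False, lambda q: [float(len(q))], by_vector, by_text)
auto = SketchStore(True, None, by_vector, by_text)

manual_hits = SketchRetriever(manual)._get_relevant_documents("hi")
auto_hits = SketchRetriever(auto)._get_relevant_documents("hi")
```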

      Definition of Done

      • Existing manual-embedding workflows remain unchanged
      • Retrievers behave correctly in both modes
      • New basic tests are added but committed as skipped (pytest.skip). Until an automated setup exists in ai-ml-pipelines, they can be run with a docker-compose approach.

      Notes / Open Questions

      • Index migration feasibility: manual → autoEmbed
      • Validation of mixed-mode usage (multiple VectorStores)
      • Documentation updates required for new configuration options

            Assignee:
            Iris Ho
            Reporter:
            Casey Clements