0008. LLM and Embedding Provider Factory
Date: 2026-04-19
Status: Accepted
Context
The Movie Finder project relies heavily on Large Language Models (LLMs) and embedding models for its core functionality across two primary sub-repositories:
1. backend/chain: The FastAPI runtime executing the LangGraph agent pipeline.
2. rag/: The offline ingestion pipeline responsible for vectorizing the dataset.
Initially, both repositories hardcoded dependencies on paid frontier models (Anthropic's claude-haiku / claude-sonnet for the chain, and OpenAI's text-embedding-3-large for embeddings).
During development, relying exclusively on these paid APIs incurred significant cost and introduced rate-limiting bottlenecks (e.g., TPM/RPM limits during intensive RAG testing or automated QA). Furthermore, developers with capable local hardware (e.g., GPUs with 12 GB or more of VRAM) had no seamless way to offload inference to free local runtimes and libraries (such as Ollama, vLLM, or SentenceTransformers) without rewriting application code.
Decision
We are adopting a unified Provider Factory Pattern across the entire project ecosystem.
- Environment-Driven Instantiation: Code must never hardcode a `ChatAnthropic` or `OpenAI` client directly within node logic. Instead, instantiation is abstracted behind factory functions (e.g., `get_reasoning_llm()`, `get_embedding_model()`). These factories read dedicated environment variables to determine the provider:
  - `${NODE}_PROVIDER` (e.g., `CLASSIFIER_PROVIDER="ollama"`, `EMBEDDING_PROVIDER="huggingface"`)
  - `${NODE}_MODEL` (e.g., `CLASSIFIER_MODEL="llama4-8b"`, `EMBEDDING_MODEL="BAAI/bge-m3"`)
- Strict Pydantic Validation: The `ChainConfig` (and equivalent RAG config) must validate `${NODE}_PROVIDER` against a strict `Literal` whitelist (e.g., `"anthropic", "openai", "groq", "together", "ollama", "google", "huggingface"`) to ensure fail-fast behaviour at startup.
- Singleton Caching: To prevent connection pool exhaustion and redundant initialization, factory functions must be decorated with `@lru_cache(maxsize=1)`.
- Zero-Collision Vector Target Naming: Because different embedding models produce vectors of different dimensions (and even models of the same dimension have incompatible vector spaces), the vector collection/table/namespace name will no longer be static. Both the RAG ingestion pipeline and the backend runtime MUST dynamically resolve the vector target using the format: `{VECTOR_COLLECTION_PREFIX}_{sanitized_model_name}_{dimension}` (Example: `movies_bge_m3_1024` or `movies_text_embedding_3_large_3072`).
- Docker Image Optimization (Optional Dependencies): To prevent bloating the production Docker images, only the default/compatibility SDKs (`langchain-anthropic`, `langchain-openai`) will remain in the core `dependencies`. Heavy or alternative SDKs (`langchain-google-genai`, `sentence-transformers`, `torch`) will be declared in `[project.optional-dependencies]` (e.g., `providers` or `local`). Dockerfiles will use build arguments to conditionally install these groups during development builds.
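The first three rules can be illustrated with a minimal sketch. It is not the project's actual implementation: the real `ChainConfig` is a Pydantic model, whereas this dependency-free version substitutes a frozen dataclass with a manual check against the `Literal` whitelist. `NodeConfig` and `load_node_config` are hypothetical names, and the sketch assumes that `get_reasoning_llm()` reads `REASONING_PROVIDER` and `REASONING_MODEL`. Only two provider branches are shown.

```python
import os
from dataclasses import dataclass
from functools import lru_cache
from typing import Literal, Mapping, Optional, get_args

# Whitelist taken from the decision text above.
Provider = Literal[
    "anthropic", "openai", "groq", "together", "ollama", "google", "huggingface"
]


@dataclass(frozen=True)
class NodeConfig:
    """Hypothetical stand-in for one node's slice of ChainConfig."""

    provider: str
    model: str


def load_node_config(node: str, env: Optional[Mapping[str, str]] = None) -> NodeConfig:
    """Read ${NODE}_PROVIDER and ${NODE}_MODEL, failing fast on bad values."""
    env = os.environ if env is None else env
    provider = env.get(f"{node}_PROVIDER", "").strip().lower()
    model = env.get(f"{node}_MODEL", "").strip()
    if provider not in get_args(Provider):
        raise ValueError(
            f"{node}_PROVIDER={provider!r} is not one of {get_args(Provider)}"
        )
    if not model:
        raise ValueError(f"{node}_MODEL must be set")
    return NodeConfig(provider=provider, model=model)


@lru_cache(maxsize=1)
def get_reasoning_llm():
    """Build the reasoning LLM once per process; later calls reuse it."""
    cfg = load_node_config("REASONING")  # assumed variable prefix
    if cfg.provider == "anthropic":
        # Imported lazily so the SDK is only needed when it is selected.
        from langchain_anthropic import ChatAnthropic

        return ChatAnthropic(model=cfg.model)
    if cfg.provider == "ollama":
        from langchain_ollama import ChatOllama

        return ChatOllama(model=cfg.model)
    raise NotImplementedError(f"No builder sketched for {cfg.provider!r}")
```

Because each SDK is imported inside its own branch, a provider declared as an optional dependency only has to be installed when the environment actually selects it. Note that `lru_cache` does not cache exceptions, so a failed construction is retried on the next call.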
Consequences
Positive:
- Zero-Cost Development: Developers can run the entire stack locally using Ollama and CPU-based embeddings (like BGE-M3).
- Agility: Switching from Anthropic to Groq or Google Gemini requires zero code changes, only .env updates.
- Safety: The dynamic collection naming completely eliminates the risk of dimension mismatch errors or corrupting an existing vector space when testing new embedding models.
- Image Size: Production Docker images remain lean by excluding massive local ML libraries unless explicitly requested.
Negative:
- Complexity: The configuration schema is more verbose. Developers must ensure they have the correct optional dependencies installed if they choose an alternative provider.
- Coordination: The backend and the RAG ingestion pipelines must maintain strict parity on how they sanitize model names to generate the vector target suffix; otherwise, the backend will query a non-existent target.
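One way to reduce the coordination risk is for both repositories to use a single shared helper to resolve the vector target. The sketch below is an assumption, not the project's confirmed rule: the sanitization steps (drop any organization prefix before the last `/`, lowercase, collapse every run of non-alphanumeric characters into `_`) are inferred only from the two examples in the Decision section, and `sanitize_model_name` and `resolve_vector_target` are hypothetical names.

```python
import re


def sanitize_model_name(model_name: str) -> str:
    """Turn a model identifier into a lowercase, underscore-separated slug."""
    # Assumed rule: "BAAI/bge-m3" keeps only the part after the last slash.
    base = model_name.rsplit("/", 1)[-1]
    # Collapse every run of characters outside [a-z0-9] into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", base.lower())
    return slug.strip("_")


def resolve_vector_target(prefix: str, model_name: str, dimension: int) -> str:
    """Build {VECTOR_COLLECTION_PREFIX}_{sanitized_model_name}_{dimension}."""
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    slug = sanitize_model_name(model_name)
    if not slug:
        raise ValueError(f"model name {model_name!r} sanitizes to an empty string")
    return f"{prefix}_{slug}_{dimension}"


print(resolve_vector_target("movies", "BAAI/bge-m3", 1024))
# movies_bge_m3_1024
print(resolve_vector_target("movies", "text-embedding-3-large", 3072))
# movies_text_embedding_3_large_3072
```

If the ingestion pipeline and the backend both import this one function instead of each maintaining a copy, the two sides cannot drift apart on the suffix format.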