Movie Finder RAG¶

Offline ingestion pipeline for Movie Finder. It downloads the Wikipedia Movie Plots dataset, generates embeddings through the ADR 0008 provider factory, and writes vectors into the configured store.

Contributor workflow in this repo is Docker-only. Lint, tests, type checks, ingestion, validation, and backup all run through the committed Makefile. Host-managed Python environments are not part of the supported developer workflow.

Overview¶

Kaggle CSV -> CSV Loader -> Embedding Provider Factory -> Vector Store Factory
               (pandas)      (OpenAI/Ollama/HuggingFace/Google) (Qdrant/Chroma/Pinecone/PGVector)

ADR 0008 contract:

Embedding runtime is environment-driven via EMBEDDING_PROVIDER and EMBEDDING_MODEL.
Heavy local SDKs are installed only when WITH_PROVIDERS=local.
Final target name is resolved as {VECTOR_COLLECTION_PREFIX}_{sanitized_model}_{dimension}.
Query-time and ingestion-time embeddings must still be coordinated with chain/.

Local Workflow¶

Prerequisites¶

Docker 24+ with the Compose plugin
GNU Make
Kaggle API token (KGAT_...)
Provider credentials or a local runtime, depending on EMBEDDING_PROVIDER
Vector store credentials for the selected VECTOR_STORE

Setup¶

make init
make editor-up

If you want local HuggingFace / sentence-transformers support, set WITH_PROVIDERS=local before building:

WITH_PROVIDERS=local make init

Common Commands¶

make init
make editor-up
make editor-down
make shell

make lint
make fix
make format
make typecheck
make test
make test-coverage
make pre-commit
make check

make ingest
make validate
make backup
make retrieve

Ingestion¶

make ingest

make ingest runs the runtime image against the configured vector store using variables from .env. For remote backends, credentials stay external to the repo. For chromadb, persistence is through the mounted workspace path inside Docker.

Validation¶

make validate

make validate performs a post-ingestion smoke check using the configured provider and vector store. It is intended to confirm that the freshly written target can be queried end-to-end, not to replace a broader retrieval evaluation suite.

Backup¶

make backup

make backup runs inside Docker and exports the configured source backend into a portable ChromaDB artifact. This lets CI archive one stable backup format even when the source backend is qdrant, chromadb, pinecone, or pgvector.

Provider Configuration¶

The canonical embedding settings are:

Variable	Required	Description
`EMBEDDING_PROVIDER`	Yes	`openai`, `ollama`, `huggingface`, `sentence-transformers`, or `google`
`EMBEDDING_MODEL`	Yes	Model identifier for the selected provider
`EMBEDDING_DIMENSION`	No	Optional dimension override for providers that support it
`WITH_PROVIDERS`	No	Docker build extra, usually `local` for sentence-transformers + torch

Provider-specific credentials:

Variable	Required When	Description
`OPENAI_API_KEY`	`EMBEDDING_PROVIDER=openai`	OpenAI embeddings API key
`GOOGLE_API_KEY`	`EMBEDDING_PROVIDER=google`	Google embeddings API key
`OLLAMA_BASE_URL`	`EMBEDDING_PROVIDER=ollama`	Ollama base URL, local or cloud
`OLLAMA_API_KEY`	Optional	Required for Ollama cloud
`SENTENCE_TRANSFORMERS_CACHE_DIR`	Optional	Cache directory for local HuggingFace models

Vector Store Configuration¶

Variable	Required	Description
`VECTOR_STORE`	Yes	`qdrant`, `chromadb`, `pinecone`, or `pgvector`
`VECTOR_COLLECTION_PREFIX`	Yes	Prefix used for ADR 0008 target naming
`QDRANT_URL`	If qdrant	Qdrant endpoint
`QDRANT_API_KEY_RW`	If qdrant	Qdrant write-capable key
`PINECONE_INDEX_HOST`	If pinecone	Pinecone host
`PINECONE_API_KEY`	If pinecone	Pinecone API key
`PGVECTOR_DSN`	If pgvector	pgvector PostgreSQL DSN

Backend-specific settings:

Variable	Required When	Description
`QDRANT_URL`	`VECTOR_STORE=qdrant`	Qdrant cluster URL
`QDRANT_API_KEY_RW`	`VECTOR_STORE=qdrant`	Write-capable Qdrant API key
`CHROMADB_PERSIST_PATH`	`VECTOR_STORE=chromadb`	Local persistence path inside Docker, default `outputs/chromadb/local`
`PINECONE_API_KEY`	`VECTOR_STORE=pinecone`	Pinecone API key
`PINECONE_INDEX_NAME`	`VECTOR_STORE=pinecone`	Pinecone index name
`PINECONE_INDEX_HOST`	Optional	Existing Pinecone host override
`PINECONE_CLOUD` / `PINECONE_REGION`	Optional	Serverless creation settings
`PGVECTOR_DSN`	`VECTOR_STORE=pgvector`	PostgreSQL DSN with pgvector enabled
`PGVECTOR_SCHEMA`	Optional	Schema token used for pgvector target names

Ingestion Outputs¶

make ingest now emits:

ingestion-outputs.env
outputs/reports/cost-report.json
outputs/reports/skipped-movies.json

make validate emits:

outputs/reports/validation-report.json

make backup emits:

outputs/backups/**
outputs/backups/**/backup-manifest.json

Backups always emit a portable ChromaDB artifact, regardless of whether the source vector store is qdrant, chromadb, pinecone, or pgvector. The manifest records the source backend together with the embedding provider, model, and final dimension so the artifact can be traced back to the ingestion run that created it.

These artifacts are consumed by the Jenkins job for archiving and for backing up the dynamically resolved target name after ingestion.

VS Code¶

The committed .vscode/ config assumes this workflow:

Run make editor-up from this repo.
Use Dev Containers: Attach to Running Container....
Attach to the rag service container started from this repo.
Use the committed tasks for make ... targets.
Use the committed launch configs for ingestion or pytest inside the container.

The interpreter path inside the container is /opt/venv/bin/python.

CI/CD¶

The repo-local Jenkinsfile and .github/workflows/ci.yml both accept the ADR 0008 provider settings directly:

EMBEDDING_PROVIDER
EMBEDDING_MODEL
EMBEDDING_DIMENSION
EMBEDDING_API_KEY
VECTOR_STORE
OLLAMA_BASE_URL
CHROMADB_PERSIST_PATH
PINECONE_INDEX_NAME
PINECONE_INDEX_HOST
PINECONE_CLOUD
PINECONE_REGION
PGVECTOR_SCHEMA
WITH_PROVIDERS

Post-ingest CI flow:

make ingest
make validate
make backup
Archive ingestion-outputs.env, outputs/reports/**, and outputs/backups/**

Live CI operations remain manual-only for ingestion-style runs. Credentials should be provided by the CI system rather than committed env files.

After a successful ingestion run, share these values with the chain team:

VECTOR_STORE=<backend>
VECTOR_STORE_TARGET_NAME=<resolved-target-name>
EMBEDDING_PROVIDER=<provider>
EMBEDDING_MODEL=<model>
EMBEDDING_DIMENSION=<final-dimension>