Contributing to the RAG Ingestion Pipeline

This guide covers working on rag/ — the offline pipeline that builds the Qdrant vector collection. Repo: aharbii/movie-finder-rag

For cross-cutting conventions see the Contributing Overview.

What this pipeline does

Downloads the Kaggle movie dataset via kagglehub
Generates text-embedding-3-large embeddings (3072 dimensions) for each movie's plot
Upserts the vectors into the movies collection in Qdrant Cloud

This is an offline, manually triggered pipeline; it is not part of the application request path. Run it once to populate Qdrant, then rerun it only when the dataset or embedding model changes.
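The embed and upsert steps can be sketched as a batched loop. This is a minimal illustration, not the repository's implementation: `Movie`, `Point`, `ingest`, the `title` payload field, and the batch size are all hypothetical names and values, and the embedding and vector-store calls are passed in as plain callables so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Sequence


@dataclass(frozen=True)
class Movie:
    """Hypothetical record shape; the real dataset columns may differ."""

    movie_id: int
    title: str
    plot: str


@dataclass(frozen=True)
class Point:
    """Hypothetical stand-in for one vector-store point."""

    id: int
    vector: list[float]
    payload: dict[str, str]


Embed = Callable[[list[str]], list[list[float]]]
Upsert = Callable[[list[Point]], None]


def batched(movies: Sequence[Movie], size: int) -> Iterator[Sequence[Movie]]:
    """Yield consecutive slices of at most `size` movies."""
    for start in range(0, len(movies), size):
        yield movies[start : start + size]


def ingest(movies: Sequence[Movie], embed: Embed, upsert: Upsert, batch_size: int = 64) -> int:
    """Embed each plot and upsert the resulting points; return the number written."""
    written = 0
    for batch in batched(movies, batch_size):
        vectors = embed([movie.plot for movie in batch])
        points = [
            Point(id=movie.movie_id, vector=vector, payload={"title": movie.title})
            for movie, vector in zip(batch, vectors)
        ]
        upsert(points)
        written += len(points)
    return written
```

Passing the embedding and upsert steps in as callables keeps the loop testable with in-memory fakes; a real run would supply functions backed by the embedding provider and the Qdrant client.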

Development setup

rag_ingestion/ is a standalone uv project (not a uv workspace member). From rag/:

make dev         # build + start dev container
make test        # run pytest inside Docker
make lint        # ruff check + format check
make typecheck   # mypy --strict
make pre-commit  # all hooks

To run the full ingestion (requires a Qdrant write key):

make ingest

Embedding model coordination

Critical: The embedding model used at ingestion time must match the model used at query time.

Setting	Ingestion (`rag_ingestion/`)	Query time (`chain/`)
`EMBEDDING_MODEL`	`text-embedding-3-large`	`text-embedding-3-large`
`EMBEDDING_DIMENSION`	`3072`	`3072`

If you change the embedding model, update both repos and re-run the full ingestion to rebuild the collection. The existing vectors become incompatible with queries using a different model.
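One way to catch a mismatch early is a small comparison of the two settings before ingesting or deploying. The sketch below is an assumption about how such a guard could look; `EmbeddingSettings` and `find_mismatches` are hypothetical names and do not come from either repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSettings:
    """The two values that must agree between ingestion and query time."""

    model: str
    dimension: int


def find_mismatches(ingestion: EmbeddingSettings, query: EmbeddingSettings) -> list[str]:
    """Return one human-readable message per setting that differs."""
    problems: list[str] = []
    if ingestion.model != query.model:
        problems.append(f"EMBEDDING_MODEL: {ingestion.model!r} != {query.model!r}")
    if ingestion.dimension != query.dimension:
        problems.append(f"EMBEDDING_DIMENSION: {ingestion.dimension} != {query.dimension}")
    return problems
```

An empty list means the two sides agree. Any entry means the collection needs a full rebuild before queries will return meaningful results.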

Design patterns

Strategy pattern — the embedding provider is an injectable strategy; no if provider == "openai" branching in core pipeline logic
Configuration object — all settings via config.py (Pydantic BaseSettings); never os.getenv() scattered through business logic
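The two patterns above can be sketched together. This is an illustration under stated assumptions, not the code in this repo: `EmbeddingProvider`, `HashingProvider`, `PROVIDERS`, `build_provider`, and `embed_plots` are hypothetical names, and a frozen dataclass stands in for the Pydantic BaseSettings class so the example runs on the standard library alone.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Mapping, Protocol


class EmbeddingProvider(Protocol):
    """Strategy interface: anything with this method can be injected."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


@dataclass(frozen=True)
class Settings:
    """Dataclass stand-in for the Pydantic BaseSettings class in config.py."""

    embedding_provider: str
    embedding_dimension: int

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "Settings":
        # The only place that reads environment-style input.
        return cls(
            embedding_provider=env["EMBEDDING_PROVIDER"],
            embedding_dimension=int(env["EMBEDDING_DIMENSION"]),
        )


class HashingProvider:
    """Deterministic fake for tests; it is not a real embedding model."""

    def __init__(self, dimension: int) -> None:
        self._dimension = dimension

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([digest[i % len(digest)] / 255.0 for i in range(self._dimension)])
        return vectors


# Registry keyed by provider name; "hashing" is a made-up name for this sketch.
PROVIDERS: dict[str, Callable[[int], EmbeddingProvider]] = {"hashing": HashingProvider}


def build_provider(settings: Settings) -> EmbeddingProvider:
    """Resolve the strategy once, at the edge of the program."""
    return PROVIDERS[settings.embedding_provider](settings.embedding_dimension)


def embed_plots(provider: EmbeddingProvider, plots: list[str]) -> list[list[float]]:
    """Core pipeline logic: depends on the protocol only, never on a provider name."""
    return provider.embed(plots)
```

Adding a provider then means adding one class and one registry entry. `embed_plots` and the rest of the core logic stay unchanged, which is the point of keeping provider names out of it.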

Running ingestion via Jenkins

The Jenkins pipeline supports a manual ingest trigger:

Jenkins → movie-finder-rag → main → Build with Parameters
Set RUN_INGESTION=true
Set COLLECTION_NAME=movies (or a test collection name to avoid overwriting production data)

Required Jenkins credentials: qdrant-url, qdrant-api-key-rw, openai-api-key, kaggle-api-token.

Environment variables

Copy .env.example to .env and fill in:

EMBEDDING_PROVIDER, EMBEDDING_MODEL, EMBEDDING_DIMENSION
VECTOR_STORE, VECTOR_COLLECTION_PREFIX
QDRANT_URL, QDRANT_API_KEY_RW  (when VECTOR_STORE=qdrant)
OPENAI_API_KEY, GOOGLE_API_KEY, OLLAMA_BASE_URL  (as selected)
KAGGLE_API_TOKEN
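The conditional requirements above could be checked with a small preflight helper. The sketch below rests on assumptions: it treats every listed variable as mandatory, and the provider values `openai`, `google`, and `ollama` are guessed from the variable names rather than taken from .env.example. `missing_variables` is a hypothetical name.

```python
from typing import Mapping

# Assumed to be required on every run.
ALWAYS_REQUIRED = (
    "EMBEDDING_PROVIDER",
    "EMBEDDING_MODEL",
    "EMBEDDING_DIMENSION",
    "VECTOR_STORE",
    "VECTOR_COLLECTION_PREFIX",
    "KAGGLE_API_TOKEN",
)

# Required only when the matching VECTOR_STORE is selected.
STORE_REQUIRED: dict[str, tuple[str, ...]] = {
    "qdrant": ("QDRANT_URL", "QDRANT_API_KEY_RW"),
}

# Provider value names are assumptions inferred from the variable names.
PROVIDER_REQUIRED: dict[str, tuple[str, ...]] = {
    "openai": ("OPENAI_API_KEY",),
    "google": ("GOOGLE_API_KEY",),
    "ollama": ("OLLAMA_BASE_URL",),
}


def missing_variables(env: Mapping[str, str]) -> list[str]:
    """Return the names that are unset or empty for the selected store and provider."""
    required = list(ALWAYS_REQUIRED)
    required.extend(STORE_REQUIRED.get(env.get("VECTOR_STORE", ""), ()))
    required.extend(PROVIDER_REQUIRED.get(env.get("EMBEDDING_PROVIDER", ""), ()))
    return [name for name in required if not env.get(name)]
```

The function takes a mapping rather than reading the process environment itself, so it can be fed a parsed .env file in tests and stays consistent with the rule that settings are read in one place.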

Code standards

mypy --strict must pass
No raw os.getenv() — use config.py
No print() — use logging
Line length: 100
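As a small illustration of the logging rule, a progress message might be written as follows. The logger name `rag_ingestion` and the function `report_progress` are hypothetical; the repo may configure logging differently.

```python
import logging

logger = logging.getLogger("rag_ingestion")  # hypothetical logger name


def report_progress(done: int, total: int) -> None:
    """Log progress through the logging module instead of calling print()."""
    # Lazy %-style arguments defer string formatting until a handler accepts the record.
    logger.info("Upserted %d of %d vectors", done, total)
```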