Movie Finder RAG¶
Offline ingestion pipeline for Movie Finder. Downloads the Wikipedia Movie Plots Kaggle dataset, generates OpenAI embeddings, and upserts them into a Qdrant Cloud collection.
Contributor workflow in this repo is strictly Docker-only: all quality gates (lint,
test, typecheck, test-coverage, pre-commit) execute through the provided Makefile.
Host-managed Python environments are not supported for development.
Overview¶
Kaggle CSV -> CSV Loader -> Embedding Provider -> Qdrant Cloud
(pandas) (OpenAI/Gemini) (write-capable key)
The pipeline is re-runnable whenever the dataset changes or a fresh collection needs to be built
for the chain package.
Local Workflow¶
Prerequisites¶
- Docker 24+ with the Compose plugin
- GNU Make
- Qdrant Cloud write-capable credentials
- OpenAI API key
- Kaggle API token (
KGAT_prefixed token)
Setup¶
make init # build dev + runtime images, create .env from template, install git hook
make editor-up # start the attached-container workspace
After make editor-up, open .env and fill in the required API keys.
make editor-up starts the long-lived workspace container used for VS Code attach/debug flows.
The actual ingestion pipeline is still a one-shot command.
Common Commands¶
make init # build dev + runtime images, create .env from template, install git hook
make editor-up # start the attached-container workspace
make editor-down # stop the local workspace container
make shell # open zsh shell in the workspace container
make lint # ruff check (report only)
make fix # ruff check --fix + ruff format (auto-apply)
make format # ruff format
make typecheck # mypy (strict)
make test # pytest tests/
make test-coverage # pytest + coverage.xml + htmlcov + JUnit report
make pre-commit # full hook suite (also enforced on git commit)
make check # lint + typecheck + test-coverage (CI gate)
Ingestion¶
make ingest runs the runtime image against external Qdrant Cloud using the variables in .env.
No local Qdrant compose workflow exists in this repo.
VS Code¶
The committed .vscode/ config assumes this workflow:
- Run
make editor-upfrom the host workspace. - Use
Dev Containers: Attach to Running Container.... - Attach to the
ragservice container started from this repo. - Use the committed tasks for
make ...targets. - Use the committed launch configs for ingestion or pytest inside the container.
The interpreter path inside the container is /opt/venv/bin/python.
Environment¶
Canonical variables for this repo (DNA aligned with infrastructure#9):
| Variable | Required | Description |
|---|---|---|
QDRANT_URL |
Yes | Qdrant Cloud cluster URL |
QDRANT_API_KEY_RW |
Yes | Write-capable Qdrant API key |
QDRANT_COLLECTION_NAME |
Yes | Shared collection name consumed by chain |
OPENAI_API_KEY |
Yes | OpenAI API key for embeddings |
KAGGLE_API_TOKEN |
Yes | Standalone token prefixed with KGAT_ |
CI/CD Pipeline¶
See Jenkinsfile for the full pipeline. The current stages line up with the repo-local
developer contract:
| Stage | Command / trigger | Notes |
|---|---|---|
| Lint + Typecheck | make lint + make typecheck |
PRs, main, tags |
| Test | make test-coverage |
PRs, main, tags |
| Build Image | docker build --target runtime ... |
main and tags |
| Ingest | Manual RUN_INGESTION=true |
Triggered via Jenkins parameters |
Sharing Outputs With Chain¶
After a successful ingestion run, share these values with the chain team: