Contributing to the RAG Ingestion Pipeline
This guide covers working on backend/rag_ingestion/ — the offline pipeline that builds the Qdrant vector collection.
Repo: aharbii/movie-finder-rag
For cross-cutting conventions see the Contributing Overview.
What this pipeline does
- Downloads the Kaggle movie dataset via kagglehub
- Generates text-embedding-3-large embeddings (3072 dimensions) for each movie's plot
- Upserts the vectors into the movies collection in Qdrant Cloud
This is an offline, manually triggered pipeline — it is not part of the application request path. Run it once to populate Qdrant, then only again when the dataset or embedding model changes.
Development setup
rag_ingestion/ is a standalone uv project (not a uv workspace member). From backend/rag_ingestion/:
make dev # build + start dev container
make test # run pytest inside Docker
make lint # ruff check + format check
make typecheck # mypy --strict
make pre-commit # all hooks
To run the full ingestion (requires a Qdrant write key):
Embedding model coordination
Critical: The embedding model used at ingestion time must match the model used at query time.
| Setting | Ingestion (rag_ingestion/) | Query time (chain/) |
|---|---|---|
| EMBEDDING_MODEL | text-embedding-3-large | text-embedding-3-large |
| EMBEDDING_DIMENSION | 3072 | 3072 |
If you change the embedding model, update both repos and re-run the full ingestion to rebuild the collection. The existing vectors become incompatible with queries using a different model.
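A mismatch between the two sides is easier to catch before any vectors are written. The guard below is a minimal sketch of such a check, not code from the repo: the function name `check_embedding_compat` and the module-level constants are hypothetical, and only the model name and dimension come from the table above.

```python
# Values taken from the coordination table; in the real pipeline they
# would come from config.py rather than module constants (assumption).
EXPECTED_MODEL = "text-embedding-3-large"
EXPECTED_DIMENSION = 3072


def check_embedding_compat(model: str, vector: list[float]) -> None:
    """Raise ValueError if the model or vector size differs from the collection's."""
    if model != EXPECTED_MODEL:
        raise ValueError(
            f"model {model!r} does not match collection model {EXPECTED_MODEL!r}"
        )
    if len(vector) != EXPECTED_DIMENSION:
        raise ValueError(
            f"vector has {len(vector)} dimensions, expected {EXPECTED_DIMENSION}"
        )
```

Calling a check like this on the first embedded record, before the upsert loop starts, would fail fast instead of leaving a partially written collection.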
Design patterns
- Strategy pattern: the embedding provider is an injectable strategy; no if provider == "openai" branching in core pipeline logic
- Configuration object: all settings via config.py (Pydantic BaseSettings); never os.getenv() scattered through business logic
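The strategy pattern described above could look roughly like the following. This is an illustrative sketch using only the standard library; `EmbeddingProvider`, `FakeEmbeddingProvider`, and `embed_plots` are hypothetical names, not the repo's actual classes.

```python
from collections.abc import Sequence
from typing import Protocol


class EmbeddingProvider(Protocol):
    """Interface every embedding strategy is assumed to satisfy."""

    dimension: int

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class FakeEmbeddingProvider:
    """Deterministic stand-in for tests; a real provider would call an embeddings API."""

    def __init__(self, dimension: int = 4) -> None:
        self.dimension = dimension

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        # Encode each text as its length repeated `dimension` times.
        return [[float(len(text))] * self.dimension for text in texts]


def embed_plots(provider: EmbeddingProvider, plots: Sequence[str]) -> list[list[float]]:
    """Core pipeline step: depends only on the interface, never on a provider name."""
    vectors = provider.embed(plots)
    for vector in vectors:
        if len(vector) != provider.dimension:
            raise ValueError("provider returned a vector of unexpected size")
    return vectors
```

Because `embed_plots` only sees the protocol, swapping providers means passing a different object; the pipeline code itself does not change, and tests can run without network access.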
Running ingestion via Jenkins
The Jenkins pipeline supports a manual ingest trigger:
- Jenkins → movie-finder-rag → main → Build with Parameters
- Set RUN_INGESTION=true
- Set COLLECTION_NAME=movies (or a test collection name to avoid overwriting production data)
Required Jenkins credentials: qdrant-url, qdrant-api-key-rw, openai-api-key, kaggle-api-token.
Environment variables
Copy .env.example to .env and fill in:
OPENAI_API_KEY
QDRANT_URL, QDRANT_API_KEY_RW, QDRANT_COLLECTION_NAME
KAGGLE_API_TOKEN
EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIMENSION=3072
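A quick pre-flight check can report which of these variables are still unset before a run starts. The helper below is a hedged, standard-library sketch; `missing_variables` is a hypothetical name, and in the repo this validation would more likely be handled by the Pydantic settings class in config.py.

```python
from collections.abc import Mapping

# Variable names copied from the list above.
REQUIRED_VARIABLES = (
    "OPENAI_API_KEY",
    "QDRANT_URL",
    "QDRANT_API_KEY_RW",
    "QDRANT_COLLECTION_NAME",
    "KAGGLE_API_TOKEN",
    "EMBEDDING_MODEL",
    "EMBEDDING_DIMENSION",
)


def missing_variables(env: Mapping[str, str]) -> list[str]:
    """Return the required variable names that are absent or empty in `env`."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]
```

Passing the environment in as a mapping (for example `os.environ` at the entry point) keeps the check testable and keeps environment access in one place.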
Code standards
- mypy --strict must pass
- No raw os.getenv(); use config.py
- No print(); use logging
- Line length: 100
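As a small illustration of the logging rule, progress reporting might be written as below. The logger name `rag_ingestion` and the function `report_progress` are assumptions made for this example, not names taken from the repo.

```python
import logging

# Module-level logger; the name is an assumption for this sketch.
logger = logging.getLogger("rag_ingestion")


def report_progress(done: int, total: int) -> None:
    """Log upsert progress instead of printing it."""
    # Lazy %-style arguments avoid formatting when INFO is disabled.
    logger.info("Upserted %d/%d vectors", done, total)
```

Handlers and levels are then configured once at the entry point, so the same code can be quiet in tests and verbose in a Jenkins run.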