# Docker & Infrastructure
The Multimodal Video RAG system is built on a containerized architecture to ensure environment parity and simplify the orchestration of its heterogeneous components (GPU-accelerated ASR, Vector DB, and Vision LLMs).
## Infrastructure Overview
The system utilizes Docker Compose to manage five primary services:
- Backend: FastAPI server handling the LangGraph orchestration, ingestion logic, and PII anonymization.
- Frontend: Next.js 15 application providing the dashboard, search interface, and real-time ingestion monitoring.
- ChromaDB: Vector database for storing and querying multimodal embeddings.
- Ollama: Local inference server for running vision models (LLaVA) and LLMs (Llama 3.1).
- Faster-Whisper: Dedicated worker for CUDA-accelerated audio transcription.
## Prerequisites
To run the local inference stack, the host machine requires:
- Docker & Docker Compose
- NVIDIA GPU with at least 8GB VRAM (for concurrent Vision and ASR tasks).
- NVIDIA Container Toolkit installed and configured as the default Docker runtime.
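As a quick sanity check that the NVIDIA Container Toolkit is wired up, a CUDA base image should be able to see the GPU from inside a container (the image tag below is illustrative; any CUDA image that ships `nvidia-smi` works):

```shell
# Verify that containers can access the host GPU via the NVIDIA runtime.
# If this prints the GPU table, the toolkit is configured correctly.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If this fails with a runtime error, re-check that the toolkit is registered as the default Docker runtime before starting the stack.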
## Deployment Profiles
The project provides optimized configurations for different environments via specific compose files.
### Development Environment
Used for local testing and feature development. It includes hot-reloading for the frontend and backend.
```shell
docker compose -f docker/docker-compose.dev.yml up -d
```
### Production Environment
Optimized for performance, using pre-built images and stripped-down dependencies.
```shell
docker compose -f docker/docker-compose.yml up -d
```
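After either command, the state of the stack can be checked with standard Compose tooling. The `backend` service name here assumes the names defined in the compose file:

```shell
# List all services with their status and health state
docker compose -f docker/docker-compose.yml ps

# Follow the logs of a single service while debugging startup
docker compose -f docker/docker-compose.yml logs -f backend
```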
## Volume Management & Persistence
The system manages data across three persistent volumes to ensure that indexed videos and embeddings survive container restarts:
| Volume Name | Mount Point | Purpose |
| :--- | :--- | :--- |
| chroma_data | /index/data | Stores the vector database collections and metadata. |
| video_storage | /app/data/videos | Stores processed video files, extracted frames, and thumbnails. |
| ollama_models | /root/.ollama | Persists downloaded LLM and Vision models (e.g., LLaVA, Llama). |
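Because these are named volumes, they can be inspected and backed up with plain Docker commands. A minimal backup sketch, using a throwaway container (the archive filename is arbitrary):

```shell
# Show where Docker stores the vector database volume on the host
docker volume inspect chroma_data

# Archive the vector store into the current directory
docker run --rm -v chroma_data:/data -v "$(pwd)":/backup alpine \
  tar czf /backup/chroma_data.tar.gz -C /data .
```

The same pattern applies to `video_storage`; `ollama_models` is usually cheaper to re-pull than to back up.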
## Configuration (Environment Variables)
Infrastructure behavior is controlled via a `.env` file in the project root. Key variables include:
```dotenv
# Inference Steering
LLM_PROVIDER=ollama          # Options: 'ollama' or 'openrouter'
OPENROUTER_API_KEY=          # Required if LLM_PROVIDER=openrouter

# Resource Constraints
GPU_IDS=all                  # Which GPUs to expose to containers
MAX_CONCURRENT_INGESTIONS=2  # Limits parallel video processing to prevent OOM
```
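Variables exported in the shell take precedence over the `.env` file during Compose interpolation, which allows one-off overrides without editing the file, for example temporarily serializing ingestion on a small GPU:

```shell
# One-off override: process videos one at a time for this run only
MAX_CONCURRENT_INGESTIONS=1 docker compose -f docker/docker-compose.yml up -d
```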
## Initializing Inference Models
After the containers are healthy, pull the required model weights into the Ollama service; they are persisted in the `ollama_models` volume.
```shell
# Pull the Vision model for frame description
docker exec ollama ollama pull llava:7b

# Pull the Instruct model for the RAG agent
docker exec ollama ollama pull llama3.1:8b-instruct-q4_0
```
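Once the pulls complete, both models should appear in Ollama's local registry:

```shell
# Confirm the vision and instruct models are available
docker exec ollama ollama list
```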
## Health Monitoring
The Backend service exposes standard health check endpoints used by Docker to manage container lifecycle:
- `GET /health`: Basic connectivity and application status.
- `GET /ready`: Verification of database connections and model availability.
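With the backend published on port `8000`, these endpoints can be probed from the host; the exact response semantics are an assumption based on the descriptions above:

```shell
# Liveness: succeeds as soon as the application is up
curl -fsS http://localhost:8000/health

# Readiness: expected to fail until ChromaDB and the models are reachable
curl -fsS http://localhost:8000/ready
```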
## Network Topology
Services communicate over an internal bridge network:
- Backend: Port `8000` (Internal/External)
- Frontend: Port `3000` (Internal/External)
- ChromaDB: Port `8000` (Internal Only)
- Ollama: Port `11434` (Internal Only)
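Since ChromaDB and Ollama are not published to the host, connectivity checks have to run from inside the bridge network. A sketch, assuming the service hostnames match the container names above and that `curl` is available in the backend image:

```shell
# ChromaDB heartbeat, reached via its internal hostname
docker exec backend curl -fsS http://chromadb:8000/api/v1/heartbeat

# List loaded models through the Ollama HTTP API
docker exec backend curl -fsS http://ollama:11434/api/tags
```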