# Quick Start

## Prerequisites

Before starting, ensure your system meets the following requirements:

- Docker & Docker Compose installed.
- NVIDIA GPU with the NVIDIA Container Toolkit (required for GPU acceleration of Whisper ASR and the vision LLMs).
- Python 3.11+ (for local development outside of Docker).
## 1. Installation

Clone the repository and navigate to the project root:

```bash
git clone https://github.com/Asirwad/multimodal-video-rag.git
cd multimodal-video-rag
```
## 2. Environment Configuration

The system uses a `.env` file for configuration. Copy the example and adjust as needed:

```bash
cp .env.example .env
```

To use cloud-based inference (faster if you lack a high-end GPU), set your provider to `openrouter` and provide an API key. Otherwise, the system defaults to `ollama` for local inference:

```bash
LLM_PROVIDER=ollama

# If using OpenRouter:
# LLM_PROVIDER=openrouter
# OPENROUTER_API_KEY=your_api_key_here
```
## 3. Start Services

Launch the infrastructure using Docker Compose. This starts the FastAPI backend, Next.js frontend, ChromaDB (vector store), and Ollama:

```bash
docker compose -f docker/docker-compose.dev.yml up -d
```
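To confirm the services came up, you can probe their published ports. A small sketch: ports 3000 and 8000 come from this guide, 11434 is Ollama's default API port; adjust to match your compose file.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed ports: frontend and API from this guide, Ollama's default.
    for name, port in [("frontend", 3000), ("api", 8000), ("ollama", 11434)]:
        print(f"{name:10s} {'up' if port_open('localhost', port) else 'down'}")
```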
## 4. Download Models

If you are using local inference, you must pull the required vision and language models into the Ollama container:

```bash
# Vision model for scene description
docker exec ollama ollama pull llava:7b

# Language model for reasoning and chat
docker exec ollama ollama pull llama3.1:8b-instruct-q4_0
```
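To verify the pulls succeeded, Ollama's HTTP API lists installed models at `GET http://localhost:11434/api/tags`. A small sketch that checks the response for the two models above:

```python
import json
from urllib.request import urlopen

REQUIRED = {"llava:7b", "llama3.1:8b-instruct-q4_0"}


def installed_models(tags_json: str) -> set[str]:
    """Extract model names from the JSON body of Ollama's /api/tags endpoint."""
    return {m["name"] for m in json.loads(tags_json)["models"]}


if __name__ == "__main__":
    with urlopen("http://localhost:11434/api/tags") as resp:
        have = installed_models(resp.read().decode())
    missing = REQUIRED - have
    print("all models present" if not missing else f"missing: {sorted(missing)}")
```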
## 5. Basic Usage Workflow

### Step 1: Ingest a Video

You can ingest a video via the Web UI at http://localhost:3000/ingest or via a POST request to the API:

```bash
curl -X POST http://localhost:8000/api/v1/videos/ingest \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=jNQXAC9IVRw"}'
```

Note: The system will return a `job_id`. You can track the progress of frame extraction, transcription, and indexing via WebSockets.
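The same ingest call can be scripted. A minimal client sketch that relies only on the endpoint, request body, and `job_id` field described above (any other response fields would be assumptions):

```python
import json
from urllib.request import Request, urlopen

INGEST_API = "http://localhost:8000/api/v1/videos/ingest"


def ingest_request(video_url: str) -> Request:
    """Build the same POST request as the curl example above."""
    body = json.dumps({"url": video_url}).encode()
    return Request(INGEST_API, data=body, headers={"Content-Type": "application/json"})


def job_id_from(response_body: str) -> str:
    """Pull the job_id out of the ingest response."""
    return json.loads(response_body)["job_id"]


if __name__ == "__main__":
    with urlopen(ingest_request("https://www.youtube.com/watch?v=jNQXAC9IVRw")) as resp:
        print("job:", job_id_from(resp.read().decode()))
```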
### Step 2: Search Your Library

Once the status is "Ready," perform a multimodal search across both audio transcripts and visual descriptions:

```bash
curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "find moments where elephants are walking",
    "top_k": 5
  }'
```
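Or, scripted: a sketch that posts the same search payload. Only the endpoint, `query`, and `top_k` come from this guide; the response schema is not documented here, so the sketch just pretty-prints whatever comes back.

```python
import json
from urllib.request import Request, urlopen

SEARCH_API = "http://localhost:8000/api/v1/search"


def search_request(query: str, top_k: int = 5) -> Request:
    """Build the same POST body as the curl example above."""
    body = json.dumps({"query": query, "top_k": top_k}).encode()
    return Request(SEARCH_API, data=body, headers={"Content-Type": "application/json"})


if __name__ == "__main__":
    with urlopen(search_request("find moments where elephants are walking")) as resp:
        # Response schema is undocumented here, so just pretty-print it.
        print(json.dumps(json.load(resp), indent=2))
```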
### Step 3: Chat with the AI Agent

Navigate to the Chat Assistant in the UI to ask complex questions. The LangGraph agent will determine if it needs to search your video corpus or provide a direct response.

Example query: "Summarize what the speaker said about climate change and show me the visual evidence from the 5-minute mark."
## UI Entrypoints
| Interface | URL | Description |
|-----------|-----|-------------|
| Dashboard | http://localhost:3000/ | System health, storage stats, and recent operations. |
| Ingest | http://localhost:3000/ingest | Upload/link YouTube videos for processing. |
| Search | http://localhost:3000/search | Semantic discovery across all indexed videos. |
| Chat | http://localhost:3000/chat | Agentic interface for deep reasoning about video content. |
| API Docs | http://localhost:8000/docs | Interactive Swagger/OpenAPI documentation. |