System Architecture
The Multimodal Video RAG system is built on a decoupled architecture designed for high-throughput video processing and low-latency semantic retrieval. It leverages a modern stack of FastAPI, LangGraph, and Next.js to bridge the gap between raw video data and conversational AI.
High-Level Overview
The system architecture is divided into two primary logical flows: the Ingestion Pipeline (asynchronous processing) and the RAG Query Engine (synchronous retrieval and generation).
Core Components
1. The Ingestion Pipeline
The ingestion pipeline converts raw video files into a searchable multimodal index. When a user provides a YouTube URL, the system initiates a multi-stage background job:
- Extraction: Separates the audio stream from the video frames.
- Audio Processing: Uses Faster-Whisper to generate timestamped transcriptions.
- Visual Processing: Samples frames and uses vision-language models (VLMs) such as LLaVA to generate descriptive text for visual scenes.
- Vector Indexing: Both transcriptions and visual descriptions are embedded and stored in ChromaDB, allowing for cross-modal semantic search.
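The four stages above can be sketched as a single indexing pass. This is a minimal illustration, not the actual pipeline: the `Segment` type, the hash-based `embed` stub, and the plain-dict index are stand-ins for the real Faster-Whisper/VLM outputs and ChromaDB.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Segment:
    modality: str  # "audio" (transcript) or "visual" (frame description)
    start: float   # timestamp in seconds
    text: str


def embed(text: str) -> list[float]:
    # Toy deterministic embedding; the real system would call an embedding model.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]


def ingest(video_id: str, transcript: list[Segment],
           descriptions: list[Segment], index: dict) -> None:
    # Both modalities land in the same vector index, which is what
    # makes cross-modal semantic search possible.
    for seg in transcript + descriptions:
        index[(video_id, seg.modality, seg.start)] = (embed(seg.text), seg.text)
```

The key design point is that transcript segments and frame descriptions share one embedding space, so a text query can match either modality.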
2. The RAG Query Engine (LangGraph)
The retrieval-augmented generation flow is orchestrated by a LangGraph-powered agent. This allows the system to maintain context and handle complex reasoning:
- Intent Routing: The agent determines if the user is asking for a specific search ("Find the dog"), a general summary ("What happens in this video?"), or a follow-up question.
- Multimodal Retrieval: The system queries the vector store for both transcript segments and visual descriptions.
- Response Synthesis: The LLM combines the retrieved context with the user's query to generate an answer, complete with precise timestamps.
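The intent-routing step can be approximated with a heuristic classifier. The keyword rules below are purely illustrative assumptions, not the agent's actual logic, which runs inside the LangGraph workflow:

```python
def route_intent(query: str, has_history: bool = False) -> str:
    """Classify a user query into one of the agent's three routes:
    'summary', 'follow_up', or 'search'."""
    q = query.lower()
    # Phrase cues that suggest the user wants a whole-video summary.
    if any(cue in q for cue in ("summarize", "what happens", "overview")):
        return "summary"
    # Pronoun cues only count as a follow-up if there is prior chat history.
    tokens = set(q.replace("?", "").replace(".", "").split())
    if has_history and tokens & {"it", "that", "earlier", "previous"}:
        return "follow_up"
    # Default: a specific search like "Find the dog".
    return "search"
```

A real router would likely be an LLM call with structured output, but the three-way branch is the same.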
3. Real-time Communication
Because video ingestion is resource-intensive, the system utilizes WebSockets for real-time state synchronization.
- Endpoint: `/api/v1/ws/jobs/{job_id}`
- Purpose: Streams granular progress events (e.g., "Transcribing", "Indexing") to the frontend, allowing the user to monitor the pipeline status without polling.
Data Flow
Video Ingestion Flow
- POST `/api/v1/videos/ingest`: Accepts a URL and returns a `job_id`.
- Worker Task: Triggers the sequential extraction and AI description steps.
- State Update: The backend emits `ProgressEvent` objects via WebSocket.
- Completion: Once the `status` reaches `COMPLETED`, the video is marked as `searchable: true` in the metadata store.
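A `ProgressEvent` payload along these lines would support the flow above. Only `job_id`, `status`, and the stage names "Transcribing", "Indexing", and `COMPLETED` come from this document; the other stage names and the `progress` field are assumptions for illustration:

```python
import json
from dataclasses import dataclass, asdict

# Assumed stage sequence; only TRANSCRIBING, INDEXING, and COMPLETED
# are named in the surrounding text.
STAGES = ["QUEUED", "EXTRACTING", "TRANSCRIBING", "DESCRIBING",
          "INDEXING", "COMPLETED"]


@dataclass
class ProgressEvent:
    job_id: str
    status: str      # one of STAGES
    progress: float  # 0.0 .. 1.0 within the current stage

    def to_json(self) -> str:
        # WebSocket frames carry the event as a JSON object.
        return json.dumps(asdict(self))


def is_searchable(event: ProgressEvent) -> bool:
    # The video is flagged searchable only after the final stage.
    return event.status == "COMPLETED"
```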
Search and Chat Flow
- POST `/api/v1/search`: Executes a direct vector search across visual and audio embeddings. Returns a `SearchResponse` containing `SearchResult` objects with timestamps and scores.
- POST `/api/v1/chat`: Initiates a stateful session with the LangGraph agent. The agent may perform multiple internal searches to gather sufficient context before returning a `ChatResponse`.
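The core of the direct search endpoint can be sketched as cosine similarity over stored embeddings. The `SearchResult` field names below are assumptions beyond the timestamps and scores mentioned in this document, and the in-memory index is a stand-in for ChromaDB:

```python
import math
from dataclasses import dataclass


@dataclass
class SearchResult:
    video_id: str
    timestamp: float
    score: float
    text: str


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], index: dict, top_k: int = 5) -> list[SearchResult]:
    # index maps (video_id, modality, timestamp) -> (embedding, text);
    # audio and visual entries are scored identically (cross-modal search).
    results = [
        SearchResult(vid, ts, cosine(query_vec, vec), text)
        for (vid, _modality, ts), (vec, text) in index.items()
    ]
    return sorted(results, key=lambda r: r.score, reverse=True)[:top_k]
```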
Tech Stack Integration
| Component       | Technology          | Role                                                          |
| :---            | :---                | :---                                                          |
| Backend         | FastAPI             | RESTful API, WebSocket management, and service orchestration. |
| Orchestration   | LangGraph           | Agentic workflow and state management for RAG.                |
| Vector Database | ChromaDB            | Storage and retrieval of multimodal embeddings.               |
| ASR Engine      | Faster-Whisper      | High-speed, CUDA-accelerated audio transcription.             |
| Vision Models   | LLaVA / Llama 3.2V  | Generating textual descriptions of visual frames.             |
| Inference       | Ollama / OpenRouter | Local or cloud-based LLM execution.                           |
| Frontend        | Next.js 15          | Interactive dashboard, video player, and real-time logs.      |
System Constraints & Requirements
- Compute: Requires NVIDIA GPU support (via Docker) for CUDA-accelerated transcription and local VLM inference.
- Storage: Vector storage scales linearly with video duration and sampling frequency.
- Privacy: Supports PII detection and anonymization via Presidio before data is sent to external LLM providers (when using OpenRouter).
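The anonymization step can be illustrated with a toy regex redactor. This is emphatically not Presidio (which uses NER models and a library of recognizers); it is a minimal stand-in showing where redaction sits relative to the external LLM call:

```python
import re

# Toy recognizer: catches only email-shaped strings. Presidio covers
# names, phone numbers, credit cards, and many other entity types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def anonymize(text: str) -> str:
    # Replace detected PII spans with a placeholder before the text
    # leaves the machine for a cloud provider such as OpenRouter.
    return EMAIL.sub("<EMAIL>", text)
```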