System Architecture
The Multimodal Video RAG system is built on a decoupled architecture designed for high-throughput video processing and low-latency semantic retrieval. It leverages a modern stack of FastAPI, LangGraph, and Next.js to bridge the gap between raw video data and conversational AI.
High-Level Overview
The system architecture is divided into two primary logical flows: the Ingestion Pipeline (asynchronous processing) and the RAG Query Engine (synchronous retrieval and generation).
Core Components
1. The Ingestion Pipeline
The ingestion pipeline converts raw video files into a searchable multimodal index. When a user provides a YouTube URL, the system initiates a multi-stage background job:
- Extraction: Separates the audio stream from the video frames.
- Audio Processing: Uses Faster-Whisper to generate timestamped transcriptions.
- Visual Processing: Samples frames and uses vision-language models (VLMs) such as LLaVA to generate descriptive text for visual scenes.
- Vector Indexing: Both transcriptions and visual descriptions are embedded and stored in ChromaDB, allowing for cross-modal semantic search.
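The four stages above can be sketched as a single indexing pass. This is a minimal illustration, not the actual pipeline: the `Segment` type, the hash-based `embed` stub, and the plain-dict index are stand-ins for the real Faster-Whisper/VLM outputs and ChromaDB.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Segment:
    modality: str  # "audio" (transcript) or "visual" (frame description)
    start: float   # timestamp in seconds
    text: str


def embed(text: str) -> list[float]:
    # Toy deterministic embedding; the real system would call an embedding model.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]


def ingest(video_id: str, transcript: list[Segment],
           descriptions: list[Segment], index: dict) -> None:
    # Both modalities land in the same vector index, which is what
    # makes cross-modal semantic search possible.
    for seg in transcript + descriptions:
        index[(video_id, seg.modality, seg.start)] = (embed(seg.text), seg.text)
```

The key design point is that transcript segments and frame descriptions share one embedding space, so a text query can match either modality.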
2. The RAG Query Engine (LangGraph)
The retrieval-augmented generation flow is orchestrated by a LangGraph-powered agent. This allows the system to maintain context and handle complex reasoning:
- Intent Routing: The agent determines if the user is asking for a specific search ("Find the dog"), a general summary ("What happens in this video?"), or a follow-up question.
- Multimodal Retrieval: The system queries the vector store for both transcript segments and visual descriptions.
- Response Synthesis: The LLM combines the retrieved context with the user's query to generate an answer, complete with precise timestamps.
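The intent-routing step can be approximated with a heuristic classifier. The keyword rules below are purely illustrative assumptions, not the agent's actual logic, which runs inside the LangGraph workflow:

```python
def route_intent(query: str, has_history: bool = False) -> str:
    """Classify a user query into one of the agent's three routes:
    'summary', 'follow_up', or 'search'."""
    q = query.lower()
    # Phrase cues that suggest the user wants a whole-video summary.
    if any(cue in q for cue in ("summarize", "what happens", "overview")):
        return "summary"
    # Pronoun cues only count as a follow-up if there is prior chat history.
    tokens = set(q.replace("?", "").replace(".", "").split())
    if has_history and tokens & {"it", "that", "earlier", "previous"}:
        return "follow_up"
    # Default: a specific search like "Find the dog".
    return "search"
```

A real router would likely be an LLM call with structured output, but the three-way branch is the same.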
3. Real-time Communication
Because video ingestion is resource-intensive, the system utilizes WebSockets for real-time state synchronization.
- Endpoint: `/api/v1/ws/jobs/{job_id}`
- Purpose: Streams granular progress events (e.g., "Transcribing", "Indexing") to the frontend, allowing the user to monitor the pipeline status without polling.
Data Flow
Video Ingestion Flow
- POST `/api/v1/videos/ingest`: Accepts a URL and returns a `job_id`.
- Worker Task: Triggers the sequential extraction and AI description steps.
- State Update: The backend emits `ProgressEvent` objects via WebSocket.
- Completion: Once the `status` reaches `COMPLETED`, the video is marked as `searchable: true` in the metadata store.
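A `ProgressEvent` payload along these lines would support the flow above. Only `job_id`, `status`, and the stage names "Transcribing", "Indexing", and `COMPLETED` come from this document; the other stage names and the `progress` field are assumptions for illustration:

```python
import json
from dataclasses import dataclass, asdict

# Assumed stage sequence; only TRANSCRIBING, INDEXING, and COMPLETED
# are named in the surrounding text.
STAGES = ["QUEUED", "EXTRACTING", "TRANSCRIBING", "DESCRIBING",
          "INDEXING", "COMPLETED"]


@dataclass
class ProgressEvent:
    job_id: str
    status: str      # one of STAGES
    progress: float  # 0.0 .. 1.0 within the current stage

    def to_json(self) -> str:
        # WebSocket frames carry the event as a JSON object.
        return json.dumps(asdict(self))


def is_searchable(event: ProgressEvent) -> bool:
    # The video is flagged searchable only after the final stage.
    return event.status == "COMPLETED"
```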
Search and Chat Flow
- POST `/api/v1/search`: Executes a direct vector search across visual and audio embeddings. Returns a `SearchResponse` containing `SearchResult` objects with timestamps and scores.
- POST `/api/v1/chat`: Initiates a stateful session with the LangGraph agent. The agent may perform multiple internal searches to gather sufficient context before returning a `ChatResponse`.
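The core of the direct search endpoint can be sketched as cosine similarity over stored embeddings. The `SearchResult` field names below are assumptions beyond the timestamps and scores mentioned in this document, and the in-memory index is a stand-in for ChromaDB:

```python
import math
from dataclasses import dataclass


@dataclass
class SearchResult:
    video_id: str
    timestamp: float
    score: float
    text: str


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], index: dict, top_k: int = 5) -> list[SearchResult]:
    # index maps (video_id, modality, timestamp) -> (embedding, text);
    # audio and visual entries are scored identically (cross-modal search).
    results = [
        SearchResult(vid, ts, cosine(query_vec, vec), text)
        for (vid, _modality, ts), (vec, text) in index.items()
    ]
    return sorted(results, key=lambda r: r.score, reverse=True)[:top_k]
```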
Tech Stack Integration
| Component       | Technology          | Role                                                          |
| :---            | :---                | :---                                                          |
| Backend         | FastAPI             | RESTful API, WebSocket management, and service orchestration. |
| Orchestration   | LangGraph           | Agentic workflow and state management for RAG.                |
| Vector Database | ChromaDB            | Storage and retrieval of multimodal embeddings.               |
| ASR Engine      | Faster-Whisper      | High-speed, CUDA-accelerated audio transcription.             |
| Vision Models   | LLaVA / Llama 3.2V  | Generating textual descriptions of visual frames.             |
| Inference       | Ollama / OpenRouter | Local or cloud-based LLM execution.                           |
| Frontend        | Next.js 15          | Interactive dashboard, video player, and real-time logs.      |
System Constraints & Requirements
- Compute: Requires NVIDIA GPU support (via Docker) for CUDA-accelerated transcription and local VLM inference.
- Storage: Vector storage scales linearly with video duration and sampling frequency.
- Privacy: Supports PII detection and anonymization via Presidio before data is sent to external LLM providers (when using OpenRouter).
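The anonymization step can be illustrated with a toy regex redactor. This is emphatically not Presidio (which uses NER models and a library of recognizers); it is a minimal stand-in showing where redaction sits relative to the external LLM call:

```python
import re

# Toy recognizer: catches only email-shaped strings. Presidio covers
# names, phone numbers, credit cards, and many other entity types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def anonymize(text: str) -> str:
    # Replace detected PII spans with a placeholder before the text
    # leaves the machine for a cloud provider such as OpenRouter.
    return EMAIL.sub("<EMAIL>", text)
```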