Introduction
Overview
Multimodal Video RAG is a Retrieval-Augmented Generation (RAG) system that transforms video libraries into searchable, interactive knowledge bases. Unlike traditional video search tools that rely solely on metadata or manual tags, it performs deep analysis of both audio transcripts and visual content.
By combining state-of-the-art vision LLMs (like LLaVA or Llama 3.2 Vision) with Whisper-based speech recognition, the system allows users to query video content using natural language and receive precise, timestamped responses that reference exactly what was said and seen.
Core Value Proposition
- See and Hear: Search queries are executed against a unified index of visual descriptions (what is happening on screen) and audio transcripts (what is being said).
- Context-Aware Reasoning: Utilize a LangGraph-powered conversational agent that remembers session context and can reason across multiple videos to answer complex questions.
- Precision Navigation: Every answer includes direct links to specific timestamps, allowing users to jump straight to the relevant moment in a video.
- Privacy-First Design: Integrated PII (Personally Identifiable Information) detection via Microsoft Presidio ensures sensitive data is identified and handled securely.
- Infrastructure Flexibility: Run entirely locally using Ollama for maximum privacy and cost-efficiency, or scale with cloud providers like OpenRouter.
User Workflow
The system operates in two primary phases: Ingestion and Discovery.
1. Video Ingestion
Before a video can be searched, it must be processed through the ingestion pipeline. This is an asynchronous process that performs the following:
- Media Extraction: Separates the audio stream and extracts key visual frames.
- Transcription: Converts speech to text using Faster-Whisper.
- Visual Labeling: Uses Vision LLMs to describe scenes and actions within the video frames.
- Vector Indexing: Generates embeddings for both text and visual descriptions, storing them in a ChromaDB vector store for semantic retrieval.
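The chunking step between transcription and vector indexing can be sketched as follows. This is a minimal, hypothetical illustration (function and field names are assumptions, not the project's actual API): Faster-Whisper emits timestamped segments, which are merged into bounded-size chunks so every embedding stored in ChromaDB carries the timestamp needed for precision navigation.

```python
# Hypothetical sketch: group Whisper-style transcript segments into chunks
# for embedding. Assumed segment format: (start_seconds, end_seconds, text).

def chunk_segments(segments, max_chars=200):
    """Merge consecutive segments into chunks of roughly max_chars
    characters, keeping each chunk's start and end timestamps."""
    chunks = []
    current_text, start, end = [], None, None
    for seg_start, seg_end, text in segments:
        if start is None:
            start = seg_start
        current_text.append(text)
        end = seg_end
        if sum(len(t) for t in current_text) >= max_chars:
            chunks.append({"start": start, "end": end,
                           "text": " ".join(current_text)})
            current_text, start = [], None
    if current_text:  # flush the trailing partial chunk
        chunks.append({"start": start, "end": end,
                       "text": " ".join(current_text)})
    return chunks

segments = [
    (0.0, 4.2, "Welcome to the talk."),
    (4.2, 9.8, "Today we cover the system architecture."),
    (9.8, 15.1, "Let's start with the ingestion pipeline."),
]
chunks = chunk_segments(segments, max_chars=50)
```

Each chunk's `start` value is what later allows search results and chat answers to link back to an exact moment in the video.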
2. Search and Chat
Once ingested, users can interact with the content through two distinct interfaces:
- Semantic Search: Perform keyword or natural language searches to find specific moments across the entire library.
- Chat Assistant: Engage in a dialogue with an AI agent. You can ask follow-up questions, request summaries, or ask the agent to compare content across different videos.
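Because audio and visual descriptions live in a unified index, a query can surface hits from either modality. The sketch below is an assumption-laden illustration of how such results might be merged into one ranked list; in the real system, scores come from the ChromaDB vector store.

```python
# Hypothetical sketch: merge hits from the audio and visual indexes into a
# single ranked result list. Hit format assumed here:
# (timestamp_seconds, similarity_score, snippet).

def merge_results(audio_hits, visual_hits, top_k=3):
    """Combine hits from both modalities and return the top_k by score."""
    combined = [{"t": t, "score": s, "source": "audio", "snippet": txt}
                for t, s, txt in audio_hits]
    combined += [{"t": t, "score": s, "source": "visual", "snippet": txt}
                 for t, s, txt in visual_hits]
    combined.sort(key=lambda h: h["score"], reverse=True)
    return combined[:top_k]

audio_hits = [(165.0, 0.91, "presenter explains the data pipeline"),
              (75.0, 0.62, "introduction to system architecture")]
visual_hits = [(165.0, 0.88, "hand drawing a flowchart on a whiteboard")]
results = merge_results(audio_hits, visual_hits)
```

Note how the same moment (165 s) can appear in both indexes: the spoken explanation and the whiteboard drawing reinforce each other, which is the core of the "See and Hear" value proposition.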
System Interfaces
Web Interface
The project includes a modern, AWS-styled dashboard built with Next.js 15. It provides a graphical way to:
- Monitor ingestion job status with real-time WebSocket updates.
- Manage the video library (upload, delete, and view metadata).
- Search and play videos with timestamped highlights.
- Interact with the Chat Assistant.
REST API
For developers looking to integrate multimodal RAG into their own applications, the system exposes a structured FastAPI backend.
| Endpoint | Method | Description |
| :--- | :--- | :--- |
| /api/v1/videos/ingest | POST | Trigger a new video ingestion job via YouTube URL. |
| /api/v1/videos | GET | List all ingested videos and their processing status. |
| /api/v1/search | POST | Execute a semantic search across audio and visual indexes. |
| /api/v1/chat | POST | Send a query to the LangGraph conversational agent. |
| /api/v1/ws/jobs/{id} | WS | Subscribe to real-time progress events for a specific job. |
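As a rough sketch of driving the API from code, the snippet below builds requests for the ingest and search endpoints using only the standard library. The JSON field names (`url`, `query`, `top_k`) and port are assumptions; consult the FastAPI-generated OpenAPI docs (`/docs`) on a running instance for the actual schemas.

```python
# Hypothetical sketch of calling the REST API with the standard library.
# Payload field names and the base URL are assumptions, not confirmed schema.
import json
import urllib.request

BASE = "http://localhost:8000"

def build_request(path, payload):
    """Build a POST request with a JSON body for the given endpoint."""
    return urllib.request.Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

ingest = build_request("/api/v1/videos/ingest",
                       {"url": "https://www.youtube.com/watch?v=example"})
search = build_request("/api/v1/search",
                       {"query": "presenter drawing on whiteboard", "top_k": 5})
# urllib.request.urlopen(ingest) would submit the job against a live server;
# progress can then be followed on /api/v1/ws/jobs/{id}.
```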
Example Usage
With Multimodal Video RAG, you can move beyond simple keyword matching to complex visual and contextual queries:
User: "When does the presenter start drawing on the whiteboard?"
System: "The presenter begins drawing at [02:45]. This follows the discussion about system architecture that started at [01:15]."
User: "What does the diagram look like?"
System: "The diagram is a flowchart showing the data pipeline, featuring three main blocks: Ingestion, Processing, and Storage."
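The `[MM:SS]` markers in answers like the ones above map directly back to seconds in the index. A minimal helper for that rendering might look like this (the deep-link path is a hypothetical example, not the project's actual route):

```python
# Hypothetical helpers for rendering timestamps in answers and building
# player deep links. The /videos/{id}?t= route is an assumption.

def format_timestamp(seconds):
    """Format seconds as [MM:SS], e.g. 165 -> [02:45]."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"

def timestamp_link(video_id, seconds):
    """Deep link that seeks the player to the given moment."""
    return f"/videos/{video_id}?t={int(seconds)}"

print(format_timestamp(165))  # [02:45]
```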