Multimodal RAG Strategy
The Multimodal RAG (Retrieval-Augmented Generation) strategy employed in this system enables natural language discovery across two distinct yet synchronized data streams: Audio Transcripts and Visual Descriptions.
By converting both streams into a unified vector space, the system can answer questions like "When did the speaker mention the budget?" (Audio-based) and "When does the red car appear on screen?" (Visual-based) within the same query context.
Data Ingestion & Synchronization
The ingestion engine processes videos through a multi-stage pipeline to ensure temporal alignment between what is heard and what is seen.
1. Audio Stream (ASR)
The system extracts the audio track and processes it using Faster-Whisper.
- Output: Segments of text with precise `start` and `end` timestamps.
- Indexing: Each segment is stored as an `AUDIO` chunk type in the vector database.
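The shape of an indexed audio chunk can be sketched as follows. This is a minimal illustration, not the system's actual schema: the `Chunk` dataclass and `audio_chunks` helper are hypothetical names, and the segment objects stand in for what an ASR backend like Faster-Whisper returns (objects carrying `start`, `end`, and `text`).

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_type: str   # "AUDIO" or "VISUAL"
    text: str
    start: float      # seconds from the start of the video
    end: float
    video_id: str

def audio_chunks(segments, video_id):
    """Convert ASR segments (objects with .start, .end, .text) into AUDIO chunks."""
    return [
        Chunk("AUDIO", s.text.strip(), s.start, s.end, video_id)
        for s in segments
    ]
```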
2. Visual Stream (Vision LLM)
Keyframes are extracted at regular intervals or based on scene changes. These frames are passed to a Vision LLM (e.g., LLaVA or Llama 3.2 Vision).
- Output: Textual descriptions of the visual events occurring in the frame.
- Indexing: These descriptions are stored as `VISUAL` chunk types, mapped to the specific timestamp of the frame.
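Interval-based keyframe sampling is straightforward to sketch. The function below is an assumption about how regular-interval extraction could work (scene-change detection, the other strategy mentioned above, is omitted); the name and default interval are illustrative.

```python
def keyframe_timestamps(duration_s, interval_s=5.0):
    """Return evenly spaced timestamps (seconds) at which to grab keyframes."""
    timestamps = []
    t = 0.0
    while t < duration_s:
        timestamps.append(round(t, 2))
        t += interval_s
    return timestamps
```

Each returned timestamp becomes the anchor for one `VISUAL` chunk, so a visual "sighting" can later be linked back to an exact moment in the video.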
3. Unified Vector Space
Both audio segments and visual descriptions are embedded using the same embedding model and stored in ChromaDB. This allows a single query to retrieve relevant context from both modalities simultaneously.
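The key property of the unified space is that one query vector scores chunks of both modalities with the same similarity metric. A minimal sketch, assuming an in-memory index of `(chunk_type, embedding, payload)` tuples rather than the real ChromaDB store:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, index, top_k=5, modalities=("AUDIO", "VISUAL")):
    """Rank chunks of the requested modalities by similarity to the query."""
    hits = [
        (cosine(query_vec, emb), ctype, payload)
        for ctype, emb, payload in index
        if ctype in modalities
    ]
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:top_k]
```

Restricting `modalities` to a single chunk type is exactly what the `include_audio` / `include_visual` flags of the `/search` endpoint control.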
Search Discovery
The /search endpoint provides the primary interface for semantic discovery. You can configure the search to target specific modalities depending on your use case.
Search Request Schema
{
"query": "string", // The natural language search query
"top_k": 5, // Number of results to return (Default: 5)
"include_visual": true, // Whether to search visual descriptions
"include_audio": true, // Whether to search audio transcripts
"video_ids": ["string"] // Optional: Filter search to specific video IDs
}
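Server-side, the modality flags and optional `video_ids` translate naturally into a metadata filter. The sketch below builds a Chroma-style `where` clause (the `$in`/`$and` operator syntax is ChromaDB's; the `build_filter` helper and the `chunk_type`/`video_id` metadata keys are assumptions about this system's schema):

```python
def build_filter(include_audio=True, include_visual=True, video_ids=None):
    """Build a Chroma-style `where` filter from the search request fields."""
    types = []
    if include_audio:
        types.append("AUDIO")
    if include_visual:
        types.append("VISUAL")
    clauses = [{"chunk_type": {"$in": types}}]
    if video_ids:
        clauses.append({"video_id": {"$in": video_ids}})
    # Chroma expects a single clause or an explicit $and of several.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}
```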
Example Usage
To search for visual elements specifically:
curl -X POST http://localhost:8000/api/v1/search \
-H "Content-Type: application/json" \
-d '{
"query": "a person holding a blue folder",
"include_visual": true,
"include_audio": false
}'
Conversational Reasoning (LangGraph)
Beyond simple search, the system utilizes a LangGraph-powered Agent for complex reasoning. The agent follows a multi-step strategy when a user asks a question in the chat interface:
- Intent Classification: The agent determines if the query requires a search of the video library or can be answered using existing session context.
- Query Rewriting: The agent rewrites the user's natural-language question into a more effective search query before it is embedded.
- Cross-Modal Retrieval: The agent queries the vector store, retrieving both `AUDIO` and `VISUAL` chunks.
- Temporal Synthesis: The agent analyzes the timestamps of the retrieved chunks to provide a coherent answer, often combining visual "sightings" with audio "mentions."
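The four steps above can be sketched as a linear pipeline. This is a simplified stand-in: the real system implements each step as a LangGraph node backed by an LLM, whereas here `rewrite` and `retrieve` are injected callables and intent classification is reduced to a session-cache lookup.

```python
def answer(question, retrieve, rewrite=None, session_context=()):
    """Minimal sketch of the agent's four steps as one linear pipeline."""
    # 1. Intent classification (heuristic stand-in): reuse an answer
    #    already produced in this session instead of searching again.
    for prior_q, prior_a in session_context:
        if prior_q == question:
            return prior_a
    # 2. Query rewriting: condense the question into a search-friendly form.
    search_query = rewrite(question) if rewrite else question
    # 3. Cross-modal retrieval: AUDIO and VISUAL chunks come back together.
    chunks = retrieve(search_query)
    # 4. Temporal synthesis: order the evidence by timestamp so sightings
    #    and mentions line up chronologically in the final answer.
    chunks.sort(key=lambda c: c["timestamp"])
    return "\n".join(
        f'[{c["timestamp"]:.0f}s] ({c["chunk_type"]}) {c["text"]}' for c in chunks
    )
```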
Chat Response Structure
The chat response includes sources that provide direct links to the timestamps where the information was found.
| Field | Description |
| :--- | :--- |
| answer | The generated conversational response. |
| sources | An array of ChatSource objects containing video_id, timestamp, and transcript or visual_context. |
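A response might look like the following. The values are purely illustrative; only the field names come from the table above, and note that each source carries either `transcript` (audio) or `visual_context` (visual), depending on its modality:

```json
{
  "answer": "The red car first appears around 00:45, shortly after the speaker mentions the budget.",
  "sources": [
    {
      "video_id": "abc123",
      "timestamp": 45.2,
      "visual_context": "A red car enters the frame from the left."
    },
    {
      "video_id": "abc123",
      "timestamp": 38.7,
      "transcript": "...which brings us to the budget..."
    }
  ]
}
```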
Configuration & Tuning
The behavior of the RAG strategy can be adjusted via environment variables to balance between speed (Cloud) and privacy (Local).
| Variable | Recommended Value | Impact |
| :--- | :--- | :--- |
| LLM_PROVIDER | ollama / openrouter | Switches between local (Llama 3) and cloud-based reasoning. |
| VISION_MODEL | llava:7b | Determines the detail level of visual descriptions. |
| CHUNK_STRATEGY | semantic | Influences how audio is segmented for indexing. |
By default, the system prioritizes local-first inference using Ollama, ensuring that your video frames and transcripts never leave your infrastructure unless explicitly configured to use a cloud provider like OpenRouter.
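Putting the table's recommended values together, a local-first configuration might look like this (an illustrative `.env` fragment; only the three variables above are documented here):

```shell
# Local-first defaults: inference stays on your own hardware via Ollama
LLM_PROVIDER=ollama
VISION_MODEL=llava:7b
CHUNK_STRATEGY=semantic
```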