# Visual Scene Description
The Visual Scene Description engine is the "eyes" of the Multimodal Video RAG system. While traditional RAG systems rely solely on audio transcripts, this system uses Vision-Language Models (VLMs) to analyze video frames, letting you search for objects, actions, and atmospheric details that are never explicitly mentioned in the audio.
## Overview
During the ingestion process, the system performs a multi-stage visual analysis:
- Frame Extraction: The system samples the video at regular intervals to capture key visual changes.
- VLM Inference: Each extracted frame (or sequence of frames) is passed to a Vision LLM (like LLaVA or Llama 3.2 Vision).
- Contextual Description: The model generates a natural language description of the scene (e.g., "A person in a red jacket is walking through a snowy forest").
- Vector Indexing: These descriptions are embedded and stored in ChromaDB as visual chunks, separate from but synchronized with audio chunks.
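The frame-extraction step above can be sketched as a simple interval walk over the video's duration. This is a minimal illustration, not the system's actual implementation; the `sample_timestamps` helper and the 5-second default interval are assumptions.

```python
def sample_timestamps(duration_s: float, interval_s: float = 5.0) -> list[float]:
    """Return the timestamps (in seconds) at which frames are extracted.

    One frame is taken at the start of each interval, so a 12 s clip
    sampled every 5 s yields frames at 0 s, 5 s, and 10 s.
    """
    if duration_s <= 0 or interval_s <= 0:
        return []
    t, stamps = 0.0, []
    while t < duration_s:
        stamps.append(t)
        t += interval_s
    return stamps


print(sample_timestamps(12.0))  # [0.0, 5.0, 10.0]
```

Each sampled frame (or a short window of frames) is then handed to the VLM for description and the result indexed by its timestamp range.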
## Configuring the Vision Backend
The system supports both local and cloud-based vision models. You can configure the backend in your `.env` file.
### Local Inference (via Ollama)
For maximum privacy and zero cost, use Ollama to run vision models locally.
```
# .env configuration
LLM_PROVIDER=ollama
```
Before starting, ensure the vision model is pulled:
```bash
docker exec ollama ollama pull llava:7b
```
### Cloud Inference (via OpenRouter)
For higher throughput or if you lack a local GPU, use OpenRouter to access models like Llama 3.2 Vision.
```
# .env configuration
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=your_api_key_here
```
## Monitoring Ingestion
When you ingest a video via the `/api/v1/videos/ingest` endpoint, you can monitor the visual description progress through the UI or WebSocket events.
The visual analysis occurs during the Describing Scenes stage:
- Status Enum: `JobStatus.DESCRIBING`
- UI Indicator: The Stepper component highlights "Describing Scenes" while the VLM is processing frames.
## Searching Visual Content
To query the visual descriptions, set the `include_visual` flag to `true` in your search request.
### API Usage
Endpoint: `POST /api/v1/search`
```json
{
  "query": "blue car driving on a highway",
  "top_k": 5,
  "include_visual": true,
  "include_audio": false
}
```
### Response Schema
The search results identify the source of each match via the `chunk_type` field. Visual matches return descriptions generated by the VLM.
```json
{
  "results": [
    {
      "video_id": "vid_01HXYZ...",
      "chunk_type": "visual",
      "timestamp_start": 45.0,
      "timestamp_end": 50.0,
      "content": "A high-angle shot of a blue sedan traveling north on a multi-lane highway during sunset.",
      "score": 0.89
    }
  ]
}
```
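Client code typically filters this response down to visual hits and formats the timestamp range for display. A small sketch, with hypothetical helper names (none of this is part of the system's API):

```python
def visual_hits(response: dict, min_score: float = 0.0) -> list[str]:
    """Render visual matches as '[start-end] (score) description' lines."""
    lines = []
    for r in response.get("results", []):
        if r["chunk_type"] != "visual" or r["score"] < min_score:
            continue
        lines.append(f'[{r["timestamp_start"]:.0f}s-{r["timestamp_end"]:.0f}s] '
                     f'({r["score"]:.2f}) {r["content"]}')
    return lines


resp = {"results": [{
    "video_id": "vid_01HXYZ", "chunk_type": "visual",
    "timestamp_start": 45.0, "timestamp_end": 50.0,
    "content": "A blue sedan on a highway.", "score": 0.89,
}]}
print(visual_hits(resp)[0])  # [45s-50s] (0.89) A blue sedan on a highway.
```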
## Usage in Chat
The Chat Assistant uses these visual descriptions to answer complex reasoning questions. If you ask, "What color was the car in the first half of the video?", the LangGraph agent retrieves visual chunks to provide an answer, even if the speaker in the video never mentions the car's color.
| Capability | Audio RAG | Visual Scene Description |
| :--- | :---: | :---: |
| Search spoken words | ✅ | ❌ |
| Search for visual objects | ❌ | ✅ |
| Identify actions/gestures | ❌ | ✅ |
| Timestamp accuracy | High (word-level) | Medium (frame-interval) |