# Visual Scene Description
The Visual Scene Description engine is the "eyes" of the Multimodal Video RAG system. While traditional RAG systems rely solely on audio transcripts, this system uses Vision-Language Models (VLMs) to analyze video frames, letting you search for objects, actions, and atmospheric details that are never explicitly mentioned in the audio.
## Overview
During the ingestion process, the system performs a multi-stage visual analysis:
- Frame Extraction: The system samples the video at regular intervals to capture key visual changes.
- VLM Inference: Each extracted frame (or sequence of frames) is passed to a Vision LLM (like LLaVA or Llama 3.2 Vision).
- Contextual Description: The model generates a natural language description of the scene (e.g., "A person in a red jacket is walking through a snowy forest").
- Vector Indexing: These descriptions are embedded and stored in ChromaDB as visual chunks, separate from but synchronized with audio chunks.
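The frame-extraction step above can be sketched as a simple interval walk over the video's duration. This is a minimal illustration, not the system's actual implementation; the `sample_timestamps` helper and the 5-second default interval are assumptions.

```python
def sample_timestamps(duration_s: float, interval_s: float = 5.0) -> list[float]:
    """Return the timestamps (in seconds) at which frames are extracted.

    One frame is taken at the start of each interval, so a 12 s clip
    sampled every 5 s yields frames at 0 s, 5 s, and 10 s.
    """
    if duration_s <= 0 or interval_s <= 0:
        return []
    t, stamps = 0.0, []
    while t < duration_s:
        stamps.append(t)
        t += interval_s
    return stamps


print(sample_timestamps(12.0))  # [0.0, 5.0, 10.0]
```

Each sampled frame (or a short window of frames) is then handed to the VLM for description and the result indexed by its timestamp range.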
## Configuring the Vision Backend
The system supports both local and cloud-based vision models. You can configure the backend in your `.env` file.
### Local Inference (via Ollama)
For maximum privacy and zero cost, use Ollama to run vision models locally.
```
# .env configuration
LLM_PROVIDER=ollama
```
Before starting, ensure the vision model is pulled:
```bash
docker exec ollama ollama pull llava:7b
```
### Cloud Inference (via OpenRouter)
For higher throughput or if you lack a local GPU, use OpenRouter to access models like Llama 3.2 Vision.
```
# .env configuration
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=your_api_key_here
```
## Monitoring Ingestion
When you ingest a video via the `/api/v1/videos/ingest` endpoint, you can monitor the visual description progress through the UI or WebSocket events.
The visual analysis occurs during the Describing Scenes stage:
- Status Enum: `JobStatus.DESCRIBING`
- UI Indicator: The Stepper component highlights "Describing Scenes" while the VLM is processing frames.
## Searching Visual Content
To query the visual descriptions, set the `include_visual` flag to `true` in your search request.
### API Usage
Endpoint: `POST /api/v1/search`
```json
{
  "query": "blue car driving on a highway",
  "top_k": 5,
  "include_visual": true,
  "include_audio": false
}
```
### Response Schema
The search results identify the source of each match via the `chunk_type` field. Visual matches return descriptions generated by the VLM.
```json
{
  "results": [
    {
      "video_id": "vid_01HXYZ...",
      "chunk_type": "visual",
      "timestamp_start": 45.0,
      "timestamp_end": 50.0,
      "content": "A high-angle shot of a blue sedan traveling north on a multi-lane highway during sunset.",
      "score": 0.89
    }
  ]
}
```
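Client code typically filters this response down to visual hits and formats the timestamp range for display. A small sketch, with hypothetical helper names (none of this is part of the system's API):

```python
def visual_hits(response: dict, min_score: float = 0.0) -> list[str]:
    """Render visual matches as '[start-end] (score) description' lines."""
    lines = []
    for r in response.get("results", []):
        if r["chunk_type"] != "visual" or r["score"] < min_score:
            continue
        lines.append(f'[{r["timestamp_start"]:.0f}s-{r["timestamp_end"]:.0f}s] '
                     f'({r["score"]:.2f}) {r["content"]}')
    return lines


resp = {"results": [{
    "video_id": "vid_01HXYZ", "chunk_type": "visual",
    "timestamp_start": 45.0, "timestamp_end": 50.0,
    "content": "A blue sedan on a highway.", "score": 0.89,
}]}
print(visual_hits(resp)[0])  # [45s-50s] (0.89) A blue sedan on a highway.
```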
## Usage in Chat
The Chat Assistant uses these visual descriptions to answer complex reasoning questions. If you ask, "What color was the car in the first half of the video?", the LangGraph agent retrieves visual chunks to provide an answer, even if the speaker in the video never mentions the car's color.
| Capability | Audio RAG | Visual Scene Description |
| :--- | :---: | :---: |
| Search spoken words | ✅ | ❌ |
| Search for visual objects | ❌ | ✅ |
| Identify actions/gestures | ❌ | ✅ |
| Timestamp accuracy | High (word-level) | Medium (frame-interval) |