Multimodal RAG Strategy
The Multimodal RAG (Retrieval-Augmented Generation) strategy employed in this system enables natural language discovery across two distinct yet synchronized data streams: Audio Transcripts and Visual Descriptions.
By converting both streams into a unified vector space, the system can answer questions like "When did the speaker mention the budget?" (Audio-based) and "When does the red car appear on screen?" (Visual-based) within the same query context.
Data Ingestion & Synchronization
The ingestion engine processes videos through a multi-stage pipeline to ensure temporal alignment between what is heard and what is seen.
1. Audio Stream (ASR)
The system extracts the audio track and processes it using Faster-Whisper.
- Output: Segments of text with precise `start` and `end` timestamps.
- Indexing: Each segment is stored as an `AUDIO` chunk type in the vector database.
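The shape of an indexed audio chunk can be sketched as follows. This is a minimal illustration, not the system's actual schema: the `Chunk` dataclass and `audio_chunks` helper are hypothetical names, and the segment objects stand in for what an ASR backend like Faster-Whisper returns (objects carrying `start`, `end`, and `text`).

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_type: str   # "AUDIO" or "VISUAL"
    text: str
    start: float      # seconds from the start of the video
    end: float
    video_id: str

def audio_chunks(segments, video_id):
    """Convert ASR segments (objects with .start, .end, .text) into AUDIO chunks."""
    return [
        Chunk("AUDIO", s.text.strip(), s.start, s.end, video_id)
        for s in segments
    ]
```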
2. Visual Stream (Vision LLM)
Keyframes are extracted at regular intervals or based on scene changes. These frames are passed to a Vision LLM (e.g., LLaVA or Llama 3.2 Vision).
- Output: Textual descriptions of the visual events occurring in the frame.
- Indexing: These descriptions are stored as `VISUAL` chunk types, mapped to the specific timestamp of the frame.
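Interval-based keyframe sampling is straightforward to sketch. The function below is an assumption about how regular-interval extraction could work (scene-change detection, the other strategy mentioned above, is omitted); the name and default interval are illustrative.

```python
def keyframe_timestamps(duration_s, interval_s=5.0):
    """Return evenly spaced timestamps (seconds) at which to grab keyframes."""
    timestamps = []
    t = 0.0
    while t < duration_s:
        timestamps.append(round(t, 2))
        t += interval_s
    return timestamps
```

Each returned timestamp becomes the anchor for one `VISUAL` chunk, so a visual "sighting" can later be linked back to an exact moment in the video.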
3. Unified Vector Space
Both audio segments and visual descriptions are embedded using the same embedding model and stored in ChromaDB. This allows a single query to retrieve relevant context from both modalities simultaneously.
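The key property of the unified space is that one query vector scores chunks of both modalities with the same similarity metric. A minimal sketch, assuming an in-memory index of `(chunk_type, embedding, payload)` tuples rather than the real ChromaDB store:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, index, top_k=5, modalities=("AUDIO", "VISUAL")):
    """Rank chunks of the requested modalities by similarity to the query."""
    hits = [
        (cosine(query_vec, emb), ctype, payload)
        for ctype, emb, payload in index
        if ctype in modalities
    ]
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:top_k]
```

Restricting `modalities` to a single chunk type is exactly what the `include_audio` / `include_visual` flags of the `/search` endpoint control.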
Search Discovery
The /search endpoint provides the primary interface for semantic discovery. You can configure the search to target specific modalities depending on your use case.
Search Request Schema
{
"query": "string", // The natural language search query
"top_k": 5, // Number of results to return (Default: 5)
"include_visual": true, // Whether to search visual descriptions
"include_audio": true, // Whether to search audio transcripts
"video_ids": ["string"] // Optional: Filter search to specific video IDs
}
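Server-side, the modality flags and optional `video_ids` translate naturally into a metadata filter. The sketch below builds a Chroma-style `where` clause (the `$in`/`$and` operator syntax is ChromaDB's; the `build_filter` helper and the `chunk_type`/`video_id` metadata keys are assumptions about this system's schema):

```python
def build_filter(include_audio=True, include_visual=True, video_ids=None):
    """Build a Chroma-style `where` filter from the search request fields."""
    types = []
    if include_audio:
        types.append("AUDIO")
    if include_visual:
        types.append("VISUAL")
    clauses = [{"chunk_type": {"$in": types}}]
    if video_ids:
        clauses.append({"video_id": {"$in": video_ids}})
    # Chroma expects a single clause or an explicit $and of several.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}
```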
Example Usage
To search for visual elements specifically:
curl -X POST http://localhost:8000/api/v1/search \
-H "Content-Type: application/json" \
-d '{
"query": "a person holding a blue folder",
"include_visual": true,
"include_audio": false
}'
Conversational Reasoning (LangGraph)
Beyond simple search, the system utilizes a LangGraph-powered Agent for complex reasoning. The agent follows a multi-step strategy when a user asks a question in the chat interface:
- Intent Classification: The agent determines if the query requires a search of the video library or can be answered using existing session context.
- Query Rewriting: The agent rewrites the user's natural-language question into a more effective search query before it is embedded.
- Cross-Modal Retrieval: The agent queries the vector store, retrieving both `AUDIO` and `VISUAL` chunks.
- Temporal Synthesis: The agent analyzes the timestamps of the retrieved chunks to provide a coherent answer, often combining visual "sightings" with audio "mentions."
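The four steps above can be sketched as a linear pipeline. This is a simplified stand-in: the real system implements each step as a LangGraph node backed by an LLM, whereas here `rewrite` and `retrieve` are injected callables and intent classification is reduced to a session-cache lookup.

```python
def answer(question, retrieve, rewrite=None, session_context=()):
    """Minimal sketch of the agent's four steps as one linear pipeline."""
    # 1. Intent classification (heuristic stand-in): reuse an answer
    #    already produced in this session instead of searching again.
    for prior_q, prior_a in session_context:
        if prior_q == question:
            return prior_a
    # 2. Query rewriting: condense the question into a search-friendly form.
    search_query = rewrite(question) if rewrite else question
    # 3. Cross-modal retrieval: AUDIO and VISUAL chunks come back together.
    chunks = retrieve(search_query)
    # 4. Temporal synthesis: order the evidence by timestamp so sightings
    #    and mentions line up chronologically in the final answer.
    chunks.sort(key=lambda c: c["timestamp"])
    return "\n".join(
        f'[{c["timestamp"]:.0f}s] ({c["chunk_type"]}) {c["text"]}' for c in chunks
    )
```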
Chat Response Structure
The chat response includes sources that provide direct links to the timestamps where the information was found.
| Field | Description |
| :--- | :--- |
| answer | The generated conversational response. |
| sources | An array of ChatSource objects containing video_id, timestamp, and transcript or visual_context. |
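A response might look like the following. The values are purely illustrative; only the field names come from the table above, and note that each source carries either `transcript` (audio) or `visual_context` (visual), depending on its modality:

```json
{
  "answer": "The red car first appears around 00:45, shortly after the speaker mentions the budget.",
  "sources": [
    {
      "video_id": "abc123",
      "timestamp": 45.2,
      "visual_context": "A red car enters the frame from the left."
    },
    {
      "video_id": "abc123",
      "timestamp": 38.7,
      "transcript": "...which brings us to the budget..."
    }
  ]
}
```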
Configuration & Tuning
The behavior of the RAG strategy can be adjusted via environment variables to balance between speed (Cloud) and privacy (Local).
| Variable | Recommended Value | Impact |
| :--- | :--- | :--- |
| LLM_PROVIDER | ollama / openrouter | Switches between local (Llama 3) and cloud-based reasoning. |
| VISION_MODEL | llava:7b | Determines the detail level of visual descriptions. |
| CHUNK_STRATEGY | semantic | Influences how audio is segmented for indexing. |
By default, the system prioritizes local-first inference using Ollama, ensuring that your video frames and transcripts never leave your infrastructure unless explicitly configured to use a cloud provider like OpenRouter.
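Putting the table's recommended values together, a local-first configuration might look like this (an illustrative `.env` fragment; only the three variables above are documented here):

```shell
# Local-first defaults: inference stays on your own hardware via Ollama
LLM_PROVIDER=ollama
VISION_MODEL=llava:7b
CHUNK_STRATEGY=semantic
```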