Prompts & Tool Definitions
The Multimodal Video RAG system uses a LangGraph-powered agentic workflow. The agent orchestrates interactions between the user, the vector database (ChromaDB), and various inference models. This section outlines the logic governing the agent's behavior and the tools it uses to provide timestamped insights.
Agentic Workflow & System Prompts
The system relies on a set of specialized prompts to handle intent routing, query refinement, and multimodal reasoning.
Intent Routing
Before performing a search, the system classifies the user's input. This prevents unnecessary resource usage for general conversational queries.
- Logic: The agent determines if the query requires factual information from the video library or is a general greeting/clarification.
- Stages: `thinking` → `classifying`.
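A routing node based on this classification can be sketched as follows. The helper names and the keyword heuristic (standing in for an actual LLM classifier call) are illustrative assumptions, not the project's code:

```python
# Illustrative sketch of intent routing. A real node would call the LLM;
# here a simple keyword heuristic stands in for the classifier.

SEARCH_HINTS = ("when", "where", "show", "find", "what", "who", "timestamp")

def classify_intent(user_query: str) -> str:
    """Return 'search' for factual video queries, 'chat' for small talk."""
    q = user_query.lower()
    if any(hint in q for hint in SEARCH_HINTS):
        return "search"
    return "chat"

def route(state: dict) -> str:
    # LangGraph conditional edges dispatch on the returned label.
    return classify_intent(state["user_query"])
```

Only queries routed to `search` proceed to retrieval; everything else is answered conversationally without touching the vector store.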
Query Rewriting
To improve retrieval accuracy from the vector store, the agent rewrites natural language questions into optimized search terms.
- Logic: Converts "Where did he talk about the budget?" into a search-optimized query like "speaker discussing financial budget, revenue, and expenses."
- Stage: `rewriting`.
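A minimal sketch of this rewriting step is shown below; the prompt wording and the keyword-stripping fallback (used so the example runs without a model server) are assumptions:

```python
# Hedged sketch of query rewriting. The prompt text is illustrative;
# the fallback branch lets the example run without an LLM backend.

REWRITE_PROMPT = (
    "Rewrite the user's question as a dense semantic-search query. "
    "Drop filler words, expand pronouns, and add likely synonyms.\n"
    "Question: {question}\nSearch query:"
)

def rewrite_query(question: str, llm=None) -> str:
    prompt = REWRITE_PROMPT.format(question=question)
    if llm is None:
        # Fallback stub: strip filler words so the sketch runs standalone.
        filler = {"where", "did", "he", "she", "the", "about", "talk"}
        words = question.lower().strip("?").split()
        return " ".join(w for w in words if w not in filler)
    return llm(prompt)
```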
Response Generation (Multimodal RAG)
The core prompt instructs the LLM to synthesize visual descriptions and audio transcripts into a cohesive answer with mandatory timestamping.
- Instructions:
  - Always cite the `start` and `end` timestamps for every claim.
  - Distinguish between visual evidence (e.g., "An elephant is seen...") and audio evidence (e.g., "The speaker mentions...").
  - If no relevant content is found in the search results, explicitly state that the information is not present in the ingested videos.
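These instructions could be packed into a prompt template like the hedged sketch below; the exact wording and the chunk field `modality` are assumptions, not the project's actual prompt:

```python
# Illustrative RAG answer prompt; the project's real wording may differ.

RAG_PROMPT = """You are a video analysis assistant.
Answer ONLY from the retrieved context below.
Rules:
- Cite start and end timestamps for every claim, e.g. [00:12-00:45].
- Mark visual evidence as what is seen, audio evidence as what is said.
- If the context does not contain the answer, say the information is not
  present in the ingested videos.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(chunks: list[dict], question: str) -> str:
    # Each chunk is rendered as "[start-end] (modality) content".
    context = "\n".join(
        f"[{c['timestamp_start']}-{c['timestamp_end']}] ({c['modality']}) {c['content']}"
        for c in chunks
    )
    return RAG_PROMPT.format(context=context, question=question)
```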
Tool Definitions
The AI agent interacts with the backend through a set of functional tools. These are internal interfaces that the LangGraph nodes call to perform actions.
1. Multimodal Search Tool
The primary tool for retrieving context. It performs a semantic search across both audio transcripts (Whisper) and visual scene descriptions (LLaVA/Llama 3.2 Vision).
Input Parameters (SearchRequest):
| Parameter | Type | Description |
| :--- | :--- | :--- |
| query | string | The search-optimized string. |
| top_k | int | Number of relevant chunks to retrieve (default: 5). |
| video_ids | list[str] | Optional: Restrict search to specific videos. |
| include_visual | bool | Whether to query the visual embedding index. |
| include_audio | bool | Whether to query the transcript embedding index. |
Output (SearchResult):
Returns a list of matching chunks, each containing `content`, `timestamp_start`, `timestamp_end`, and a relevance `score`.
2. PII Detection & Anonymization
Powered by Microsoft Presidio, this tool identifies and masks Personally Identifiable Information (PII) before queries are processed by external LLM providers (e.g., OpenRouter).
- Supported Entities: Names, Email Addresses, Phone Numbers, IP Addresses.
- Action: Replaces sensitive data with placeholders (e.g., `[PERSON_01]`).
- Stage: `pii_anonymized`.
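Detection itself is delegated to Presidio; the numbered-placeholder substitution it drives can be sketched as below. The `mask_pii` helper and its span format are assumptions for illustration, not Presidio's API:

```python
from collections import defaultdict

def mask_pii(text: str, spans: list[tuple[int, int, str]]) -> tuple[str, dict]:
    """Replace detected spans with numbered placeholders like [PERSON_01].

    `spans` are (start, end, entity_type) triples such as an analyzer
    would yield; detection is out of scope for this sketch. The returned
    mapping allows de-anonymizing the LLM's response later.
    """
    counters = defaultdict(int)
    mapping = {}
    out, cursor = [], 0
    for start, end, entity in sorted(spans):
        counters[entity] += 1
        placeholder = f"[{entity}_{counters[entity]:02d}]"
        mapping[placeholder] = text[start:end]
        out.append(text[cursor:start])
        out.append(placeholder)
        cursor = end
    out.append(text[cursor:])
    return "".join(out), mapping
```

Keeping the placeholder-to-value mapping server-side means the original names never leave the machine, while the agent's answer can still be rehydrated before display.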
3. Video Metadata Lookup
Provides the agent with technical details about a specific video to ensure it can reference titles and source URLs correctly in its final answer.
Output (VideoInfo):
| Field | Description |
| :--- | :--- |
| video_id | Unique identifier. |
| title | The title of the video. |
| source_url | Original YouTube link. |
| duration_seconds | Total length of the video, in seconds. |
Observability & State Management
The agent communicates its progress via WebSockets using `ChatEvent` objects, which allows the frontend to display the "Thinking" process in real time.
Example `ChatEvent` for a searching operation:

```json
{
  "event": "status",
  "data": {
    "stage": "searching",
    "query": "elephants in the savanna",
    "intent": "search"
  },
  "timestamp": "2023-10-27T10:00:00Z"
}
```
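Serializing such an event could look like the sketch below; the `make_chat_event` helper is an assumption, and only the field layout follows the example above:

```python
import json
from datetime import datetime, timezone

# Sketch of ChatEvent serialization for the WebSocket channel.
# Field names mirror the JSON example; the helper itself is hypothetical.

def make_chat_event(stage: str, **data) -> str:
    event = {
        "event": "status",
        "data": {"stage": stage, **data},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)
```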
Response Schema
When the agent completes its reasoning, it returns a structured `ChatResponse`:

```json
{
  "answer": "Elephants appear at [00:04]. The speaker mentions their migration patterns at [00:12-00:45].",
  "sources": [
    {
      "video_id": "vid_123",
      "timestamp_seconds": 4,
      "visual_context": "A herd of elephants walking through tall grass."
    }
  ],
  "query": "When do elephants appear?"
}
```
Configuration
Prompts and tool behaviors can be influenced by the `LLM_PROVIDER` environment variable.
- Ollama: Uses local system prompts optimized for Llama 3/Mistral.
- OpenRouter: Uses cloud-optimized prompts with higher reasoning capabilities (e.g., Claude 3.5 Sonnet or GPT-4o).
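Provider-dependent prompt selection could be wired up as in this sketch; the registry contents and the `ollama` fallback default are assumptions, not the project's actual values:

```python
import os

# Hypothetical prompt registry keyed by provider; real prompt text and
# defaults would live in the project's configuration.

PROMPTS = {
    "ollama": "local-llama-system-prompt",
    "openrouter": "cloud-claude-system-prompt",
}

def get_system_prompt() -> str:
    provider = os.environ.get("LLM_PROVIDER", "ollama").lower()
    try:
        return PROMPTS[provider]
    except KeyError:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
```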