Prompts & Tool Definitions
The Multimodal Video RAG system uses a LangGraph-powered agentic workflow. The agent orchestrates interactions between the user, the vector database (ChromaDB), and various inference models. This section outlines the logic governing the agent's behavior and the tools it uses to provide timestamped insights.
Agentic Workflow & System Prompts
The system relies on a set of specialized prompts to handle intent routing, query refinement, and multimodal reasoning.
Intent Routing
Before performing a search, the system classifies the user's input. This prevents unnecessary resource usage for general conversational queries.
- Logic: The agent determines if the query requires factual information from the video library or is a general greeting/clarification.
- Stages: `thinking` → `classifying`.
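A routing node based on this classification can be sketched as follows. The helper names and the keyword heuristic (standing in for an actual LLM classifier call) are illustrative assumptions, not the project's code:

```python
# Illustrative sketch of intent routing. A real node would call the LLM;
# here a simple keyword heuristic stands in for the classifier.

SEARCH_HINTS = ("when", "where", "show", "find", "what", "who", "timestamp")

def classify_intent(user_query: str) -> str:
    """Return 'search' for factual video queries, 'chat' for small talk."""
    q = user_query.lower()
    if any(hint in q for hint in SEARCH_HINTS):
        return "search"
    return "chat"

def route(state: dict) -> str:
    # LangGraph conditional edges dispatch on the returned label.
    return classify_intent(state["user_query"])
```

Only queries routed to `search` proceed to retrieval; everything else is answered conversationally without touching the vector store.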
Query Rewriting
To improve retrieval accuracy from the vector store, the agent rewrites natural language questions into optimized search terms.
- Logic: Converts "Where did he talk about the budget?" into a search-optimized query like "speaker discussing financial budget, revenue, and expenses."
- Stage: `rewriting`.
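A minimal sketch of this rewriting step is shown below; the prompt wording and the keyword-stripping fallback (used so the example runs without a model server) are assumptions:

```python
# Hedged sketch of query rewriting. The prompt text is illustrative;
# the fallback branch lets the example run without an LLM backend.

REWRITE_PROMPT = (
    "Rewrite the user's question as a dense semantic-search query. "
    "Drop filler words, expand pronouns, and add likely synonyms.\n"
    "Question: {question}\nSearch query:"
)

def rewrite_query(question: str, llm=None) -> str:
    prompt = REWRITE_PROMPT.format(question=question)
    if llm is None:
        # Fallback stub: strip filler words so the sketch runs standalone.
        filler = {"where", "did", "he", "she", "the", "about", "talk"}
        words = question.lower().strip("?").split()
        return " ".join(w for w in words if w not in filler)
    return llm(prompt)
```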
Response Generation (Multimodal RAG)
The core prompt instructs the LLM to synthesize visual descriptions and audio transcripts into a cohesive answer with mandatory timestamping.
- Instructions:
  - Always cite the `start` and `end` timestamps for every claim.
  - Distinguish between visual evidence (e.g., "An elephant is seen...") and audio evidence (e.g., "The speaker mentions...").
  - If no relevant content is found in the search results, explicitly state that the information is not present in the ingested videos.
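These instructions could be packed into a prompt template like the hedged sketch below; the exact wording and the chunk field `modality` are assumptions, not the project's actual prompt:

```python
# Illustrative RAG answer prompt; the project's real wording may differ.

RAG_PROMPT = """You are a video analysis assistant.
Answer ONLY from the retrieved context below.
Rules:
- Cite start and end timestamps for every claim, e.g. [00:12-00:45].
- Mark visual evidence as what is seen, audio evidence as what is said.
- If the context does not contain the answer, say the information is not
  present in the ingested videos.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(chunks: list[dict], question: str) -> str:
    # Each chunk is rendered as "[start-end] (modality) content".
    context = "\n".join(
        f"[{c['timestamp_start']}-{c['timestamp_end']}] ({c['modality']}) {c['content']}"
        for c in chunks
    )
    return RAG_PROMPT.format(context=context, question=question)
```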
Tool Definitions
The AI agent interacts with the backend through a set of functional tools. These are internal interfaces that the LangGraph nodes call to perform actions.
1. Multimodal Search Tool
The primary tool for retrieving context. It performs a semantic search across both audio transcripts (Whisper) and visual scene descriptions (LLaVA/Llama 3.2 Vision).
Input Parameters (SearchRequest):
| Parameter | Type | Description |
| :--- | :--- | :--- |
| query | string | The search-optimized string. |
| top_k | int | Number of relevant chunks to retrieve (default: 5). |
| video_ids | list[str] | Optional: Restrict search to specific videos. |
| include_visual | bool | Whether to query the visual embedding index. |
| include_audio | bool | Whether to query the transcript embedding index. |
Output (SearchResult):
Returns a list of matching chunks, each containing `content`, `timestamp_start`, `timestamp_end`, and a relevance `score`.
2. PII Detection & Anonymization
Powered by Microsoft Presidio, this tool identifies and masks Personally Identifiable Information (PII) before queries are processed by external LLM providers (e.g., OpenRouter).
- Supported Entities: Names, Email Addresses, Phone Numbers, IP Addresses.
- Action: Replaces sensitive data with placeholders (e.g., `[PERSON_01]`).
- Stage: `pii_anonymized`.
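Detection itself is delegated to Presidio; the numbered-placeholder substitution it drives can be sketched as below. The `mask_pii` helper and its span format are assumptions for illustration, not Presidio's API:

```python
from collections import defaultdict

def mask_pii(text: str, spans: list[tuple[int, int, str]]) -> tuple[str, dict]:
    """Replace detected spans with numbered placeholders like [PERSON_01].

    `spans` are (start, end, entity_type) triples such as an analyzer
    would yield; detection is out of scope for this sketch. The returned
    mapping allows de-anonymizing the LLM's response later.
    """
    counters = defaultdict(int)
    mapping = {}
    out, cursor = [], 0
    for start, end, entity in sorted(spans):
        counters[entity] += 1
        placeholder = f"[{entity}_{counters[entity]:02d}]"
        mapping[placeholder] = text[start:end]
        out.append(text[cursor:start])
        out.append(placeholder)
        cursor = end
    out.append(text[cursor:])
    return "".join(out), mapping
```

Keeping the placeholder-to-value mapping server-side means the original names never leave the machine, while the agent's answer can still be rehydrated before display.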
3. Video Metadata Lookup
Provides the agent with technical details about a specific video to ensure it can reference titles and source URLs correctly in its final answer.
Output (VideoInfo):
| Field | Description |
| :--- | :--- |
| video_id | Unique identifier. |
| title | The title of the video. |
| source_url | Original YouTube link. |
| duration_seconds | Total length of the video, in seconds. |
Observability & State Management
The agent communicates its progress via WebSockets using `ChatEvent` objects, which allows the frontend to display the "Thinking" process in real time.
Example `ChatEvent` for a searching operation:

```json
{
  "event": "status",
  "data": {
    "stage": "searching",
    "query": "elephants in the savanna",
    "intent": "search"
  },
  "timestamp": "2023-10-27T10:00:00Z"
}
```
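Serializing such an event could look like the sketch below; the `make_chat_event` helper is an assumption, and only the field layout follows the example above:

```python
import json
from datetime import datetime, timezone

# Sketch of ChatEvent serialization for the WebSocket channel.
# Field names mirror the JSON example; the helper itself is hypothetical.

def make_chat_event(stage: str, **data) -> str:
    event = {
        "event": "status",
        "data": {"stage": stage, **data},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)
```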
Response Schema
When the agent completes its reasoning, it returns a structured `ChatResponse`:

```json
{
  "answer": "Elephants appear at [00:04]. The speaker mentions their migration patterns at [00:12-00:45].",
  "sources": [
    {
      "video_id": "vid_123",
      "timestamp_seconds": 4,
      "visual_context": "A herd of elephants walking through tall grass."
    }
  ],
  "query": "When do elephants appear?"
}
```
Configuration
Prompts and tool behaviors can be influenced by the `LLM_PROVIDER` environment variable.
- Ollama: Uses local system prompts optimized for Llama 3/Mistral.
- OpenRouter: Uses cloud-optimized prompts with higher reasoning capabilities (e.g., Claude 3.5 Sonnet or GPT-4o).
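Provider-dependent prompt selection could be wired up as in this sketch; the registry contents and the `ollama` fallback default are assumptions, not the project's actual values:

```python
import os

# Hypothetical prompt registry keyed by provider; real prompt text and
# defaults would live in the project's configuration.

PROMPTS = {
    "ollama": "local-llama-system-prompt",
    "openrouter": "cloud-claude-system-prompt",
}

def get_system_prompt() -> str:
    provider = os.environ.get("LLM_PROVIDER", "ollama").lower()
    try:
        return PROMPTS[provider]
    except KeyError:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
```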