# LangGraph Agent Workflow
The "Chat Assistant" in this system is powered by a LangGraph state machine. Unlike a simple linear chain, this agentic workflow enables the system to reason about user queries, decide whether to search the video corpus, and maintain context across conversational turns.
## Agentic Reasoning Loop
The agent follows a multi-stage graph to process queries. This ensures that the system doesn't just "search and find" but actually understands the intent behind the user's request.
- **Intent Classification:** The agent analyzes the input to determine if the user is asking a question about videos, requesting a general conversation, or following up on a previous response.
- **Query Rewriting & Anonymization:**
  - **PII Detection:** Before processing, sensitive information is identified and anonymized via Microsoft Presidio.
  - **Optimization:** If a search is required, the agent rewrites the user's natural language into a high-signal search query optimized for the vector database.
- **Multimodal Retrieval:** The agent executes a search across two distinct vector indices:
  - **Audio Index:** for spoken content (transcribed via Faster-Whisper).
  - **Visual Index:** for scene descriptions (generated via LLaVA/Llama 3.2 Vision).
- **Contextual Synthesis:** The agent receives the top $K$ relevant chunks, including timestamps and visual context, and generates a final response that cites specific moments in the video.
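The loop above can be sketched as a plain TypeScript state machine. The node functions here are stubs standing in for the real LLM, Presidio, and vector-search calls; all names and return values are illustrative, not the actual implementation.

```typescript
// Minimal sketch of the agent's reasoning loop. Each function mirrors one
// stage from the list above; in the real system these are LangGraph nodes.

type AgentState = {
  query: string;
  intent?: "video_question" | "chitchat";
  rewrittenQuery?: string;
  chunks?: { videoId: string; timestamp: number; text: string }[];
  answer?: string;
};

function classifyIntent(state: AgentState): AgentState {
  // Stub: a real node would call an LLM classifier.
  const intent = /video|scene|chef|clip/i.test(state.query)
    ? ("video_question" as const)
    : ("chitchat" as const);
  return { ...state, intent };
}

function rewriteQuery(state: AgentState): AgentState {
  // Stub: PII anonymization (Presidio) and query optimization happen here.
  return { ...state, rewrittenQuery: state.query.toLowerCase().trim() };
}

function retrieve(state: AgentState): AgentState {
  // Stub: a real node queries both the audio and visual vector indices.
  return {
    ...state,
    chunks: [{ videoId: "vid_demo", timestamp: 42.0, text: "stub chunk" }],
  };
}

function synthesize(state: AgentState): AgentState {
  // Stub: a real node prompts an LLM with the retrieved chunks.
  return {
    ...state,
    answer: `Found ${state.chunks?.length ?? 0} relevant moment(s).`,
  };
}

function runAgent(query: string): AgentState {
  let state: AgentState = { query };
  state = classifyIntent(state);
  if (state.intent !== "video_question") {
    return { ...state, answer: "General conversation, no search needed." };
  }
  state = rewriteQuery(state);
  state = retrieve(state);
  return synthesize(state);
}
```

The conditional branch after classification is what makes this agentic rather than a fixed chain: a chitchat query skips retrieval entirely.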
## Interaction API
The agent is accessible via the `/api/v1/chat` endpoint. It supports both standard JSON responses and real-time event streaming via WebSockets to provide "thinking" updates to the UI.
### Request Schema
```typescript
interface ChatRequest {
  query: string;        // User's question
  video_ids?: string[]; // Optional: filter search to specific videos
  session_id?: string;  // For maintaining conversation history
}
```
### Response Schema
The agent returns a structured response containing the answer and the specific evidence (sources) used to generate it.
```typescript
interface ChatResponse {
  answer: string;
  sources: ChatSource[];
  query: string;
}

interface ChatSource {
  video_id: string;
  timestamp_seconds: number; // Exact start time of the reference
  transcript?: string;       // Text from audio (if audio source)
  visual_context?: string;   // Description of scene (if visual source)
}
```
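As a sketch of how a client might consume this schema, the helper below formats a cited source for display. The interfaces mirror the schema above; `formatSource` itself is a hypothetical helper, not part of the API.

```typescript
// Format one cited source as "[video_id @ m:ss] context" for the UI.
// formatSource is an illustrative helper, not part of the actual API.

interface ChatSource {
  video_id: string;
  timestamp_seconds: number;
  transcript?: string;
  visual_context?: string;
}

function formatSource(s: ChatSource): string {
  const mm = Math.floor(s.timestamp_seconds / 60);
  const ss = Math.floor(s.timestamp_seconds % 60)
    .toString()
    .padStart(2, "0");
  // Prefer spoken content; fall back to the visual scene description.
  const context = s.transcript ?? s.visual_context ?? "(no context)";
  return `[${s.video_id} @ ${mm}:${ss}] ${context}`;
}
```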
## Real-Time State Monitoring

Because the workflow involves multiple steps (rewriting, searching, reasoning), the system emits `ChatEvent` updates. These are used by the frontend to display the agent's current "thought process."
| Event Stage | Description |
| :--- | :--- |
| `thinking` | Initializing the state machine. |
| `classifying` | Determining whether a vector search is necessary. |
| `pii_anonymized` | Privacy filter has processed the input. |
| `searching` | Querying the multimodal vector store. |
| `generating` | Synthesizing findings into a natural-language answer. |
| `complete` | Final payload is ready. |
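A frontend might map these stages to status text like so. The stage names come from the table above; the labels and the commented-out WebSocket wiring are assumptions about the UI, not part of the API.

```typescript
// Map each ChatEvent stage to a human-readable status label.
// Labels are illustrative; only the stage names come from the API.

type ChatEventStage =
  | "thinking"
  | "classifying"
  | "pii_anonymized"
  | "searching"
  | "generating"
  | "complete";

const STAGE_LABELS: Record<ChatEventStage, string> = {
  thinking: "Initializing the state machine...",
  classifying: "Deciding whether to search...",
  pii_anonymized: "Privacy filter applied",
  searching: "Querying the vector store...",
  generating: "Writing the answer...",
  complete: "Done",
};

function stageLabel(stage: ChatEventStage): string {
  return STAGE_LABELS[stage];
}

// In the browser, events would arrive over the WebSocket (sketch only):
// ws.onmessage = (msg) => {
//   const event = JSON.parse(msg.data) as { stage: ChatEventStage };
//   statusElement.textContent = stageLabel(event.stage);
// };
```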
## Example Usage
To interact with the agent via `curl`:

```bash
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What happens after the chef adds the salt?",
    "video_ids": ["vid_01HXYZ123ABC"]
  }'
```
The resulting answer will not only describe the actions but will also include `timestamp_seconds` values for each source, allowing the UI to seek the video player to the exact moment the event occurs.
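A minimal sketch of that seek behavior, assuming the response schema above; picking the earliest cited timestamp is one reasonable policy, and the `videoElement` wiring (commented out) is a UI assumption.

```typescript
// Find the earliest cited moment so the player can jump straight to it.
// ChatSource mirrors the response schema; the seek policy is illustrative.

interface ChatSource {
  video_id: string;
  timestamp_seconds: number;
  transcript?: string;
  visual_context?: string;
}

function earliestTimestamp(sources: ChatSource[]): number | null {
  if (sources.length === 0) return null;
  return Math.min(...sources.map((s) => s.timestamp_seconds));
}

// In the browser (sketch only):
// const t = earliestTimestamp(response.sources);
// if (t !== null) videoElement.currentTime = t;
```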