# LangGraph Agent Workflow
The "Chat Assistant" in this system is powered by a LangGraph state machine. Unlike a simple linear chain, this agentic workflow enables the system to reason about user queries, decide whether to search the video corpus, and maintain context across conversational turns.
## Agentic Reasoning Loop
The agent follows a multi-stage graph to process queries. This ensures that the system doesn't just "search and find" but actually understands the intent behind the user's request.
- **Intent Classification:** The agent analyzes the input to determine if the user is asking a question about videos, requesting a general conversation, or following up on a previous response.
- **Query Rewriting & Anonymization:**
  - **PII Detection:** Before processing, sensitive information is identified and anonymized via Microsoft Presidio.
  - **Optimization:** If a search is required, the agent rewrites the user's natural language into a high-signal search query optimized for the vector database.
- **Multimodal Retrieval:** The agent executes a search across two distinct vector indices:
  - **Audio Index:** for spoken content (transcribed via Faster-Whisper).
  - **Visual Index:** for scene descriptions (generated via LLaVA/Llama 3.2 Vision).
- **Contextual Synthesis:** The agent receives the top $K$ relevant chunks, including timestamps and visual context, and generates a final response that cites specific moments in the video.
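The loop above can be sketched as a plain TypeScript state machine. The node functions here are stubs standing in for the real LLM, Presidio, and vector-search calls; all names and return values are illustrative, not the actual implementation.

```typescript
// Minimal sketch of the agent's reasoning loop. Each function mirrors one
// stage from the list above; in the real system these are LangGraph nodes.

type AgentState = {
  query: string;
  intent?: "video_question" | "chitchat";
  rewrittenQuery?: string;
  chunks?: { videoId: string; timestamp: number; text: string }[];
  answer?: string;
};

function classifyIntent(state: AgentState): AgentState {
  // Stub: a real node would call an LLM classifier.
  const intent = /video|scene|chef|clip/i.test(state.query)
    ? ("video_question" as const)
    : ("chitchat" as const);
  return { ...state, intent };
}

function rewriteQuery(state: AgentState): AgentState {
  // Stub: PII anonymization (Presidio) and query optimization happen here.
  return { ...state, rewrittenQuery: state.query.toLowerCase().trim() };
}

function retrieve(state: AgentState): AgentState {
  // Stub: a real node queries both the audio and visual vector indices.
  return {
    ...state,
    chunks: [{ videoId: "vid_demo", timestamp: 42.0, text: "stub chunk" }],
  };
}

function synthesize(state: AgentState): AgentState {
  // Stub: a real node prompts an LLM with the retrieved chunks.
  return {
    ...state,
    answer: `Found ${state.chunks?.length ?? 0} relevant moment(s).`,
  };
}

function runAgent(query: string): AgentState {
  let state: AgentState = { query };
  state = classifyIntent(state);
  if (state.intent !== "video_question") {
    return { ...state, answer: "General conversation, no search needed." };
  }
  state = rewriteQuery(state);
  state = retrieve(state);
  return synthesize(state);
}
```

The conditional branch after classification is what makes this agentic rather than a fixed chain: a chitchat query skips retrieval entirely.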
## Interaction API
The agent is accessible via the `/api/v1/chat` endpoint. It supports both standard JSON responses and real-time event streaming via WebSockets to provide "thinking" updates to the UI.
### Request Schema
```typescript
interface ChatRequest {
  query: string;        // User's question
  video_ids?: string[]; // Optional: filter search to specific videos
  session_id?: string;  // For maintaining conversation history
}
```
### Response Schema
The agent returns a structured response containing the answer and the specific evidence (sources) used to generate it.
```typescript
interface ChatResponse {
  answer: string;
  sources: ChatSource[];
  query: string;
}

interface ChatSource {
  video_id: string;
  timestamp_seconds: number; // Exact start time of the reference
  transcript?: string;       // Text from audio (if audio source)
  visual_context?: string;   // Description of scene (if visual source)
}
```
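As a sketch of how a client might consume this schema, the helper below formats a cited source for display. The interfaces mirror the schema above; `formatSource` itself is a hypothetical helper, not part of the API.

```typescript
// Format one cited source as "[video_id @ m:ss] context" for the UI.
// formatSource is an illustrative helper, not part of the actual API.

interface ChatSource {
  video_id: string;
  timestamp_seconds: number;
  transcript?: string;
  visual_context?: string;
}

function formatSource(s: ChatSource): string {
  const mm = Math.floor(s.timestamp_seconds / 60);
  const ss = Math.floor(s.timestamp_seconds % 60)
    .toString()
    .padStart(2, "0");
  // Prefer spoken content; fall back to the visual scene description.
  const context = s.transcript ?? s.visual_context ?? "(no context)";
  return `[${s.video_id} @ ${mm}:${ss}] ${context}`;
}
```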
## Real-Time State Monitoring

Because the workflow involves multiple steps (rewriting, searching, reasoning), the system emits `ChatEvent` updates. These are used by the frontend to display the agent's current "thought process."
| Event Stage | Description |
| :--- | :--- |
| `thinking` | Initializing the state machine. |
| `classifying` | Determining whether a vector search is necessary. |
| `pii_anonymized` | Privacy filter has processed the input. |
| `searching` | Querying the multimodal vector store. |
| `generating` | Synthesizing findings into a natural-language answer. |
| `complete` | Final payload is ready. |
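A frontend might map these stages to status text like so. The stage names come from the table above; the labels and the commented-out WebSocket wiring are assumptions about the UI, not part of the API.

```typescript
// Map each ChatEvent stage to a human-readable status label.
// Labels are illustrative; only the stage names come from the API.

type ChatEventStage =
  | "thinking"
  | "classifying"
  | "pii_anonymized"
  | "searching"
  | "generating"
  | "complete";

const STAGE_LABELS: Record<ChatEventStage, string> = {
  thinking: "Initializing the state machine...",
  classifying: "Deciding whether to search...",
  pii_anonymized: "Privacy filter applied",
  searching: "Querying the vector store...",
  generating: "Writing the answer...",
  complete: "Done",
};

function stageLabel(stage: ChatEventStage): string {
  return STAGE_LABELS[stage];
}

// In the browser, events would arrive over the WebSocket (sketch only):
// ws.onmessage = (msg) => {
//   const event = JSON.parse(msg.data) as { stage: ChatEventStage };
//   statusElement.textContent = stageLabel(event.stage);
// };
```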
## Example Usage
To interact with the agent via `curl`:

```bash
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What happens after the chef adds the salt?",
    "video_ids": ["vid_01HXYZ123ABC"]
  }'
```
The resulting answer will not only describe the actions but will also include `timestamp_seconds` values for each source, allowing the UI to seek the video player to the exact moment the event occurs.
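A minimal sketch of that seek behavior, assuming the response schema above; picking the earliest cited timestamp is one reasonable policy, and the `videoElement` wiring (commented out) is a UI assumption.

```typescript
// Find the earliest cited moment so the player can jump straight to it.
// ChatSource mirrors the response schema; the seek policy is illustrative.

interface ChatSource {
  video_id: string;
  timestamp_seconds: number;
  transcript?: string;
  visual_context?: string;
}

function earliestTimestamp(sources: ChatSource[]): number | null {
  if (sources.length === 0) return null;
  return Math.min(...sources.map((s) => s.timestamp_seconds));
}

// In the browser (sketch only):
// const t = earliestTimestamp(response.sources);
// if (t !== null) videoElement.currentTime = t;
```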