UI Architecture
The Multimodal Video RAG frontend is built with Next.js 15 using the App Router architecture. It provides a specialized interface for video ingestion, real-time processing feedback, and an AI-driven search and chat experience.
Project Structure
The frontend follows a modular directory structure under src/, separating route logic from reusable UI components and API clients.
src/
├── app/            # Next.js App Router (Pages & Layouts)
│   ├── chat/       # Conversational AI interface
│   ├── ingest/     # Video submission and processing status
│   ├── search/     # Semantic search discovery
│   └── videos/     # Video library management
├── components/     # Reusable UI components
│   ├── chat/       # Chat-specific logic
│   ├── common/     # Generic components (Modals, Tables, Players)
│   └── search/     # Search-specific components
└── lib/            # Core logic and utilities
    ├── api/        # REST API client wrappers
    ├── hooks/      # Custom React hooks (e.g., useWebSocket)
    └── types.ts    # Shared TypeScript interfaces
Core Routing & Pages
The application is divided into four main functional areas:
1. Video Ingestion (/ingest)
Handles the workflow of adding new content to the vector store.
- Submission: A form-based interface for entering YouTube URLs.
- Job Tracking (/ingest/jobs/[jobId]): A real-time monitoring page that connects via WebSockets to provide live updates on the multi-stage ingestion process (downloading, transcribing, describing, etc.).
2. Search Discovery (/search)
A specialized interface for semantic retrieval. Users can query the multimodal index using natural language. The results display timestamped segments with visual or audio context, allowing users to jump directly to the relevant moment in the video.
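The exact result schema lives in src/lib/types.ts; the sketch below shows what a timestamped segment and its display formatting might look like. Field names (video_id, timestamp_start, etc.) are illustrative, not the actual API contract.

```typescript
// Hypothetical shape of a single search hit returned by the backend.
interface SearchSegment {
  video_id: string;
  timestamp_start: number; // seconds into the video
  timestamp_end: number;
  modality: "visual" | "audio";
  text: string; // transcript excerpt or frame description
}

// Render a segment's start time as m:ss (or h:mm:ss) on the result card.
export function formatTimestamp(seconds: number): string {
  const s = Math.floor(seconds % 60);
  const m = Math.floor(seconds / 60) % 60;
  const h = Math.floor(seconds / 3600);
  const pad = (n: number) => n.toString().padStart(2, "0");
  return h > 0 ? `${h}:${pad(m)}:${pad(s)}` : `${m}:${pad(s)}`;
}
```

Clicking a result card would then pass timestamp_start to the player wrapper described under Component Hierarchy.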
3. Chat Assistant (/chat)
An agentic chat interface powered by LangGraph.
- Contextual Reasoning: The agent can reference specific video segments to answer complex questions.
- Streaming Responses: Utilizes progress events to show the agent's internal "thought process" (e.g., "Classifying intent," "Searching video corpus").
4. Video Management (/videos)
The administrative view of the system.
- Video Library: A tabular view of all ingested videos, their processing status, and metadata.
- Detail View (/videos/[videoId]): Provides technical details about a specific video, including duration, chunk count, and source information.
Communication Patterns
The UI utilizes three primary methods to communicate with the FastAPI backend:
REST API
Standard data fetching and mutations are handled via fetch or TanStack Query.
// Example: Initiating search
const results = await searchVideos({
  query: "When do the elephants appear?",
  top_k: 10,
  include_visual: true
});
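When TanStack Query is used, a query-key factory is the common convention for keeping cache keys consistent across lists and detail views. The keys below are a sketch, not the app's actual key scheme:

```typescript
// Query-key factory: every key for video data is derived from one root,
// so invalidating videoKeys.all refreshes lists and details together.
// Key names are illustrative.
export const videoKeys = {
  all: ["videos"] as const,
  lists: () => [...videoKeys.all, "list"] as const,
  detail: (videoId: string) => [...videoKeys.all, "detail", videoId] as const,
};
```

A component would then call useQuery({ queryKey: videoKeys.detail(videoId), queryFn: ... }), and a delete mutation can invalidate videoKeys.all to refetch both the library table and any open detail page.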
WebSockets
Used for long-running asynchronous tasks where immediate feedback is critical. The useWebSocket hook manages the connection to the ingestion worker.
// src/lib/hooks/useWebSocket.ts usage in the Ingest Job page
const { lastMessage } = useWebSocket<ProgressEvent>(`/api/v1/ws/jobs/${jobId}`);

// Message structure used to update the UI Stepper
interface ProgressEvent {
  stage: 'downloading' | 'transcribing' | 'indexing';
  progress_percent: number;
  message: string;
}
Real-time Agent Events
The Chat interface consumes a stream of ChatEvent objects. This allows the UI to update dynamically as the AI performs different stages of reasoning (PII detection, query rewriting, and final generation).
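The real ChatEvent union is defined in src/lib/types.ts; the sketch below shows one plausible shape and a pure reducer that folds the stream into renderable state. Event names and fields here are assumptions for illustration.

```typescript
// Hypothetical discriminated union for the agent event stream.
type ChatEvent =
  | { type: "progress"; message: string } // e.g. "Classifying intent"
  | { type: "token"; delta: string }      // streamed answer text
  | { type: "done" };

interface ChatStreamState {
  steps: string[];  // reasoning steps shown above the answer
  answer: string;   // accumulated assistant reply
  complete: boolean;
}

// Pure reducer: fold each incoming event into the chat UI state.
export function reduceChatEvent(
  state: ChatStreamState,
  event: ChatEvent
): ChatStreamState {
  switch (event.type) {
    case "progress":
      return { ...state, steps: [...state.steps, event.message] };
    case "token":
      return { ...state, answer: state.answer + event.delta };
    case "done":
      return { ...state, complete: true };
  }
}
```

Keeping the reducer pure makes the streaming UI easy to drive from either a WebSocket or an SSE reader, and trivial to unit-test.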
Component Hierarchy
Video Player Wrapper
A specialized component that wraps standard HTML5/YouTube players to support timestamp seeking. When a user clicks a search result or a chat reference, the wrapper triggers a seek to the specific timestamp_start provided by the RAG engine.
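Since timestamp_start comes from the RAG engine rather than the player itself, a small guard before seeking is a reasonable defensive step. This is a sketch; the 0.5-second end margin is an illustrative choice, not a documented value.

```typescript
// Clamp a requested seek target into the playable range so a slightly
// out-of-bounds timestamp from the retrieval layer never seeks past the
// end of the video (or before its start).
export function clampSeek(timestampStart: number, durationSeconds: number): number {
  const endMargin = 0.5; // illustrative: stop just short of the final frame
  return Math.min(
    Math.max(timestampStart, 0),
    Math.max(durationSeconds - endMargin, 0)
  );
}
```

The wrapper would call this before invoking the underlying HTML5 `currentTime` setter or the YouTube IFrame `seekTo` API.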
Stepper UI
A visual progress indicator used during ingestion. It maps the internal JobStatus (defined in types.ts) to a user-friendly multi-step timeline.
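The mapping from backend stage to step appearance can be a pure lookup against an ordered stage list. The stage names below mirror the ingestion stages mentioned above but are illustrative; the authoritative JobStatus union is in src/lib/types.ts.

```typescript
// Ordered ingestion stages driving the stepper (illustrative).
const STAGES = ["downloading", "transcribing", "describing", "indexing"] as const;
type Stage = (typeof STAGES)[number];

// Classify each rendered step relative to the job's current stage:
// earlier stages are complete, the current one is active, later ones pending.
export function stepState(
  step: Stage,
  current: Stage
): "complete" | "active" | "pending" {
  const diff = STAGES.indexOf(step) - STAGES.indexOf(current);
  return diff < 0 ? "complete" : diff === 0 ? "active" : "pending";
}
```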
Resource Table
A reusable, AWS-style data table used in the video management view. It supports sortable columns, status badges (e.g., "Ready" vs. "Processing"), and integrated action menus for deleting or viewing resources.
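Sortable columns in such a table typically reduce to a generic comparator factory. A minimal sketch, assuming cell values are primitive strings or numbers:

```typescript
// Rows with primitive cell values (the sortable subset of video metadata).
type Row = Record<string, string | number>;

// Build an Array.prototype.sort comparator for one column.
export function byColumn(key: string, dir: "asc" | "desc" = "asc") {
  return (a: Row, b: Row): number => {
    const x = a[key];
    const y = b[key];
    const cmp =
      typeof x === "number" && typeof y === "number"
        ? x - y
        : String(x).localeCompare(String(y));
    return dir === "asc" ? cmp : -cmp;
  };
}
```

The table component can then re-sort its cached rows client-side on header click, without refetching from the backend.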
State Management
- Server State: Managed via @tanstack/react-query for caching video lists and metadata.
- Form State: Standard React useState and useRef hooks for search and ingestion inputs.
- Job State: Ephemeral state managed via WebSocket events to drive the ingestion progress UI.