REST API Reference

The Multimodal Video RAG API follows RESTful principles and returns JSON-encoded responses. The base URL for all v1 endpoints is /api/v1.

Video Management

Endpoints for ingesting, listing, and managing video content.

Ingest a Video

POST /videos/ingest

Starts an asynchronous pipeline to download, transcribe, and index a video from a URL.

Request Body

| Field | Type | Required | Description |
| :--- | :--- | :--- | :--- |
| url | string | Yes | The YouTube URL of the video to ingest. |

Response (VideoIngestResponse)

{
  "job_id": "job_01HXYZ123ABC",
  "video_id": "vid_01HXYZ123ABC",
  "status": "pending",
  "message": "Ingestion job created"
}
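As an illustration of the request described above, here is a minimal Python sketch that builds and sends the ingestion call using only the standard library. The host and port in `BASE_URL`, the absence of authentication headers, and the assumption that endpoint paths are relative to /api/v1 are not stated in this reference. The helper names `build_ingest_request` and `ingest_video` are hypothetical.

```python
import json
import urllib.request

# Assumed deployment address; only the /api/v1 suffix comes from this reference.
BASE_URL = "http://localhost:8000/api/v1"


def build_ingest_request(video_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /videos/ingest request."""
    body = json.dumps({"url": video_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/videos/ingest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ingest_video(video_url: str, base_url: str = BASE_URL) -> dict:
    """Send the request and return the decoded VideoIngestResponse."""
    request = build_ingest_request(video_url, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server. The returned `job_id` is what a client would use to follow progress on the job.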

List Videos

GET /videos

Returns a list of all videos currently in the knowledge base.

Response (VideoListResponse)

{
  "videos": [
    {
      "video_id": "vid_01HXYZ123ABC",
      "title": "Introduction to Multimodal AI",
      "duration_seconds": 360,
      "source_url": "https://www.youtube.com/watch?v=...",
      "thumbnail_url": "https://...",
      "ingested_at": "2023-10-27T10:00:00Z",
      "chunk_count": 42,
      "searchable": true
    }
  ],
  "total": 1
}
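A client will often want typed records rather than raw dictionaries. The sketch below parses a VideoListResponse shaped like the example above and selects the videos that are already searchable. The `Video` dataclass and both function names are hypothetical, and only a subset of the response fields is mapped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Video:
    video_id: str
    title: str
    duration_seconds: int
    chunk_count: int
    searchable: bool


def parse_video_list(payload: dict) -> list[Video]:
    """Convert a decoded VideoListResponse into Video records."""
    return [
        Video(
            video_id=item["video_id"],
            title=item["title"],
            duration_seconds=item["duration_seconds"],
            chunk_count=item["chunk_count"],
            searchable=item["searchable"],
        )
        for item in payload["videos"]
    ]


def searchable_ids(payload: dict) -> list[str]:
    """Return the IDs of videos whose 'searchable' flag is true."""
    return [video.video_id for video in parse_video_list(payload) if video.searchable]
```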

Delete Video

DELETE /videos/{video_id}

Removes a video and all its associated embeddings (visual and audio) from the vector store.
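A deletion request could be constructed as in the following sketch. The `BASE_URL` value is an assumed deployment address, `build_delete_request` is a hypothetical helper, and the response body of this endpoint is not documented here, so the sketch stops at building the request.

```python
import urllib.parse
import urllib.request

# Assumed deployment address; only the /api/v1 suffix comes from this reference.
BASE_URL = "http://localhost:8000/api/v1"


def build_delete_request(video_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a DELETE /videos/{video_id} request."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    safe_id = urllib.parse.quote(video_id, safe="")
    return urllib.request.Request(f"{base_url}/videos/{safe_id}", method="DELETE")
```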

Search

Endpoints for semantic retrieval across the video corpus.

Semantic Search

POST /search

Searches for specific moments within videos using natural language queries.

Response (SearchResponse)

{
  "query": "elephants",
  "results": [
    {
      "video_id": "vid_123",
      "chunk_id": "chk_456",
      "chunk_type": "visual",
      "timestamp_start": 45.5,
      "timestamp_end": 50.0,
      "content": "A large elephant walks across the savanna.",
      "score": 0.89,
      "preview_url": "https://..."
    }
  ],
  "total_results": 1,
  "search_time_ms": 120
}
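The following sketch shows one way to post-process a SearchResponse shaped like the example above: filter hits by score and chunk type, then print a one-line summary per moment. It assumes that a higher `score` means a better match and that the timestamps are in seconds, neither of which is stated explicitly in this reference. The function names are hypothetical.

```python
from typing import Optional


def top_moments(
    response: dict,
    min_score: float = 0.5,
    chunk_type: Optional[str] = None,
) -> list[dict]:
    """Return hits at or above min_score, best first, optionally limited to one chunk_type."""
    hits = [
        hit
        for hit in response["results"]
        if hit["score"] >= min_score
        and (chunk_type is None or hit["chunk_type"] == chunk_type)
    ]
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)


def describe(hit: dict) -> str:
    """Format a single hit as 'video [start-end] content'."""
    return (
        f'{hit["video_id"]} '
        f'[{hit["timestamp_start"]:.1f}s-{hit["timestamp_end"]:.1f}s] '
        f'{hit["content"]}'
    )
```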

Chat Assistant

Endpoints for conversational interaction with the video library.

Send Chat Message

POST /chat

Sends a message to an AI agent that reasons across your video content. The endpoint uses a LangGraph-powered router to decide whether to search the corpus or to answer from the previous conversation context.

Response (ChatResponse)

{
  "answer": "The speaker mentions the project timeline starting at [0:45].",
  "sources": [
    {
      "video_id": "vid_123",
      "timestamp": "0:45",
      "timestamp_seconds": 45,
      "transcript": "We are planning to launch the project in Q4.",
      "visual_context": "Speaker pointing to a slide titled 'Timeline'."
    }
  ],
  "query": "When does the project start?"
}
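A client might turn the `sources` array into readable citations. The sketch below does this for a ChatResponse shaped like the example above. The minutes:seconds rendering mirrors the "0:45" value shown in the example; how the API formats timestamps of an hour or longer is not documented, so the helper simply keeps counting minutes. Both function names are hypothetical.

```python
def format_timestamp(seconds: int) -> str:
    """Render a second offset as m:ss (for example 45 becomes '0:45')."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def format_citations(response: dict) -> list[str]:
    """Build one citation line per source in a ChatResponse."""
    return [
        f'{source["video_id"]} @ {format_timestamp(source["timestamp_seconds"])}: '
        f'{source["transcript"]}'
        for source in response["sources"]
    ]
```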

Real-time Updates (WebSockets)

For long-running tasks like ingestion, the system provides real-time progress updates via WebSockets.

Ingestion Job Progress

WS /ws/jobs/{job_id}

Message Payload (ProgressEvent)

{
  "job_id": "job_123",
  "video_id": "vid_123",
  "stage": "transcribing",
  "stage_number": 3,
  "total_stages": 5,
  "progress_percent": 60,
  "message": "Generating audio transcript...",
  "timestamp": "2023-10-27T10:05:00Z"
}
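Each WebSocket message can be decoded into a typed record before it is displayed. The sketch below parses a ProgressEvent shaped like the payload above and renders a one-line status. It covers only message handling: opening the socket needs a WebSocket client library, which is outside this sketch. The class layout and the `render` helper are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgressEvent:
    job_id: str
    video_id: str
    stage: str
    stage_number: int
    total_stages: int
    progress_percent: int
    message: str
    timestamp: str

    @classmethod
    def from_json(cls, raw: str) -> "ProgressEvent":
        """Decode one WebSocket text frame, ignoring any unknown fields."""
        data = json.loads(raw)
        return cls(**{name: data[name] for name in cls.__dataclass_fields__})


def render(event: ProgressEvent) -> str:
    """Format an event as '[stage/total] stage percent% - message'."""
    return (
        f"[{event.stage_number}/{event.total_stages}] "
        f"{event.stage} {event.progress_percent}% - {event.message}"
    )
```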

Data Models

JobStatus (Enum)

The status of a background ingestion job can be one of the following:

pending: Job created and waiting in the queue.
downloading: Fetching video file from source.
extracting: Splitting audio and frames.
transcribing: Running Whisper ASR.
segmenting: Determining scene boundaries.
describing: Running Vision LLM on frames.
embedding: Generating vectors for content.
indexing: Saving to ChromaDB.
completed: Video is searchable.
failed: Job stopped due to an error.
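On the client side, the statuses listed above could be mirrored as a Python enum, as in the sketch below. Treating `completed` and `failed` as the only terminal states is an inference from their descriptions, not something this reference states. `TERMINAL_STATUSES` and `is_terminal` are hypothetical names.

```python
from enum import Enum


class JobStatus(str, Enum):
    """Client-side mirror of the JobStatus values, in the order listed above."""

    PENDING = "pending"
    DOWNLOADING = "downloading"
    EXTRACTING = "extracting"
    TRANSCRIBING = "transcribing"
    SEGMENTING = "segmenting"
    DESCRIBING = "describing"
    EMBEDDING = "embedding"
    INDEXING = "indexing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumption: a job makes no further progress once it reaches one of these.
TERMINAL_STATUSES = frozenset({JobStatus.COMPLETED, JobStatus.FAILED})


def is_terminal(status: str) -> bool:
    """Return True when the status means the job has finished; raises ValueError if unknown."""
    return JobStatus(status) in TERMINAL_STATUSES
```

A polling or WebSocket client could call `is_terminal` on each reported status to decide when to stop listening.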