# Ingestion Orchestrator

## Overview
The Ingestion Orchestrator is a multi-stage pipeline responsible for transforming raw video content into a searchable, multimodal knowledge base. It coordinates the execution of various AI models—including Whisper for speech-to-text and Vision LLMs for image description—before indexing the results into a vector database.
Processing is handled asynchronously to accommodate the high compute requirements of video analysis.
## The Ingestion Workflow
When a video URL is submitted, the orchestrator moves through the following stages:
- Download & Extraction: The video is fetched from the source (e.g., YouTube) and split into its constituent parts: a high-quality audio stream and a sequence of keyframes.
- Transcription (ASR): The audio is processed using `Faster-Whisper` to generate a timestamped transcript.
- Visual Description: Selected video segments are analyzed by a Vision LLM (such as LLaVA or Llama 3.2 Vision). The model generates natural language descriptions of the visual events occurring in each segment.
- Embedding & Indexing: Both the text transcripts and visual descriptions are converted into vector embeddings and stored in ChromaDB, making the content searchable via semantic queries.
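The staged flow above can be sketched as a simple ordered loop. This is an illustrative sketch, not the orchestrator's actual implementation: `run_pipeline` and the handler map are hypothetical, and the stage names mirror the lifecycle states listed under "Job Statuses" below.

```python
# Hypothetical sketch of the stage sequence the orchestrator walks through.
# The real pipeline runs asynchronously; this linear version only shows ordering.

STAGES = [
    "downloading",
    "extracting",
    "transcribing",
    "segmenting",
    "describing",
    "embedding",
    "indexing",
]

def run_pipeline(url: str, handlers: dict) -> list[tuple[str, int]]:
    """Run each stage in order, recording approximate progress after each one."""
    events = []
    for i, stage in enumerate(STAGES, start=1):
        handlers.get(stage, lambda u: None)(url)  # do this stage's work (no-op if absent)
        percent = round(100 * i / len(STAGES))
        events.append((stage, percent))
    return events
```

Each tuple emitted here corresponds roughly to one progress update a client would observe while polling or listening on the WebSocket.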
## API Usage

### Start an Ingestion Job
To begin processing a video, send a POST request to the ingestion endpoint.
Endpoint: `POST /api/v1/videos/ingest`

Request Body (`VideoIngestRequest`):

```json
{
  "url": "https://www.youtube.com/watch?v=example_id"
}
```
Response (`VideoIngestResponse`):

The API returns a `job_id`, which you must use to track the progress of the long-running task.

```json
{
  "job_id": "job_01HXYZ123ABC",
  "video_id": "vid_01HXYZ123ABC",
  "status": "pending",
  "message": "Ingestion job created"
}
```
## Monitoring Progress

Because ingestion involves heavy GPU inference, it is important to monitor the job status.

### Status Polling
You can retrieve the current state of a job at any time.
Endpoint: `GET /api/v1/videos/jobs/{job_id}`

Response (`JobStatusResponse`):

```json
{
  "job_id": "job_01HXYZ123ABC",
  "video_id": "vid_01HXYZ123ABC",
  "status": "transcribing",
  "progress_percent": 45,
  "current_stage": "transcribing",
  "message": "Processing audio with Whisper...",
  "started_at": "2023-10-27T10:00:00Z"
}
```
### Real-time Updates via WebSockets

For a more responsive user experience, the system provides a WebSocket endpoint that streams `ProgressEvent` packets as the pipeline moves through its stages.

Endpoint: `ws://[host]/api/v1/ws/jobs/{job_id}`
Event Schema:
| Field | Type | Description |
| :--- | :--- | :--- |
| stage | string | Identifier of the current pipeline stage (e.g., "extracting", "describing"). |
| progress_percent | int | Completion percentage (0-100). |
| message | string | Human-readable log of the current activity. |
| timestamp | iso8601 | When the event occurred. |
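Incoming packets can be deserialized into a typed structure following the schema above. This is a sketch: the class name matches the doc's `ProgressEvent`, but its shape here is inferred from the table, and the sample values in the usage note are invented.

```python
import json
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ProgressEvent:
    """Typed view of one WebSocket progress packet, per the event schema."""
    stage: str
    progress_percent: int
    message: str
    timestamp: datetime

def parse_event(raw: str) -> ProgressEvent:
    """Decode a JSON ProgressEvent frame received over the WebSocket."""
    data = json.loads(raw)
    return ProgressEvent(
        stage=data["stage"],
        progress_percent=int(data["progress_percent"]),
        message=data["message"],
        # ISO 8601; normalize a trailing "Z" for datetime.fromisoformat
        timestamp=datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00")),
    )
```

For example, a hypothetical frame `{"stage": "describing", "progress_percent": 60, ...}` parses into a `ProgressEvent` whose `stage` is `"describing"`.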
## Job Statuses
The orchestrator tracks jobs using the following lifecycle states:
| Status | Description |
| :--- | :--- |
| PENDING | Job is queued and waiting for resources. |
| DOWNLOADING | Fetching the video from the remote source. |
| EXTRACTING | Splitting the video into audio and frame sequences. |
| TRANSCRIBING | Running Speech-to-Text inference. |
| SEGMENTING | Partitioning the video for visual analysis. |
| DESCRIBING | Running Vision LLM inference on video frames. |
| EMBEDDING | Generating vector representations of extracted data. |
| INDEXING | Writing embeddings to the vector store (ChromaDB). |
| COMPLETED | Video is fully indexed and ready for search. |
| FAILED | An error occurred during processing. See message for details. |
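Clients often want these states as an enum with a terminal-state check. A minimal sketch, assuming the wire values are the lowercase forms seen in the JSON examples above (`"pending"`, `"transcribing"`, ...):

```python
from enum import Enum

class JobStatus(str, Enum):
    """Lifecycle states reported by the orchestrator (assumed lowercase on the wire)."""
    PENDING = "pending"
    DOWNLOADING = "downloading"
    EXTRACTING = "extracting"
    TRANSCRIBING = "transcribing"
    SEGMENTING = "segmenting"
    DESCRIBING = "describing"
    EMBEDDING = "embedding"
    INDEXING = "indexing"
    COMPLETED = "completed"
    FAILED = "failed"

    @property
    def is_terminal(self) -> bool:
        """True once the job will receive no further updates."""
        return self in (JobStatus.COMPLETED, JobStatus.FAILED)
```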
## Configuration & Scaling

The orchestrator's performance is largely determined by the `LLM_PROVIDER` setting:

- Local (Ollama): Uses local GPU resources. Recommended for privacy or offline use. Ensure the `llava` or `llama3.2-vision` models are pulled before starting.
- Cloud (OpenRouter): Offloads visual and text reasoning to cloud APIs. This is often faster for users without a high-end local GPU.
To switch providers, update your `.env` file:

```env
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=your_key_here
```
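At startup, the provider choice can be resolved from the environment. This helper is a sketch, not the orchestrator's actual config loader; the default of `"ollama"` when `LLM_PROVIDER` is unset is an assumption.

```python
import os

def resolve_provider() -> str:
    """Read LLM_PROVIDER from the environment, assuming 'ollama' as the default."""
    provider = os.environ.get("LLM_PROVIDER", "ollama").lower()
    if provider == "openrouter" and not os.environ.get("OPENROUTER_API_KEY"):
        # Fail fast: cloud inference cannot work without credentials.
        raise RuntimeError("OPENROUTER_API_KEY must be set when LLM_PROVIDER=openrouter")
    return provider
```

Failing fast on a missing `OPENROUTER_API_KEY` surfaces misconfiguration at startup rather than mid-pipeline, after GPU-heavy stages have already run.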