# Ingestion Orchestrator

## Overview
The Ingestion Orchestrator is a multi-stage pipeline responsible for transforming raw video content into a searchable, multimodal knowledge base. It coordinates the execution of various AI models—including Whisper for speech-to-text and Vision LLMs for image description—before indexing the results into a vector database.
Processing is handled asynchronously to accommodate the high compute requirements of video analysis.
## The Ingestion Workflow
When a video URL is submitted, the orchestrator moves through the following stages:
- Download & Extraction: The video is fetched from the source (e.g., YouTube) and split into its constituent parts: a high-quality audio stream and a sequence of keyframes.
- Transcription (ASR): The audio is processed using `Faster-Whisper` to generate a timestamped transcript.
- Visual Description: Selected video segments are analyzed by a Vision LLM (such as LLaVA or Llama 3.2 Vision). The model generates natural language descriptions of the visual events occurring in each segment.
- Embedding & Indexing: Both the text transcripts and visual descriptions are converted into vector embeddings and stored in ChromaDB, making the content searchable via semantic queries.
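The staged flow above can be sketched as a simple ordered loop. This is an illustrative sketch, not the orchestrator's actual implementation: `run_pipeline` and the handler map are hypothetical, and the stage names mirror the lifecycle states listed under "Job Statuses" below.

```python
# Hypothetical sketch of the stage sequence the orchestrator walks through.
# The real pipeline runs asynchronously; this linear version only shows ordering.

STAGES = [
    "downloading",
    "extracting",
    "transcribing",
    "segmenting",
    "describing",
    "embedding",
    "indexing",
]

def run_pipeline(url: str, handlers: dict) -> list[tuple[str, int]]:
    """Run each stage in order, recording approximate progress after each one."""
    events = []
    for i, stage in enumerate(STAGES, start=1):
        handlers.get(stage, lambda u: None)(url)  # do this stage's work (no-op if absent)
        percent = round(100 * i / len(STAGES))
        events.append((stage, percent))
    return events
```

Each tuple emitted here corresponds roughly to one progress update a client would observe while polling or listening on the WebSocket.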
## API Usage

### Start an Ingestion Job
To begin processing a video, send a POST request to the ingestion endpoint.
Endpoint: `POST /api/v1/videos/ingest`

Request Body (`VideoIngestRequest`):

```json
{
  "url": "https://www.youtube.com/watch?v=example_id"
}
```
Response (`VideoIngestResponse`):

The API returns a `job_id`, which you must use to track the progress of the long-running task.

```json
{
  "job_id": "job_01HXYZ123ABC",
  "video_id": "vid_01HXYZ123ABC",
  "status": "pending",
  "message": "Ingestion job created"
}
```
## Monitoring Progress

Because ingestion involves heavy GPU inference, it is important to monitor the job status.

### Status Polling
You can retrieve the current state of a job at any time.
Endpoint: `GET /api/v1/videos/jobs/{job_id}`

Response (`JobStatusResponse`):

```json
{
  "job_id": "job_01HXYZ123ABC",
  "video_id": "vid_01HXYZ123ABC",
  "status": "transcribing",
  "progress_percent": 45,
  "current_stage": "transcribing",
  "message": "Processing audio with Whisper...",
  "started_at": "2023-10-27T10:00:00Z"
}
```
### Real-time Updates via WebSockets

For a more responsive user experience, the system provides a WebSocket endpoint that streams `ProgressEvent` packets as the pipeline moves through its stages.

Endpoint: `ws://[host]/api/v1/ws/jobs/{job_id}`
Event Schema:
| Field | Type | Description |
| :--- | :--- | :--- |
| stage | string | Identifier of the current pipeline stage (e.g., "extracting", "describing"). |
| progress_percent | int | Completion percentage (0-100). |
| message | string | Human-readable log of the current activity. |
| timestamp | iso8601 | When the event occurred. |
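Incoming packets can be deserialized into a typed structure following the schema above. This is a sketch: the class name matches the doc's `ProgressEvent`, but its shape here is inferred from the table, and the sample values in the usage note are invented.

```python
import json
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ProgressEvent:
    """Typed view of one WebSocket progress packet, per the event schema."""
    stage: str
    progress_percent: int
    message: str
    timestamp: datetime

def parse_event(raw: str) -> ProgressEvent:
    """Decode a JSON ProgressEvent frame received over the WebSocket."""
    data = json.loads(raw)
    return ProgressEvent(
        stage=data["stage"],
        progress_percent=int(data["progress_percent"]),
        message=data["message"],
        # ISO 8601; normalize a trailing "Z" for datetime.fromisoformat
        timestamp=datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00")),
    )
```

For example, a hypothetical frame `{"stage": "describing", "progress_percent": 60, ...}` parses into a `ProgressEvent` whose `stage` is `"describing"`.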
## Job Statuses
The orchestrator tracks jobs using the following lifecycle states:
| Status | Description |
| :--- | :--- |
| PENDING | Job is queued and waiting for resources. |
| DOWNLOADING | Fetching the video from the remote source. |
| EXTRACTING | Splitting the video into audio and frame sequences. |
| TRANSCRIBING | Running Speech-to-Text inference. |
| SEGMENTING | Partitioning the video for visual analysis. |
| DESCRIBING | Running Vision LLM inference on video frames. |
| EMBEDDING | Generating vector representations of extracted data. |
| INDEXING | Writing embeddings to the vector store (ChromaDB). |
| COMPLETED | Video is fully indexed and ready for search. |
| FAILED | An error occurred during processing. See message for details. |
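Clients often want these states as an enum with a terminal-state check. A minimal sketch, assuming the wire values are the lowercase forms seen in the JSON examples above (`"pending"`, `"transcribing"`, ...):

```python
from enum import Enum

class JobStatus(str, Enum):
    """Lifecycle states reported by the orchestrator (assumed lowercase on the wire)."""
    PENDING = "pending"
    DOWNLOADING = "downloading"
    EXTRACTING = "extracting"
    TRANSCRIBING = "transcribing"
    SEGMENTING = "segmenting"
    DESCRIBING = "describing"
    EMBEDDING = "embedding"
    INDEXING = "indexing"
    COMPLETED = "completed"
    FAILED = "failed"

    @property
    def is_terminal(self) -> bool:
        """True once the job will receive no further updates."""
        return self in (JobStatus.COMPLETED, JobStatus.FAILED)
```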
## Configuration & Scaling

The orchestrator's performance is largely determined by the `LLM_PROVIDER` setting:

- Local (Ollama): Uses local GPU resources. Recommended for privacy or offline use. Ensure the `llava` or `llama3.2-vision` models are pulled before starting.
- Cloud (OpenRouter): Offloads visual and text reasoning to cloud APIs. This is often faster for users without a high-end local GPU.
To switch providers, update your `.env` file:

```env
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=your_key_here
```
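At startup, the provider choice can be resolved from the environment. This helper is a sketch, not the orchestrator's actual config loader; the default of `"ollama"` when `LLM_PROVIDER` is unset is an assumption.

```python
import os

def resolve_provider() -> str:
    """Read LLM_PROVIDER from the environment, assuming 'ollama' as the default."""
    provider = os.environ.get("LLM_PROVIDER", "ollama").lower()
    if provider == "openrouter" and not os.environ.get("OPENROUTER_API_KEY"):
        # Fail fast: cloud inference cannot work without credentials.
        raise RuntimeError("OPENROUTER_API_KEY must be set when LLM_PROVIDER=openrouter")
    return provider
```

Failing fast on a missing `OPENROUTER_API_KEY` surfaces misconfiguration at startup rather than mid-pipeline, after GPU-heavy stages have already run.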