Audio Transcription (ASR)

The system uses Faster-Whisper to convert video audio tracks into timestamped text. This component underpins semantic search over speech, letting the RAG agent locate specific moments based on spoken dialogue or narration.

Transcription Engine: Faster-Whisper

Faster-Whisper is a reimplementation of OpenAI's Whisper model on top of CTranslate2, a fast inference engine for Transformer models. Its maintainers report up to 4x faster inference than the original implementation at the same accuracy, with lower memory use. In this architecture, the engine is responsible for:

Voice Activity Detection (VAD): Filtering out silence and non-speech segments.
Timestamp Generation: Providing precise [start, end] offsets for every sentence or phrase.
CUDA Acceleration: Utilizing NVIDIA GPUs to process long-form video audio in near real-time.
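
The three responsibilities above map onto a handful of Faster-Whisper calls. The following is a minimal sketch rather than the system's actual code: the `transcribe` and `segments_to_records` helpers, the `"small"` model size, and the `float16` compute type are assumptions chosen for illustration. Only `WhisperModel`, its `transcribe` method, and the `vad_filter` option come from the Faster-Whisper library itself.

```python
from types import SimpleNamespace
from typing import Iterable


def segments_to_records(segments: Iterable) -> list:
    """Convert segment objects (with .start, .end, .text) into plain dicts."""
    return [
        {"start": round(seg.start, 2), "end": round(seg.end, 2), "text": seg.text.strip()}
        for seg in segments
    ]


def transcribe(audio_path: str, model_size: str = "small") -> list:
    """Transcribe one audio file on the GPU with VAD filtering enabled.

    Hypothetical helper: model size and compute type are illustrative.
    """
    # Imported lazily so segments_to_records works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # vad_filter=True enables the bundled Silero VAD to skip non-speech audio.
    segments, info = model.transcribe(audio_path, vad_filter=True)
    print(f"Detected language: {info.language} ({info.language_probability:.2f})")
    # `segments` is a lazy generator; decoding happens while it is consumed.
    return segments_to_records(segments)


# Stand-in segment for illustration; real ones come from model.transcribe().
demo = segments_to_records(
    [SimpleNamespace(start=0.0, end=2.504, text=" Hello there. ")]
)
```

Keeping the conversion step separate from the model call makes the timestamp handling easy to test without a GPU.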

The Ingestion Pipeline

Transcription is an automated stage within the video ingestion workflow. When you submit a video via the API or UI, the system follows these steps:

Audio Extraction: The system strips the audio stream from the video source.
Asynchronous Processing: The transcription job is queued and runs in the background.
Chunking: The resulting transcript is broken into semantic segments (chunks) mapped to specific time intervals.
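
The extraction and chunking steps above can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: it assumes `ffmpeg` is available on the PATH, the function names are hypothetical, and the chunker uses a simple duration-based grouping as a stand-in for whatever semantic segmentation the real pipeline applies. The output field names mirror the search result schema used elsewhere in this section.

```python
import subprocess
from typing import Optional


def build_extract_command(video_path: str, wav_path: str) -> list:
    """Build an ffmpeg command that writes a 16 kHz mono audio track."""
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", video_path,
        "-vn",               # drop the video stream
        "-ac", "1",          # downmix to mono
        "-ar", "16000",      # resample to 16 kHz
        wav_path,
    ]


def extract_audio(video_path: str, wav_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if extraction fails."""
    subprocess.run(build_extract_command(video_path, wav_path), check=True)


def chunk_segments(segments: list, max_seconds: float = 30.0) -> list:
    """Greedily merge consecutive segments into chunks spanning at most
    max_seconds (a single longer segment becomes its own chunk)."""
    chunks = []
    current: Optional[dict] = None
    for seg in segments:
        fits = current is not None and seg["end"] - current["timestamp_start"] <= max_seconds
        if fits:
            # Extend the open chunk with this segment.
            current["timestamp_end"] = seg["end"]
            current["content"] += " " + seg["text"]
        else:
            if current is not None:
                chunks.append(current)
            # Start a new chunk at this segment.
            current = {
                "chunk_type": "audio",
                "timestamp_start": seg["start"],
                "timestamp_end": seg["end"],
                "content": seg["text"],
            }
    if current is not None:
        chunks.append(current)
    return chunks
```

Because every chunk keeps the start of its first segment and the end of its last, a search hit can always be mapped back to a playable interval in the source video.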

You can monitor the progress of this stage through the JobStatus schema. The example below shows a job currently in the transcribing stage:

{
  "job_id": "job_01HXYZ123ABC",
  "status": "transcribing",
  "progress_percent": 45,
  "current_stage": "Extracting speech-to-text with timestamps",
  "message": "Processing audio stream..."
}
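
A client can wait for this stage by polling the job status until it reaches a terminal state. The sketch below is hedged in two ways: the source does not name the status endpoint, so the caller supplies a `fetch_status` callable that returns a JobStatus-shaped dict, and the terminal status values `"completed"` and `"failed"` are assumptions, since only `"transcribing"` appears above.

```python
import time
from typing import Callable

# Assumed terminal states; the documentation only shows "transcribing".
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    poll_seconds: float = 2.0,
    max_polls: int = 300,
) -> dict:
    """Poll until the job reaches a terminal status or max_polls is exhausted."""
    for _ in range(max_polls):
        status = fetch_status()
        print(f"[{status['status']}] {status['progress_percent']}% {status['message']}")
        if status["status"] in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")
```

Passing the fetch function in keeps the loop independent of any particular HTTP client and easy to exercise with canned responses.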

Searching Transcripts

Once indexed, audio chunks are stored in the vector database with a chunk_type of audio. When performing a search, the system retrieves these segments based on semantic similarity to the user's query.

Search Request Configuration

To specifically include or exclude audio data in your searches, use the include_audio flag in the search request:

curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What did the speaker say about long-term goals?",
    "include_audio": true,
    "top_k": 5
  }'
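
The same request can be issued from Python with only the standard library. This is a small sketch that reuses the URL and fields from the curl example; the helper names are hypothetical, and the shape of the response body is not assumed beyond being JSON.

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:8000/api/v1/search"


def build_search_request(
    query: str, include_audio: bool = True, top_k: int = 5
) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {"query": query, "include_audio": include_audio, "top_k": top_k}
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, include_audio: bool = True, top_k: int = 5):
    """Send the request and decode the JSON response body."""
    request = build_search_request(query, include_audio, top_k)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Setting `include_audio` to `False` in the same call excludes transcript chunks from the results.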

Search Result Schema

Audio-based search results return the matched text fragment together with the start and end of the interval in which it was spoken:

{
  "video_id": "vid_98765",
  "chunk_id": "chunk_audio_001",
  "chunk_type": "audio",
  "timestamp_start": 42.5,
  "timestamp_end": 51.0,
  "content": "Our main objective for the next quarter is to stabilize the infrastructure...",
  "score": 0.89
}
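
On the client side, a result like the one above can be loaded into a typed record and rendered as a playable time range. The sketch below assumes the timestamps are offsets in seconds, which the example values suggest but the schema does not state; `SearchHit`, `format_timestamp`, and `describe` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchHit:
    video_id: str
    chunk_id: str
    chunk_type: str
    timestamp_start: float
    timestamp_end: float
    content: str
    score: float


def format_timestamp(seconds: float) -> str:
    """Render an offset in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def describe(hit: SearchHit) -> str:
    """One-line summary with the time range first, for display or logging."""
    span = f"{format_timestamp(hit.timestamp_start)} to {format_timestamp(hit.timestamp_end)}"
    return f"{hit.video_id} [{span}] score={hit.score:.2f}: {hit.content}"


# The example result from this section, as a plain dict.
raw = {
    "video_id": "vid_98765",
    "chunk_id": "chunk_audio_001",
    "chunk_type": "audio",
    "timestamp_start": 42.5,
    "timestamp_end": 51.0,
    "content": "Our main objective for the next quarter is to stabilize the infrastructure...",
    "score": 0.89,
}
hit = SearchHit(**raw)
```

Note that `SearchHit(**raw)` raises a `TypeError` if the API returns fields beyond those shown here, so a real client may prefer to pick out the keys it needs.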

Configuration and Optimization

The ASR behavior is primarily governed by the system's environment configuration. While the model size (e.g., base, small, medium, large-v3) can be configured in the backend settings, the default deployment is optimized for a balance between speed and accuracy.

| Setting | Description |
|---------|-------------|
| Device | Defaults to cuda for GPU acceleration. |
| Compute Type | Optimized to float16 or int8_float16 depending on hardware support. |
| Language Detection | Automatically detects the primary language of the video. |
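
One way such environment-driven settings might be resolved is sketched below. Everything specific here is an assumption made for illustration: the variable names `ASR_MODEL_SIZE`, `ASR_DEVICE`, and `ASR_COMPUTE_TYPE` do not come from the source, the `"small"` fallback is a placeholder rather than the documented default, and the `int8` fallback for CPU is simply a compute type that CTranslate2 supports on that device.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AsrSettings:
    model_size: str
    device: str
    compute_type: str


def load_asr_settings(env: Mapping[str, str] = os.environ) -> AsrSettings:
    """Resolve ASR settings from environment-style key/value pairs."""
    # Variable names and the "small" fallback are illustrative assumptions.
    device = env.get("ASR_DEVICE", "cuda")
    # float16 suits CUDA devices; int8 is a common choice for CPU inference.
    default_compute = "float16" if device == "cuda" else "int8"
    return AsrSettings(
        model_size=env.get("ASR_MODEL_SIZE", "small"),
        device=device,
        compute_type=env.get("ASR_COMPUTE_TYPE", default_compute),
    )
```

Accepting any mapping instead of reading `os.environ` directly lets the same function be driven from a test fixture or a parsed `.env` file.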

> [!TIP]
> For best results with ASR, ensure the video source has clear audio. Heavy background noise or overlapping speakers may reduce timestamp precision, although semantic retrieval is typically less affected.