Audio Transcription (ASR)
The system utilizes Faster-Whisper to convert video audio tracks into high-fidelity, timestamped text. This component is critical for semantic search, allowing the RAG agent to locate specific moments based on spoken dialogue or narration.
Transcription Engine: Faster-Whisper
Faster-Whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, a fast inference engine for Transformer models. The project reports transcription up to 4 times faster than the original openai/whisper implementation at the same accuracy, with lower memory use. In this architecture, the engine is responsible for:
- Voice Activity Detection (VAD): Filtering out silence and non-speech segments.
- Timestamp Generation: Providing precise [start, end] offsets for every sentence or phrase.
- CUDA Acceleration: Utilizing NVIDIA GPUs to process long-form video audio in near real-time.
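The responsibilities listed above map onto a small amount of Faster-Whisper code. The following is a minimal sketch, not this system's actual implementation: the function names transcribe and to_records and the Segment stand-in are hypothetical, and the model size and compute type are placeholder choices. WhisperModel, its transcribe method, and the vad_filter option come from the faster-whisper package, which is imported lazily so the conversion helper can be exercised without a GPU.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """Hypothetical stand-in exposing the fields read from a Faster-Whisper segment."""
    start: float
    end: float
    text: str


def to_records(segments: Iterable) -> list:
    """Convert segment objects (with .start, .end, .text) into plain dicts."""
    return [
        {"start": round(s.start, 2), "end": round(s.end, 2), "text": s.text.strip()}
        for s in segments
    ]


def transcribe(audio_path: str, model_size: str = "large-v3") -> list:
    """Transcribe one audio file. Requires the faster-whisper package and a CUDA GPU."""
    # Imported lazily so to_records() stays usable without the dependency installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # vad_filter=True enables voice activity detection to skip non-speech audio.
    segments, _info = model.transcribe(audio_path, vad_filter=True)
    # segments is a lazy generator; iterating it is what runs the transcription.
    return to_records(segments)
```

Calling transcribe("audio.wav") on a machine with the package and a GPU would return a list of dicts with start, end, and text keys, one per recognized segment.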
The Ingestion Pipeline
Transcription is an automated stage within the video ingestion workflow. When you submit a video via the API or UI, the system follows these steps:
- Audio Extraction: The system strips the audio stream from the video source.
- Asynchronous Processing: The transcription job is queued.
- Chunking: The resulting transcript is broken into semantic segments (chunks) mapped to specific time intervals.
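To make the chunking step concrete, here is a simplified sketch. It assumes the transcript arrives as (start, end, text) tuples and groups them by a maximum duration, which is a stand-in for the semantic segmentation the pipeline performs. The Chunk class, the chunk_segments function, and the 30-second limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    start: float
    end: float
    text: str


def chunk_segments(segments, max_seconds=30.0):
    """Greedily merge consecutive (start, end, text) segments into time-bounded chunks.

    A segment joins the current chunk while the chunk would stay within
    max_seconds; otherwise it starts a new chunk. A single segment longer
    than max_seconds becomes its own chunk.
    """
    chunks = []
    current = None
    for start, end, text in segments:
        if current is not None and end - current.start <= max_seconds:
            current.end = end
            current.text = f"{current.text} {text.strip()}"
        else:
            if current is not None:
                chunks.append(current)
            current = Chunk(start, end, text.strip())
    if current is not None:
        chunks.append(current)
    return chunks
```

Each resulting chunk keeps the start time of its first segment and the end time of its last, so it still maps to a specific interval in the video.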
You can monitor the progress of this stage through the JobStatus schema. For example, a job currently in the transcribing stage reports:
{
  "job_id": "job_01HXYZ123ABC",
  "status": "transcribing",
  "progress_percent": 45,
  "current_stage": "Extracting speech-to-text with timestamps",
  "message": "Processing audio stream..."
}
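A client can load a JobStatus payload into a typed object before displaying it. The sketch below assumes the schema contains the five fields shown in the example and ignores any others. The JobStatus dataclass and both helper functions are illustrative names, not part of the API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JobStatus:
    job_id: str
    status: str
    progress_percent: int
    current_stage: str
    message: str


def parse_job_status(payload: str) -> JobStatus:
    """Parse a JSON payload, keeping only the fields declared on JobStatus."""
    data = json.loads(payload)
    return JobStatus(**{name: data[name] for name in JobStatus.__dataclass_fields__})


def describe(job: JobStatus) -> str:
    """Render a one-line progress summary for logs or a CLI."""
    return f"{job.job_id}: {job.status} ({job.progress_percent}%) - {job.current_stage}"
```

A payload that lacks one of the five fields raises KeyError, which surfaces schema drift early instead of silently producing a partial object.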
Searching Transcripts
Once indexed, audio chunks are stored in the vector database with a chunk_type of audio. When performing a search, the system retrieves these segments based on semantic similarity to the user's query.
Search Request Configuration
To specifically include or exclude audio data in your searches, use the include_audio flag in the search request:
curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What did the speaker say about long-term goals?",
    "include_audio": true,
    "top_k": 5
  }'
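The same request can be issued from Python using only the standard library. This sketch reuses the endpoint and fields from the curl example. The function names and the 30-second timeout are assumptions, and the response is returned as parsed JSON without assuming its top-level shape.

```python
import json
import urllib.request


def build_search_request(query, include_audio=True, top_k=5,
                         base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the search endpoint."""
    body = json.dumps(
        {"query": query, "include_audio": include_audio, "top_k": top_k}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, **kwargs):
    """Send the search request and return the decoded JSON response."""
    request = build_search_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Separating request construction from sending makes the payload easy to inspect or unit test without a running server.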
Search Result Schema
Audio-based search results return the specific text fragment and the exact time it occurred:
{
  "video_id": "vid_98765",
  "chunk_id": "chunk_audio_001",
  "chunk_type": "audio",
  "timestamp_start": 42.5,
  "timestamp_end": 51.0,
  "content": "Our main objective for the next quarter is to stabilize the infrastructure...",
  "score": 0.89
}
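Because each result carries its offsets in seconds, a client can turn them into a readable time range for display or for seeking a video player. The helpers below are a hypothetical sketch that assumes results are dicts with the fields shown in the schema above.

```python
def format_timestamp(seconds: float) -> str:
    """Format an offset in seconds as HH:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def summarize_result(result: dict) -> str:
    """Render one search result as a single display line."""
    start = format_timestamp(result["timestamp_start"])
    end = format_timestamp(result["timestamp_end"])
    return (
        f"[{start}-{end}] {result['video_id']} "
        f"(score {result['score']:.2f}): {result['content']}"
    )


def audio_hits(results, min_score=0.0):
    """Keep only audio chunks at or above a minimum similarity score."""
    return [
        r for r in results
        if r.get("chunk_type") == "audio" and r.get("score", 0.0) >= min_score
    ]
```

The min_score threshold is an illustrative client-side filter. The document does not define a score cutoff, so any value would need tuning against real queries.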
Configuration and Optimization
The ASR behavior is primarily governed by the system's environment configuration. While the model size (e.g., base, small, medium, large-v3) can be configured in the backend settings, the default deployment is optimized for a balance between speed and accuracy.
| Setting | Description |
|---------|-------------|
| Device | Defaults to cuda for GPU acceleration. |
| Compute Type | Uses float16 or int8_float16, depending on hardware support. |
| Language Detection | Automatically detects the primary language of the video. |
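As an illustration of how these settings might be resolved from the environment, consider the sketch below. The variable names (ASR_MODEL_SIZE, ASR_DEVICE, ASR_COMPUTE_TYPE, ASR_LANGUAGE), the small default model, and the CPU fallback with int8 are all assumptions made for the example. The actual backend setting names are not specified here.

```python
import os


def resolve_asr_settings(env=None, cuda_available=True):
    """Resolve ASR settings from environment-style key/value pairs.

    All variable names and defaults are hypothetical placeholders.
    """
    env = os.environ if env is None else env

    device = env.get("ASR_DEVICE", "cuda")
    if device == "cuda" and not cuda_available:
        # Assumed fallback: run on CPU when no GPU is present.
        device = "cpu"

    # float16 suits GPUs; int8 is a common choice for CPU inference.
    default_compute = "float16" if device == "cuda" else "int8"

    return {
        "model_size": env.get("ASR_MODEL_SIZE", "small"),
        "device": device,
        "compute_type": env.get("ASR_COMPUTE_TYPE", default_compute),
        # None leaves the language unset so it can be detected automatically.
        "language": env.get("ASR_LANGUAGE") or None,
    }
```

Passing a plain dict instead of reading os.environ directly keeps the function easy to test and makes overrides explicit, for example resolve_asr_settings({"ASR_COMPUTE_TYPE": "int8_float16"}).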
[!TIP] For best results with ASR, ensure the video source has clear audio. High background noise or overlapping speakers may reduce timestamp precision, though semantic retrieval remains robust.