# Vector Storage & Retrieval

## Vector Storage Architecture
The system uses ChromaDB as its vector database for managing and retrieving multimodal embeddings. By indexing both audio transcripts and visual descriptions within the same vector space (or in specialized collections), the system enables natural language search across different media modalities.
## Multimodal Indexing
Videos are processed into granular chunks, which are then embedded and stored with rich metadata. The system distinguishes between two primary types of content:
| Chunk Type | Source | Content Description |
| :--- | :--- | :--- |
| `audio` | Faster-Whisper ASR | Textual transcripts derived from the audio track. |
| `visual` | LLaVA / Llama 3.2 Vision | Descriptive summaries of visual events within a video segment. |
### Data Schema
Each entry in the vector store includes the following metadata to ensure precise retrieval and timestamped playback:
- `video_id`: Unique identifier for the source video.
- `chunk_id`: Unique identifier for the specific segment.
- `chunk_type`: Either `audio` or `visual`.
- `timestamp_start` / `timestamp_end`: The temporal location (in seconds) within the video.
- `content`: The raw text (transcript or visual description) that was embedded.
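As an illustration, an entry matching this schema can be assembled as a plain dictionary before being handed to the vector store. This is a hypothetical sketch, not the project's actual code; the helper name and values are invented for the example.

```python
# Hypothetical sketch: assembling one vector-store entry with the
# metadata fields listed above. Values are illustrative only.
def make_entry(video_id, chunk_id, chunk_type, start, end, content):
    # Guard the two invariants implied by the schema.
    assert chunk_type in ("audio", "visual")
    assert 0 <= start <= end
    return {
        "video_id": video_id,
        "chunk_id": chunk_id,
        "chunk_type": chunk_type,
        "timestamp_start": start,   # seconds
        "timestamp_end": end,       # seconds
        "content": content,         # the text that gets embedded
    }

entry = make_entry("vid_123", "chk_456", "visual", 3.0, 12.0,
                   "A close up of an elephant moving its trunk near a fence.")
```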
## Search & Retrieval

Retrieval is handled via the `/api/v1/search` endpoint. The system performs a similarity search based on the vector distance between the user's natural language query and the stored embeddings.
### Search Configuration

The search behavior can be tuned using the `SearchRequest` schema. This allows for filtering by specific videos or toggling content modalities.
Example Request:

```json
{
  "query": "Where is the person using a laptop?",
  "top_k": 5,
  "include_visual": true,
  "include_audio": false,
  "video_ids": ["vid_01HXYZ123ABC"]
}
```
### Retrieval Parameters

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `query` | string | (required) | The natural language string to search for. |
| `top_k` | int | 5 | Number of relevant chunks to return (max: 20). |
| `video_ids` | list[str] | null | Optional list of IDs to restrict the search scope. |
| `include_visual` | bool | true | Whether to search across visual scene descriptions. |
| `include_audio` | bool | true | Whether to search across audio transcripts. |
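The parameters above can be assembled into a request body before POSTing to `/api/v1/search`. The following sketch applies the table's defaults and the `top_k` cap; the helper function is hypothetical, not part of the API.

```python
# Hypothetical helper: build a SearchRequest payload with the defaults
# and limits from the parameter table. Only the endpoint path and field
# names come from the docs; the function itself is illustrative.
import json

def build_search_request(query, top_k=5, video_ids=None,
                         include_visual=True, include_audio=True):
    if not 1 <= top_k <= 20:
        raise ValueError("top_k must be between 1 and 20")
    return {
        "query": query,
        "top_k": top_k,
        "video_ids": video_ids,          # null = search all videos
        "include_visual": include_visual,
        "include_audio": include_audio,
    }

payload = build_search_request("Where is the person using a laptop?",
                               include_audio=False,
                               video_ids=["vid_01HXYZ123ABC"])
body = json.dumps(payload)  # POST body for /api/v1/search
```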
### Search Result Payload

The system returns a list of `SearchResult` objects, ranked by their similarity score.
```json
{
  "query": "elephants",
  "results": [
    {
      "video_id": "vid_123",
      "chunk_id": "chk_456",
      "chunk_type": "visual",
      "timestamp_start": 3.0,
      "timestamp_end": 12.0,
      "content": "A close up of an elephant moving its trunk near a fence.",
      "score": 0.89
    }
  ],
  "total_results": 1,
  "search_time_ms": 45
}
```
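Because each result carries `timestamp_start`, a client can turn a hit directly into a seek position for timestamped playback. This is a small assumed helper, not part of the documented API:

```python
# Assumed client-side helper: convert a SearchResult's start timestamp
# (seconds) into an H:MM:SS seek position for timestamped playback.
def seek_position(result):
    total = int(result["timestamp_start"])
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"

result = {"video_id": "vid_123", "timestamp_start": 3.0, "score": 0.89}
print(seek_position(result))  # → "0:00:03"
```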
## Embedding Configuration

The vector storage behavior is influenced by the selected LLM provider in your `.env` file.
- Local Inference (`ollama`): Uses lightweight, local embedding models (e.g., `all-minilm` or those built into Llama 3.2 Vision) for privacy and zero cost.
- Cloud Inference (`openrouter`): Routes embedding requests to high-performance cloud providers for increased semantic accuracy.
```bash
# Change the embedding/LLM provider
LLM_PROVIDER=ollama
```
## Technical Lifecycle
- Ingestion: During the `embedding` and `indexing` stages of the ingestion pipeline, text chunks are converted into high-dimensional vectors.
- Persistence: Vectors and metadata are committed to the ChromaDB volume (defined in `docker-compose.yml`).
- Querying: The `SearchRequest` is embedded using the same model, and a K-Nearest Neighbors (KNN) search is performed.
- Ranking: Results are filtered by the `top_k` parameter and returned with a similarity `score` to the application layer.
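The querying and ranking steps can be illustrated with a toy KNN over cosine similarity. Note this linear scan is only a teaching sketch with made-up 3-d vectors; the real system relies on ChromaDB's index, not this code.

```python
# Toy illustration of the query + ranking steps: embed the query with
# the same model, rank stored vectors by cosine similarity, keep top_k.
# The 3-d vectors below are stand-ins for real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

store = {
    "chk_1": [0.9, 0.1, 0.0],
    "chk_2": [0.0, 1.0, 0.2],
    "chk_3": [0.8, 0.2, 0.1],
}

def knn(query_vec, top_k=2):
    # Score every stored chunk, sort descending, truncate to top_k.
    ranked = sorted(store.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [(cid, round(cosine(query_vec, vec), 3))
            for cid, vec in ranked[:top_k]]

print(knn([1.0, 0.0, 0.0]))
```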