# Vector Storage & Retrieval

## Vector Storage Architecture
The system uses ChromaDB as its vector database for managing and retrieving multimodal embeddings. By indexing both audio transcripts and visual descriptions within the same vector space (or in specialized collections), the system enables natural language search across different media modalities.
## Multimodal Indexing
Videos are processed into granular chunks, which are then embedded and stored with rich metadata. The system distinguishes between two primary types of content:
| Chunk Type | Source | Content Description |
| :--- | :--- | :--- |
| `audio` | Faster-Whisper ASR | Textual transcripts derived from the audio track. |
| `visual` | LLaVA / Llama 3.2 Vision | Descriptive summaries of visual events within a video segment. |
### Data Schema
Each entry in the vector store includes the following metadata to ensure precise retrieval and timestamped playback:
- `video_id`: Unique identifier for the source video.
- `chunk_id`: Unique identifier for the specific segment.
- `chunk_type`: Either `audio` or `visual`.
- `timestamp_start` / `timestamp_end`: The temporal location (in seconds) within the video.
- `content`: The raw text (transcript or visual description) that was embedded.
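As an illustration, an entry matching this schema can be assembled as a plain dictionary before being handed to the vector store. This is a hypothetical sketch, not the project's actual code; the helper name and values are invented for the example.

```python
# Hypothetical sketch: assembling one vector-store entry with the
# metadata fields listed above. Values are illustrative only.
def make_entry(video_id, chunk_id, chunk_type, start, end, content):
    # Guard the two invariants implied by the schema.
    assert chunk_type in ("audio", "visual")
    assert 0 <= start <= end
    return {
        "video_id": video_id,
        "chunk_id": chunk_id,
        "chunk_type": chunk_type,
        "timestamp_start": start,   # seconds
        "timestamp_end": end,       # seconds
        "content": content,         # the text that gets embedded
    }

entry = make_entry("vid_123", "chk_456", "visual", 3.0, 12.0,
                   "A close up of an elephant moving its trunk near a fence.")
```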
## Search & Retrieval

Retrieval is handled via the `/api/v1/search` endpoint. The system performs a similarity search based on the vector distance between the user's natural language query and the stored embeddings.
### Search Configuration

The search behavior can be tuned using the `SearchRequest` schema. This allows for filtering by specific videos or toggling content modalities.
Example Request:

```json
{
  "query": "Where is the person using a laptop?",
  "top_k": 5,
  "include_visual": true,
  "include_audio": false,
  "video_ids": ["vid_01HXYZ123ABC"]
}
```
### Retrieval Parameters

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `query` | string | (required) | The natural language string to search for. |
| `top_k` | int | 5 | Number of relevant chunks to return (max: 20). |
| `video_ids` | list[str] | null | Optional list of IDs to restrict the search scope. |
| `include_visual` | bool | true | Whether to search across visual scene descriptions. |
| `include_audio` | bool | true | Whether to search across audio transcripts. |
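The parameters above can be assembled into a request body before POSTing to `/api/v1/search`. The following sketch applies the table's defaults and the `top_k` cap; the helper function is hypothetical, not part of the API.

```python
# Hypothetical helper: build a SearchRequest payload with the defaults
# and limits from the parameter table. Only the endpoint path and field
# names come from the docs; the function itself is illustrative.
import json

def build_search_request(query, top_k=5, video_ids=None,
                         include_visual=True, include_audio=True):
    if not 1 <= top_k <= 20:
        raise ValueError("top_k must be between 1 and 20")
    return {
        "query": query,
        "top_k": top_k,
        "video_ids": video_ids,          # null = search all videos
        "include_visual": include_visual,
        "include_audio": include_audio,
    }

payload = build_search_request("Where is the person using a laptop?",
                               include_audio=False,
                               video_ids=["vid_01HXYZ123ABC"])
body = json.dumps(payload)  # POST body for /api/v1/search
```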
### Search Result Payload

The system returns a list of `SearchResult` objects, ranked by their similarity score.
```json
{
  "query": "elephants",
  "results": [
    {
      "video_id": "vid_123",
      "chunk_id": "chk_456",
      "chunk_type": "visual",
      "timestamp_start": 3.0,
      "timestamp_end": 12.0,
      "content": "A close up of an elephant moving its trunk near a fence.",
      "score": 0.89
    }
  ],
  "total_results": 1,
  "search_time_ms": 45
}
```
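Because each result carries `timestamp_start`, a client can turn a hit directly into a seek position for timestamped playback. This is a small assumed helper, not part of the documented API:

```python
# Assumed client-side helper: convert a SearchResult's start timestamp
# (seconds) into an H:MM:SS seek position for timestamped playback.
def seek_position(result):
    total = int(result["timestamp_start"])
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"

result = {"video_id": "vid_123", "timestamp_start": 3.0, "score": 0.89}
print(seek_position(result))  # → "0:00:03"
```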
## Embedding Configuration

The vector storage behavior is influenced by the selected LLM provider in your `.env` file.
- Local Inference (`ollama`): Uses lightweight, local embedding models (e.g., `all-minilm` or those built into Llama 3.2 Vision) for privacy and zero cost.
- Cloud Inference (`openrouter`): Routes embedding requests to high-performance cloud providers for increased semantic accuracy.
```bash
# Change the embedding/LLM provider
LLM_PROVIDER=ollama
```
## Technical Lifecycle
- Ingestion: During the `embedding` and `indexing` stages of the ingestion pipeline, text chunks are converted into high-dimensional vectors.
- Persistence: Vectors and metadata are committed to the ChromaDB volume (defined in `docker-compose.yml`).
- Querying: The `SearchRequest` is embedded using the same model, and a K-Nearest Neighbors (KNN) search is performed.
- Ranking: Results are filtered by the `top_k` parameter and returned with a similarity `score` to the application layer.
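The querying and ranking steps can be illustrated with a toy KNN over cosine similarity. Note this linear scan is only a teaching sketch with made-up 3-d vectors; the real system relies on ChromaDB's index, not this code.

```python
# Toy illustration of the query + ranking steps: embed the query with
# the same model, rank stored vectors by cosine similarity, keep top_k.
# The 3-d vectors below are stand-ins for real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

store = {
    "chk_1": [0.9, 0.1, 0.0],
    "chk_2": [0.0, 1.0, 0.2],
    "chk_3": [0.8, 0.2, 0.1],
}

def knn(query_vec, top_k=2):
    # Score every stored chunk, sort descending, truncate to top_k.
    ranked = sorted(store.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [(cid, round(cosine(query_vec, vec), 3))
            for cid, vec in ranked[:top_k]]

print(knn([1.0, 0.0, 0.0]))
```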