# GPU & Hardware Setup
The Multimodal Video RAG system performs compute-intensive tasks, including Whisper-based audio transcription and vision-language model (LLaVA) inference. To achieve acceptable latency for video processing, an NVIDIA GPU with at least 8GB of VRAM is strongly recommended.
## Prerequisites
Before starting the services, ensure your host machine has the following installed:
- NVIDIA Drivers: Version 535 or higher.
- Docker & Docker Compose: Docker Engine 24.0+.
- NVIDIA Container Toolkit: Required for Docker to interface with your GPU hardware.
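The driver prerequisite can be sanity-checked from the shell. A minimal sketch, where `driver_ok` is a hypothetical helper that compares the driver's major version against the 535 minimum:

```shell
# driver_ok — hypothetical helper: succeed if the NVIDIA driver version
# string meets the 535 minimum required by this setup.
driver_ok() {
  major=${1%%.*}              # keep everything before the first dot
  [ "${major:-0}" -ge 535 ]
}

# Check the installed driver (requires nvidia-smi on the host):
# driver_ok "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)" \
#   || echo "Driver older than 535 — please upgrade" >&2
```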
## 1. Install NVIDIA Container Toolkit
If you haven't configured your system for GPU-accelerated containers, follow these steps to install the toolkit:
```bash
# Configure the repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Restart Docker to apply changes
sudo systemctl restart docker
```
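On recent toolkit versions, the Docker daemon configuration can also be written for you by the `nvidia-ctk` CLI that ships with the toolkit, instead of editing `/etc/docker/daemon.json` by hand. A sketch, assuming the package install above succeeded:

```shell
# Let the toolkit configure Docker's runtime entry
# (writes the nvidia runtime into /etc/docker/daemon.json)
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker so the new runtime takes effect
sudo systemctl restart docker
```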
## 2. Verify GPU Accessibility
Run the following command to ensure Docker can successfully access your NVIDIA GPU:
```bash
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
```
If configured correctly, this command will output a table showing your GPU model and driver version.
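For setup scripts or CI, this verification can be wrapped so the job fails fast when no GPU is detected. A sketch, where `smi_shows_gpu` is a hypothetical helper that inspects the command's output for the driver-version banner:

```shell
# smi_shows_gpu — hypothetical helper: succeed only if nvidia-smi output
# contains the driver-version banner, i.e. a GPU was actually detected.
smi_shows_gpu() {
  echo "$1" | grep -q "Driver Version"
}

# Usage against the real container (requires the toolkit from step 1):
# out=$(docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi)
# smi_shows_gpu "$out" || { echo "GPU not visible inside containers" >&2; exit 1; }
```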
## 3. Docker Compose Configuration
The project uses the `nvidia` runtime within Docker Compose. The `docker/docker-compose.dev.yml` file is pre-configured to grant the `backend` and Ollama services access to the host GPU.
Key configuration used in the services:
```yaml
services:
  backend:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
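To confirm the compose file actually renders a GPU reservation, `docker compose config` prints the fully resolved configuration, which can then be inspected. A sketch, where `has_gpu_reservation` is a hypothetical grep helper:

```shell
# has_gpu_reservation — hypothetical helper: check a rendered compose
# configuration for an nvidia device reservation.
has_gpu_reservation() {
  echo "$1" | grep -q "driver: nvidia"
}

# Usage (requires Docker Compose and the project checkout):
# has_gpu_reservation "$(docker compose -f docker/docker-compose.dev.yml config)" \
#   || echo "no GPU reservation found in rendered config" >&2
```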
## 4. Hardware Sizing & Model Requirements
The default models pulled during the Quick Start have the following approximate hardware requirements:
| Component | Model | VRAM Requirement | Acceleration |
|-----------|-------|------------------|--------------|
| ASR | Faster-Whisper (Large-v3) | ~4 GB | CUDA (CTranslate2) |
| Vision | LLaVA:7b | ~4.5 GB | Ollama (CUDA) |
| LLM | Llama 3.1:8b | ~5.5 GB | Ollama (CUDA) |
Note: If you have limited VRAM (e.g., 8 GB), you may encounter "Out of Memory" (OOM) errors when the ASR and LLM services attempt to load into VRAM simultaneously. In that case, consider using quantized models or offloading the LLM to a cloud provider by setting `LLM_PROVIDER=openrouter` in your `.env` file.
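To make the note concrete: the three defaults together need roughly 4 + 4.5 + 5.5 ≈ 14 GB when resident simultaneously, which is why an 8 GB card can OOM. A sketch of the arithmetic, using the approximate figures from the table above (in MiB):

```shell
# Approximate per-model VRAM figures from the table above, in MiB.
ASR_MB=4096      # Faster-Whisper Large-v3, ~4 GB
VISION_MB=4608   # LLaVA:7b, ~4.5 GB
LLM_MB=5632      # Llama 3.1:8b, ~5.5 GB

# fits_in_vram — rough sketch: do all three default models fit at once?
fits_in_vram() {  # $1 = available VRAM in MiB
  [ "$1" -ge $((ASR_MB + VISION_MB + LLM_MB)) ]
}
```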
## 5. Troubleshooting CUDA Initialization
If the backend logs show `CUDA_ERROR_NO_DEVICE` or fall back to CPU processing:

- Check Environment Variables: Ensure `CUDA_VISIBLE_DEVICES` is not being restricted in your shell.
- Update Toolkit: Ensure the `nvidia-container-runtime` is the default runtime in `/etc/docker/daemon.json`:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

- Log Check: Run `docker logs ollama` to verify that the LLaVA model successfully detected the GPU during initialization.
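The `CUDA_VISIBLE_DEVICES` check can be done mechanically: an unset variable means all GPUs are visible, while an empty string hides every device. A sketch, where `cuda_unrestricted` is a hypothetical helper:

```shell
# cuda_unrestricted — hypothetical helper: succeed unless
# CUDA_VISIBLE_DEVICES is set to a value that hides all GPUs.
#   Unset            -> all devices visible (OK)
#   Set, non-empty   -> restricted to the listed devices
#   Set, empty ("")  -> ALL devices hidden (the problem case)
cuda_unrestricted() {
  if [ -z "${CUDA_VISIBLE_DEVICES+set}" ]; then
    return 0                        # unset: no restriction
  fi
  [ -n "$CUDA_VISIBLE_DEVICES" ]    # empty string hides every GPU
}
```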