# GPU & Hardware Setup
The Multimodal Video RAG system performs compute-intensive tasks, including Whisper-based audio transcription and vision-language model (LLaVA) inference. To achieve acceptable latency for video processing, an NVIDIA GPU with at least 8GB of VRAM is strongly recommended.
## Prerequisites
Before starting the services, ensure your host machine has the following installed:
- NVIDIA Drivers: Version 535 or higher.
- Docker & Docker Compose: Docker Engine 24.0+.
- NVIDIA Container Toolkit: Required for Docker to interface with your GPU hardware.
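The driver prerequisite can be sanity-checked from the shell. A minimal sketch, where `driver_ok` is a hypothetical helper that compares the driver's major version against the 535 minimum:

```shell
# driver_ok — hypothetical helper: succeed if the NVIDIA driver version
# string meets the 535 minimum required by this setup.
driver_ok() {
  major=${1%%.*}              # keep everything before the first dot
  [ "${major:-0}" -ge 535 ]
}

# Check the installed driver (requires nvidia-smi on the host):
# driver_ok "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)" \
#   || echo "Driver older than 535 — please upgrade" >&2
```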
## 1. Install NVIDIA Container Toolkit
If you haven't configured your system for GPU-accelerated containers, follow these steps to install the toolkit:
```bash
# Configure the repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Restart Docker to apply changes
sudo systemctl restart docker
```
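On recent toolkit versions, the Docker daemon configuration can also be written for you by the `nvidia-ctk` CLI that ships with the toolkit, instead of editing `/etc/docker/daemon.json` by hand. A sketch, assuming the package install above succeeded:

```shell
# Let the toolkit configure Docker's runtime entry
# (writes the nvidia runtime into /etc/docker/daemon.json)
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker so the new runtime takes effect
sudo systemctl restart docker
```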
## 2. Verify GPU Accessibility
Run the following command to ensure Docker can successfully access your NVIDIA GPU:
```bash
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
```
If configured correctly, this command will output a table showing your GPU model and driver version.
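For setup scripts or CI, this verification can be wrapped so the job fails fast when no GPU is detected. A sketch, where `smi_shows_gpu` is a hypothetical helper that inspects the command's output for the driver-version banner:

```shell
# smi_shows_gpu — hypothetical helper: succeed only if nvidia-smi output
# contains the driver-version banner, i.e. a GPU was actually detected.
smi_shows_gpu() {
  echo "$1" | grep -q "Driver Version"
}

# Usage against the real container (requires the toolkit from step 1):
# out=$(docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi)
# smi_shows_gpu "$out" || { echo "GPU not visible inside containers" >&2; exit 1; }
```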
## 3. Docker Compose Configuration
The project uses the `nvidia` runtime within Docker Compose. The `docker/docker-compose.dev.yml` file is pre-configured to grant the `backend` and Ollama services access to the host GPU.
Key configuration used in the services:
```yaml
services:
  backend:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
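To confirm the compose file actually renders a GPU reservation, `docker compose config` prints the fully resolved configuration, which can then be inspected. A sketch, where `has_gpu_reservation` is a hypothetical grep helper:

```shell
# has_gpu_reservation — hypothetical helper: check a rendered compose
# configuration for an nvidia device reservation.
has_gpu_reservation() {
  echo "$1" | grep -q "driver: nvidia"
}

# Usage (requires Docker Compose and the project checkout):
# has_gpu_reservation "$(docker compose -f docker/docker-compose.dev.yml config)" \
#   || echo "no GPU reservation found in rendered config" >&2
```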
## 4. Hardware Sizing & Model Requirements
The default models pulled during the Quick Start have the following approximate hardware requirements:
| Component | Model | VRAM Requirement | Acceleration |
|-----------|-------|------------------|--------------|
| ASR | Faster-Whisper (Large-v3) | ~4 GB | CUDA (CTranslate2) |
| Vision | LLaVA:7b | ~4.5 GB | Ollama (CUDA) |
| LLM | Llama 3.1:8b | ~5.5 GB | Ollama (CUDA) |
Note: If you have limited VRAM (e.g., 8 GB), you may encounter "Out of Memory" (OOM) errors when the ASR and LLM services attempt to load into VRAM simultaneously. In that case, consider using quantized models or offloading the LLM to a cloud provider by setting `LLM_PROVIDER=openrouter` in your `.env` file.
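To make the note concrete: the three defaults together need roughly 4 + 4.5 + 5.5 ≈ 14 GB when resident simultaneously, which is why an 8 GB card can OOM. A sketch of the arithmetic, using the approximate figures from the table above (in MiB):

```shell
# Approximate per-model VRAM figures from the table above, in MiB.
ASR_MB=4096      # Faster-Whisper Large-v3, ~4 GB
VISION_MB=4608   # LLaVA:7b, ~4.5 GB
LLM_MB=5632      # Llama 3.1:8b, ~5.5 GB

# fits_in_vram — rough sketch: do all three default models fit at once?
fits_in_vram() {  # $1 = available VRAM in MiB
  [ "$1" -ge $((ASR_MB + VISION_MB + LLM_MB)) ]
}
```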
## 5. Troubleshooting CUDA Initialization
If the backend logs show `CUDA_ERROR_NO_DEVICE` or fall back to CPU processing:

- Check Environment Variables: Ensure `CUDA_VISIBLE_DEVICES` is not being restricted in your shell.
- Update Toolkit: Ensure the `nvidia-container-runtime` is the default runtime in `/etc/docker/daemon.json`:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

- Log Check: Run `docker logs ollama` to verify that the LLaVA model successfully detected the GPU during initialization.
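The `CUDA_VISIBLE_DEVICES` check can be done mechanically: an unset variable means all GPUs are visible, while an empty string hides every device. A sketch, where `cuda_unrestricted` is a hypothetical helper:

```shell
# cuda_unrestricted — hypothetical helper: succeed unless
# CUDA_VISIBLE_DEVICES is set to a value that hides all GPUs.
#   Unset            -> all devices visible (OK)
#   Set, non-empty   -> restricted to the listed devices
#   Set, empty ("")  -> ALL devices hidden (the problem case)
cuda_unrestricted() {
  if [ -z "${CUDA_VISIBLE_DEVICES+set}" ]; then
    return 0                        # unset: no restriction
  fi
  [ -n "$CUDA_VISIBLE_DEVICES" ]    # empty string hides every GPU
}
```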