
Running Local AI in Your Homelab: GPU Setup for Private LLMs

AI · 2026-03-04 · 5 min read
Tags: local-ai, ollama, llm, gpu, homelab-ai, llama
By the HomeLab Starter Editorial Team: home lab enthusiasts covering hardware setup, networking, and self-hosted services for home and small office environments.

Running AI models locally has become practical. A modern consumer GPU can run 7-billion-parameter models at usable speeds, and 70-billion-parameter models at slower but still workable speeds with enough VRAM. The appeal: complete privacy (no data leaves your network), no API costs, no rate limits, and the ability to run fine-tuned or uncensored models unavailable from cloud providers.


This guide covers setting up local AI inference in your homelab — hardware requirements, software stacks, and practical deployment patterns.

Ollama running a local LLM with GPU acceleration on a homelab server

Hardware Requirements

VRAM: The Primary Bottleneck

Local LLM performance is primarily constrained by GPU VRAM. The model must fit in VRAM — anything that spills to system RAM runs at CPU speed, which is 10-50x slower.

VRAM requirements by model size:

Model Size   Quantization      VRAM Required
7B           Q4_K_M (4-bit)    ~5-6 GB
7B           Q8_0 (8-bit)      ~8 GB
13B          Q4_K_M            ~9-10 GB
13B          Q8_0              ~14 GB
34B          Q4_K_M            ~20 GB
70B          Q4_K_M            ~40 GB
70B          Q8_0              ~75 GB

Quantization trades quality for memory efficiency. Q4_K_M is the practical sweet spot: minimal quality loss at roughly 30% of the full-precision (FP16) size.
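The figures in the table above follow a simple rule of thumb: weight memory is roughly parameters × bits-per-weight ÷ 8, plus overhead for the KV cache and runtime buffers. A minimal sketch (the ~20% overhead factor and the ~4.8 effective bits for Q4_K_M are approximations, not exact values):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight memory plus ~20% for KV cache and buffers."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weight_gb * overhead

# A 7B model at ~4.8 effective bits (Q4_K_M) lands near the table's 5-6 GB:
print(round(estimate_vram_gb(7, 4.8), 1))  # → 5.0
```

Useful for a quick sanity check before downloading a multi-gigabyte model that will not fit.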

GPU Recommendations by Budget

VRAM ≥ 8GB (runs 7B models well):

VRAM ≥ 16GB (runs 13B, handles 34B with quality tradeoffs):

VRAM ≥ 40GB (runs 70B models):

For most homelab use — code assistance, document Q&A, summarization — a 7B model on an RTX 3060 12GB is excellent. 13B models on 16GB GPUs produce noticeably better output for complex reasoning tasks.

CPU-Only Inference

Running without a GPU is possible via llama.cpp with CPU optimizations (AVX2, AVX-512). A modern 8-core CPU can typically run a quantized 7B model at a few tokens per second.

CPU inference is practical for offline summarization of long documents, batch processing, or testing — not for interactive chat.

Software Stack Options

Ollama (Recommended for Most Users)

Ollama is the easiest path to local LLM inference. It handles model downloads, GPU detection, and provides a simple API.

Installation on Linux:

curl -fsSL https://ollama.com/install.sh | sh

Pull and run a model:

# Pull Llama 3.1 8B (good general-purpose model):
ollama pull llama3.1:8b

# Interactive chat:
ollama run llama3.1:8b

# List downloaded models:
ollama list

API usage (compatible with OpenAI API format):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Explain VLAN segmentation briefly"}]
  }'

Ollama automatically detects NVIDIA GPUs (via CUDA) and AMD GPUs (via ROCm on Linux). On macOS, it uses Metal for M-series GPU acceleration.
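Because the endpoint speaks the OpenAI API format, it is easy to call from scripts with nothing but the Python standard library. A sketch (assumes Ollama is listening on localhost:11434 with llama3.1:8b pulled):

```python
import json
import urllib.request

def chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a request for Ollama's OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Usage (with the Ollama server running):
#   with urllib.request.urlopen(chat_request("llama3.1:8b", "hi")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library should also work by pointing its base URL at port 11434.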

Docker Deployment (NVIDIA)

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama-data:
  open-webui-data:

Open WebUI provides a ChatGPT-like interface for interacting with your local models.

Prerequisite: NVIDIA Container Toolkit must be installed:

# Ubuntu/Debian:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

AMD GPU Setup (ROCm)

AMD GPUs require ROCm for GPU acceleration. Support varies by model:

Supported (good ROCm support): RX 6000/7000 series, Instinct MI series
Limited support: RX 5000 series (may need HSA_OVERRIDE_GFX_VERSION)

# Install ROCm (Ubuntu 22.04):
sudo apt install -y rocm-dev rocm-libs rocminfo

# Verify GPU detection:
rocminfo | grep "gfx"

# Ollama auto-detects ROCm:
ollama run llama3.1:8b

For AMD GPUs not officially supported, set the HSA override:

# Example for RX 5700 XT (gfx1010):
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama run llama3.1:8b

llama.cpp (Advanced)

llama.cpp is the underlying C++ inference engine that Ollama uses. Running it directly gives more control:

# Clone and build with CUDA support:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Download a GGUF model (Llama 3.1 8B Q4):
# From huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Run inference:
./build/bin/llama-cli \
  -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -n 512 \
  --gpu-layers 99 \
  -p "What is a VLAN?"

--gpu-layers 99 offloads all layers to the GPU. Reduce this number to split the model between GPU VRAM and system RAM if the model doesn't fit.
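When the model doesn't fully fit, a back-of-the-envelope way to pick --gpu-layers is to divide your free VRAM by the approximate per-layer size. A sketch, assuming roughly equal-sized layers (the example model size and layer count are illustrative):

```python
def gpu_layers_that_fit(vram_budget_gb: float, model_size_gb: float,
                        n_layers: int) -> int:
    """Estimate how many transformer layers fit in a VRAM budget,
    assuming layers are roughly equal in size."""
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, int(vram_budget_gb / per_layer_gb))

# e.g. a ~9.5 GB 13B Q4 model with 40 layers against 8 GB of free VRAM:
print(gpu_layers_that_fit(8.0, 9.5, 40))  # → 33
```

Leave a gigabyte or so of headroom for the KV cache; start a few layers below the estimate and increase until you hit an out-of-memory error.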

Model Selection

Code assistance: deepseek-coder-v2:16b (used for autocomplete in the Continue.dev example in this guide)

General chat/reasoning: llama3.1:8b (the general-purpose model used throughout this guide)

Embeddings (for RAG applications):

Pull via Ollama: ollama pull modelname:tag


Practical Homelab Use Cases

1. Code Assistant (Continue.dev + Ollama)

Continue.dev is a VS Code extension that connects to local Ollama models for code completion and chat:

// ~/.continue/config.json
{
  "models": [{
    "title": "Llama 3.1 8B",
    "provider": "ollama",
    "model": "llama3.1:8b"
  }],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder",
    "provider": "ollama",
    "model": "deepseek-coder-v2:16b"
  }
}

2. Document Q&A (RAG with Ollama + Chroma)

Use a retrieval-augmented generation (RAG) pipeline to query your homelab documentation:

Open WebUI natively supports RAG — upload PDFs, text files, or entire document libraries, then query them with natural language.
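Under the hood, the retrieval step of a RAG pipeline is nearest-neighbor search over embeddings. A dependency-free sketch of just that step (in practice the vectors would come from an embedding model served by Ollama, and Chroma would handle storage and indexing; the toy 3-dimensional vectors below are purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, docs, k=2):
    """docs: list of (text, embedding) pairs. Returns the k most similar texts."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

docs = [
    ("WireGuard config notes", [0.9, 0.1, 0.0]),
    ("Proxmox backup schedule", [0.1, 0.9, 0.0]),
    ("VLAN layout", [0.0, 0.2, 0.9]),
]
print(top_k([0.85, 0.2, 0.1], docs, k=1))  # → ['WireGuard config notes']
```

The retrieved chunks are then prepended to the prompt so the model answers from your documents rather than from memory alone.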

3. Automation with LangChain

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.1:8b", base_url="http://your-server:11434")

response = llm.invoke("Summarize the key steps to configure WireGuard on OpenWrt")
print(response)

Performance Optimization

Context length tradeoffs: Longer context = more VRAM. For interactive chat, 4K context is sufficient. Only increase for long-document tasks.
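The context-length cost comes from the KV cache, which grows linearly with context: 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per value. A sketch using Llama-3.1-8B-shaped numbers (32 layers, 8 KV heads, head dim 128; these architecture values are assumptions for illustration):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per=2):
    """KV cache size: keys + values for every layer, KV head, and token (FP16)."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per
    return total_bytes / 1024**3

# 4K vs 32K context for an 8B-class model:
print(kv_cache_gb(32, 8, 128, 4096))   # → 0.5
print(kv_cache_gb(32, 8, 128, 32768))  # → 4.0
```

An 8x longer context costs 8x the cache, which is why a model that fits comfortably at 4K can overflow VRAM at 32K.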

Parallel requests: Ollama handles concurrent requests but each consumes additional VRAM. On 12GB, running two simultaneous 7B sessions may cause VRAM overflow.

Flash Attention: Ollama supports Flash Attention on recent NVIDIA GPUs (set OLLAMA_FLASH_ATTENTION=1 on releases where it is not enabled by default). It reduces VRAM usage for long contexts.

Quantization selection: start with Q4_K_M; move to Q8_0 only when you have VRAM to spare and want maximum quality.

Power Consumption

GPU inference draws significant power:

GPU               Inference Power   Idle Power
RTX 3060 12GB     130-170 W         10-15 W
RTX 3090          250-320 W         15-20 W
RTX 4090          350-450 W         15-25 W
AMD RX 7900 XTX   250-310 W         15-20 W

For a homelab server running inference occasionally, the power costs are manageable. For 24/7 inference, an always-on M-series Mac mini (20-40W total) may be more economical than an NVIDIA GPU workstation.
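The economics are easy to sanity-check. Assuming, say, $0.15/kWh (your rate will differ), annual cost for a constant draw:

```python
def annual_cost_usd(watts: float, hours_per_day: float = 24,
                    usd_per_kwh: float = 0.15) -> float:
    """Electricity cost per year for a constant power draw."""
    kwh_per_year = watts / 1000 * hours_per_day * 365
    return kwh_per_year * usd_per_kwh

# 24/7 inference: RTX 3090 at ~280 W vs. an M-series Mac mini at ~35 W total:
print(round(annual_cost_usd(280), 2))  # → 367.92
print(round(annual_cost_usd(35), 2))   # → 45.99
```

For occasional use, lower the hours_per_day argument; the gap between the two platforms shrinks accordingly.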

Getting Started

The fastest path:

  1. Check if your GPU is NVIDIA or AMD and its VRAM
  2. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
  3. Pull a model: ollama pull llama3.1:8b
  4. Test it: ollama run llama3.1:8b
  5. Deploy Open WebUI via Docker for a proper interface
  6. Explore model variants based on your VRAM

Local AI inference has crossed the threshold from interesting experiment to practical daily tool for many homelab operators. A few hundred dollars for a used GPU (an RTX 3060 12GB is a common starting point) and an afternoon of setup gets you a private, capable AI assistant with no recurring costs and no data leaving your network.

Get free weekly tips in your inbox. Subscribe to HomeLab Starter