Running Local AI in Your Homelab: GPU Setup for Private LLMs
Running AI models locally has become practical. A modern consumer GPU can run 7-billion-parameter models at usable speeds, and 70-billion-parameter models at slower but still workable speeds with enough VRAM. The appeal: complete privacy (no data leaves your network), no API costs, no rate limits, and the ability to run fine-tuned or uncensored models unavailable from cloud providers.
This guide covers setting up local AI inference in your homelab — hardware requirements, software stacks, and practical deployment patterns.

Hardware Requirements
VRAM: The Primary Bottleneck
Local LLM performance is primarily constrained by GPU VRAM. The model must fit in VRAM — anything that spills to system RAM runs at CPU speed, which is 10-50x slower.
VRAM requirements by model size:
| Model Size | Quantization | VRAM Required |
|---|---|---|
| 7B | Q4_K_M (4-bit) | ~5-6 GB |
| 7B | Q8_0 (8-bit) | ~8 GB |
| 13B | Q4_K_M | ~9-10 GB |
| 13B | Q8_0 | ~14 GB |
| 34B | Q4_K_M | ~20 GB |
| 70B | Q4_K_M | ~40 GB |
| 70B | Q8_0 | ~75 GB |
Quantization trades quality for memory efficiency. Q4_K_M is the practical sweet spot — minimal quality loss at roughly 30% of the full-precision (FP16) size.
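As a rough rule of thumb, required VRAM is parameters × bits-per-weight ÷ 8, plus some fixed overhead for the KV cache and runtime buffers. A minimal sketch — the bits-per-weight and overhead figures are assumptions for illustration, not measured values:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: quantized weights plus a fixed overhead
    (assumed here) for the KV cache and runtime buffers."""
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# 7B at ~4.8 bits/weight (Q4_K_M) lands in the table's 5-6 GB range
print(round(estimate_vram_gb(7, 4.8), 1))
```

This is why a 70B model at Q4 needs a ~40GB-class setup: the weights alone are about 42 GB before any overhead.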
GPU Recommendations by Budget
12-16GB VRAM (runs 7B models well, 13B at 4-bit):
- RTX 3060 12GB (~$250-300 used) — 12GB VRAM is a sweet spot
- RTX 4060 Ti 16GB (~$400) — 16GB, newer architecture
- AMD RX 6800 XT 16GB (~$300 used) — good ROCm support
24GB VRAM (runs 13B comfortably, handles 34B with quality tradeoffs):
- RTX 3090 24GB (~$500-600 used) — 24GB at consumer pricing
- RTX 4090 24GB (~$1,600) — fastest consumer card, 24GB
- AMD RX 7900 XTX 24GB (~$900) — 24GB, competitive ROCm support
VRAM ≥ 40GB (runs 70B models):
- 2× RTX 3090 (tensor parallel) — ~$1,000-1,200 total
- A100 40GB/80GB (data center, expensive used)
- Mac Studio with M2 Ultra (up to 192GB unified memory — but slow compared to NVIDIA)
For most homelab use — code assistance, document Q&A, summarization — a 7B model on an RTX 3060 12GB is excellent. 13B models on 16GB GPUs produce noticeably better output for complex reasoning tasks.
CPU-Only Inference
Running without a GPU is possible via llama.cpp with CPU optimizations (AVX2, AVX-512). A modern 8-core CPU can run:
- 7B Q4: ~5-15 tokens/second (readable but slow for interactive use)
- 13B Q4: ~2-8 tokens/second (slow)
CPU inference is practical for offline summarization of long documents, batch processing, or testing — not for interactive chat.
Software Stack Options
Ollama (Recommended for Most Users)
Ollama is the easiest path to local LLM inference. It handles model downloads, GPU detection, and provides a simple API.
Installation on Linux:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Pull and run a model:
```bash
# Pull Llama 3.1 8B (good general-purpose model):
ollama pull llama3.1:8b
# Interactive chat:
ollama run llama3.1:8b
# List downloaded models:
ollama list
```
API usage (compatible with OpenAI API format):
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Explain VLAN segmentation briefly"}]
  }'
```
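The same endpoint can be called from scripts with nothing but the standard library. A minimal sketch — `build_chat_request` and `ask_ollama` are hypothetical helper names, and it assumes Ollama is listening on the default port 11434:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # default Ollama port

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-format chat payload for Ollama's compatible endpoint."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask_ollama(model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply (needs a running server)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# ask_ollama("llama3.1:8b", "Explain VLAN segmentation briefly")  # requires a running Ollama server
```

Because the format is OpenAI-compatible, the official `openai` Python client also works by pointing its `base_url` at Ollama.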
Ollama automatically detects NVIDIA GPUs (via CUDA) and AMD GPUs (via ROCm on Linux). On macOS, it uses Metal for M-series GPU acceleration.
Docker Deployment (NVIDIA)
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama-data:
  open-webui-data:
```
Open WebUI provides a ChatGPT-like interface for interacting with your local models.
Prerequisite: NVIDIA Container Toolkit must be installed:
```bash
# Ubuntu/Debian:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
AMD GPU Setup (ROCm)
AMD GPUs require ROCm for GPU acceleration. Support varies by model:
Supported (good ROCm support): RX 6000/7000 series, Instinct MI series
Limited support: RX 5000 series (may need HSA_OVERRIDE_GFX_VERSION)
```bash
# Install ROCm (Ubuntu 22.04; requires AMD's ROCm apt repository — see AMD's install guide):
sudo apt install -y rocm-dev rocm-libs rocminfo
# Verify GPU detection:
rocminfo | grep "gfx"
# Ollama auto-detects ROCm:
ollama run llama3.1:8b
```
For AMD GPUs not officially supported, set the HSA override:
```bash
# Example for RX 5700 XT (gfx1010):
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama run llama3.1:8b
```
llama.cpp (Advanced)
llama.cpp is the underlying C++ inference engine that Ollama uses. Running it directly gives more control:
```bash
# Clone and build with CUDA support:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Download a GGUF model (Llama 3.1 8B Q4):
# From huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Run inference:
./build/bin/llama-cli \
  -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -n 512 \
  --gpu-layers 99 \
  -p "What is a VLAN?"
```
`--gpu-layers 99` offloads all layers to the GPU. Reduce this number to split the model between GPU VRAM and system RAM if needed.
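Choosing a layer count for a partial offload is simple arithmetic: each transformer layer holds roughly an equal share of the weights, so divide your usable VRAM by the per-layer size. A back-of-the-envelope sketch (the headroom reserve is an assumption; real KV cache usage grows with context length):

```python
def max_gpu_layers(model_size_gb: float, n_layers: int,
                   vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM, leaving headroom for the KV cache.
    Assumes weights are spread evenly across layers (roughly true in practice)."""
    per_layer_gb = model_size_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# A ~4.9 GB Q4 8B model (32 layers) on an 8 GB card: all layers fit
print(max_gpu_layers(4.9, 32, 8.0))

# A ~40 GB Q4 70B model (80 layers) on a 24 GB card: partial offload
print(max_gpu_layers(40.0, 80, 24.0))
```

Pass the result as `--gpu-layers N`; the remaining layers run on the CPU at a correspondingly lower speed.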
Model Selection
Code assistance:
- `codellama:13b` — Meta's code-focused model, good for Python/JS/Go
- `deepseek-coder-v2:16b` — strong coding performance, 16GB VRAM required
- `qwen2.5-coder:7b` — surprisingly capable 7B code model
General chat/reasoning:
- `llama3.1:8b` — good balance of quality and speed for 8GB VRAM
- `llama3.1:70b` — much better reasoning, requires 40GB VRAM
- `mistral:7b` — fast, good for instruction following
- `gemma2:9b` — Google's 9B model, strong at reasoning
Embeddings (for RAG applications):
- `nomic-embed-text` — fast, 768-dim embeddings
- `mxbai-embed-large` — higher quality, 1024-dim
Pull via Ollama: `ollama pull modelname:tag`
Practical Homelab Use Cases
1. Code Assistant (Continue.dev + Ollama)
Continue.dev is a VS Code extension that connects to local Ollama models for code completion and chat:
```json
// ~/.continue/config.json
{
  "models": [{
    "title": "Llama 3.1 8B",
    "provider": "ollama",
    "model": "llama3.1:8b"
  }],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder",
    "provider": "ollama",
    "model": "deepseek-coder-v2:16b"
  }
}
```
2. Document Q&A (RAG with Ollama + Chroma)
Use a retrieval-augmented generation (RAG) pipeline to query your homelab documentation:
- Embedding model: `nomic-embed-text` (`ollama pull nomic-embed-text`)
- Vector store: Chroma (runs in Docker)
- Frontend: Open WebUI's document upload feature
Open WebUI natively supports RAG — upload PDFs, text files, or entire document libraries, then query them with natural language.
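The retrieval step of that pipeline can be sketched in plain Python. In a real deployment, `embed()` would call the `nomic-embed-text` model via Ollama's embeddings API and the vectors would live in Chroma; here it is replaced by a toy bag-of-words embedder so the ranking logic stands on its own:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real model call."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "WireGuard VPN setup on OpenWrt routers",
    "Configuring VLAN segmentation on managed switches",
    "Proxmox backup strategies with PBS",
]
print(retrieve("VLAN segmentation on switches", docs))
```

The retrieved chunks are then prepended to the chat prompt so the LLM answers from your documents instead of its training data.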
3. Automation with LangChain
```python
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.1:8b", base_url="http://your-server:11434")
response = llm.invoke("Summarize the key steps to configure WireGuard on OpenWrt")
print(response)
```
Performance Optimization
Context length tradeoffs: Longer context = more VRAM. For interactive chat, 4K context is sufficient. Only increase for long-document tasks.
Parallel requests: Ollama handles concurrent requests but each consumes additional VRAM. On 12GB, running two simultaneous 7B sessions may cause VRAM overflow.
Flash Attention: Ollama supports Flash Attention on recent NVIDIA GPUs; if it is not enabled by default in your version, set `OLLAMA_FLASH_ATTENTION=1`. It reduces VRAM use for long contexts.
Quantization selection:
- Q2_K: Smallest, lowest quality (avoid for important tasks)
- Q4_K_M: Best balance for most use cases
- Q6_K: Near-full quality, ~40% of full precision size
- Q8_0: Essentially identical to full precision, 50% size
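The size differences follow directly from bits per weight. The figures below are approximations (GGUF quants carry some per-block metadata, so real files run slightly larger), useful as a sanity check rather than exact file sizes:

```python
# Approximate effective bits per weight for common GGUF quantizations (assumed figures)
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def file_size_gb(params_billions: float, quant: str) -> float:
    """Approximate model file size: parameters times bits per weight."""
    return params_billions * 1e9 * BPW[quant] / 8 / 1e9

for q in BPW:
    print(f"8B {q}: {file_size_gb(8, q):.1f} GB")
```

Comparing each quant's estimate to your card's VRAM (minus a GB or two of headroom) tells you which variants are viable before downloading anything.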
Power Consumption
GPU inference draws significant power:
| GPU | Inference Power | Idle Power |
|---|---|---|
| RTX 3060 12GB | 130-170W | 10-15W |
| RTX 3090 | 250-320W | 15-20W |
| RTX 4090 | 350-450W | 15-25W |
| AMD RX 7900 XTX | 250-310W | 15-20W |
For a homelab server running inference occasionally, the power costs are manageable. For 24/7 inference, an always-on M-series Mac mini (20-40W total) may be more economical than an NVIDIA GPU workstation.
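To put numbers on "manageable," electricity cost is just draw × hours × rate. A quick sketch — the $0.15/kWh rate is an assumed default; substitute your local tariff:

```python
def monthly_cost_usd(watts: float, hours_per_day: float,
                     usd_per_kwh: float = 0.15) -> float:
    """Monthly electricity cost of running a GPU at a given average draw.
    The default $/kWh rate is an assumption; use your local rate."""
    kwh_per_month = watts / 1000 * hours_per_day * 30
    return kwh_per_month * usd_per_kwh

# RTX 3090 at ~300 W, 2 hours of inference per day: a few dollars per month
print(round(monthly_cost_usd(300, 2), 2))

# The same card running inference 24/7 is a different story
print(round(monthly_cost_usd(300, 24), 2))
```

Idle draw matters too: a card idling at 20 W around the clock costs about as much as an hour of daily inference.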
Getting Started
The fastest path:
- Check if your GPU is NVIDIA or AMD and its VRAM
- Install Ollama: `curl -fsSL https://ollama.com/install.sh | sh`
- Pull a model: `ollama pull llama3.1:8b`
- Test it: `ollama run llama3.1:8b`
- Deploy Open WebUI via Docker for a proper interface
- Explore model variants based on your VRAM
Local AI inference has crossed the threshold from interesting experiment to practical daily tool for many homelab operators. A ~$300 used RTX 3060 12GB and an afternoon of setup gets you a private, capable AI assistant with no recurring costs and no data leaving your network.
