Running Local AI in Your Homelab: GPU Setup for Private LLMs
Running AI models locally has become practical. A modern consumer GPU can run 7-billion-parameter models at usable speeds, and 70-billion-parameter models at slower but still workable speeds with enough VRAM. The appeal: complete privacy (no data leaves your network), no API costs, no rate limits, and the ability to run fine-tuned or uncensored models unavailable from cloud providers.
This guide covers setting up local AI inference in your homelab — hardware requirements, software stacks, and practical deployment patterns.

Hardware Requirements
VRAM: The Primary Bottleneck
Local LLM performance is primarily constrained by GPU VRAM. The model must fit in VRAM — anything that spills to system RAM runs at CPU speed, which is 10-50x slower.
VRAM requirements by model size:
| Model Size | Quantization | VRAM Required |
|---|---|---|
| 7B | Q4_K_M (4-bit) | ~5-6 GB |
| 7B | Q8_0 (8-bit) | ~8 GB |
| 13B | Q4_K_M | ~9-10 GB |
| 13B | Q8_0 | ~14 GB |
| 34B | Q4_K_M | ~20 GB |
| 70B | Q4_K_M | ~40 GB |
| 70B | Q8_0 | ~75 GB |
Quantization trades quality for memory efficiency. Q4_K_M is the practical sweet spot — minimal quality loss at roughly 30% of the full-precision (FP16) size.
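As a rough rule of thumb, required VRAM is parameters × bits-per-weight ÷ 8, plus some fixed overhead for the KV cache and runtime buffers. A minimal sketch — the bits-per-weight and overhead figures are assumptions for illustration, not measured values:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: quantized weights plus a fixed overhead
    (assumed here) for the KV cache and runtime buffers."""
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# 7B at ~4.8 bits/weight (Q4_K_M) lands in the table's 5-6 GB range
print(round(estimate_vram_gb(7, 4.8), 1))
```

This is why a 70B model at Q4 needs a ~40GB-class setup: the weights alone are about 42 GB before any overhead.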
GPU Recommendations by Budget
12-16GB VRAM (runs 7B models well, 13B at 4-bit):
- RTX 3060 12GB (~$250-300 used) — 12GB VRAM is a sweet spot
- RTX 4060 Ti 16GB (~$400) — 16GB, newer architecture
- AMD RX 6800 XT 16GB (~$300 used) — good ROCm support
24GB VRAM (runs 13B comfortably, handles 34B with quality tradeoffs):
- RTX 3090 24GB (~$500-600 used) — 24GB at consumer pricing
- RTX 4090 24GB (~$1,600) — fastest consumer card, 24GB
- AMD RX 7900 XTX 24GB (~$900) — 24GB, competitive ROCm support
VRAM ≥ 40GB (runs 70B models):
- 2× RTX 3090 (tensor parallel) — ~$1,000-1,200 total
- A100 40GB/80GB (data center, expensive used)
- Mac Studio with M2 Ultra (up to 192GB unified memory — but slow compared to NVIDIA)
For most homelab use — code assistance, document Q&A, summarization — a 7B model on an RTX 3060 12GB is excellent. 13B models on 16GB GPUs produce noticeably better output for complex reasoning tasks.
CPU-Only Inference
Running without a GPU is possible via llama.cpp with CPU optimizations (AVX2, AVX-512). A modern 8-core CPU can run:
- 7B Q4: ~5-15 tokens/second (readable but slow for interactive use)
- 13B Q4: ~2-8 tokens/second (slow)
CPU inference is practical for offline summarization of long documents, batch processing, or testing — not for interactive chat.
Software Stack Options
Ollama (Recommended for Most Users)
Ollama is the easiest path to local LLM inference. It handles model downloads, GPU detection, and provides a simple API.
Installation on Linux:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Pull and run a model:
```bash
# Pull Llama 3.1 8B (good general-purpose model):
ollama pull llama3.1:8b
# Interactive chat:
ollama run llama3.1:8b
# List downloaded models:
ollama list
```
API usage (compatible with OpenAI API format):
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Explain VLAN segmentation briefly"}]
  }'
```
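The same endpoint can be called from scripts with nothing but the standard library. A minimal sketch — `build_chat_request` and `ask_ollama` are hypothetical helper names, and it assumes Ollama is listening on the default port 11434:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # default Ollama port

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-format chat payload for Ollama's compatible endpoint."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask_ollama(model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply (needs a running server)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# ask_ollama("llama3.1:8b", "Explain VLAN segmentation briefly")  # requires a running Ollama server
```

Because the format is OpenAI-compatible, the official `openai` Python client also works by pointing its `base_url` at Ollama.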
Ollama automatically detects NVIDIA GPUs (via CUDA) and AMD GPUs (via ROCm on Linux). On macOS, it uses Metal for M-series GPU acceleration.
Docker Deployment (NVIDIA)
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama-data:
  open-webui-data:
```
Open WebUI provides a ChatGPT-like interface for interacting with your local models.
Prerequisite: NVIDIA Container Toolkit must be installed:
```bash
# Ubuntu/Debian:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
AMD GPU Setup (ROCm)
AMD GPUs require ROCm for GPU acceleration. Support varies by model:
Supported (good ROCm support): RX 6000/7000 series, Instinct MI series
Limited support: RX 5000 series (may need HSA_OVERRIDE_GFX_VERSION)
```bash
# Install ROCm (Ubuntu 22.04; requires AMD's ROCm apt repository — see AMD's install guide):
sudo apt install -y rocm-dev rocm-libs rocminfo
# Verify GPU detection:
rocminfo | grep "gfx"
# Ollama auto-detects ROCm:
ollama run llama3.1:8b
```
For AMD GPUs not officially supported, set the HSA override:
```bash
# Example for RX 5700 XT (gfx1010):
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama run llama3.1:8b
```
llama.cpp (Advanced)
llama.cpp is the underlying C++ inference engine that Ollama uses. Running it directly gives more control:
```bash
# Clone and build with CUDA support:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Download a GGUF model (Llama 3.1 8B Q4):
# From huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Run inference:
./build/bin/llama-cli \
  -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -n 512 \
  --gpu-layers 99 \
  -p "What is a VLAN?"
```
`--gpu-layers 99` offloads all layers to the GPU. Reduce this number to split the model between GPU VRAM and system RAM if needed.
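Choosing a layer count for a partial offload is simple arithmetic: each transformer layer holds roughly an equal share of the weights, so divide your usable VRAM by the per-layer size. A back-of-the-envelope sketch (the headroom reserve is an assumption; real KV cache usage grows with context length):

```python
def max_gpu_layers(model_size_gb: float, n_layers: int,
                   vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM, leaving headroom for the KV cache.
    Assumes weights are spread evenly across layers (roughly true in practice)."""
    per_layer_gb = model_size_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# A ~4.9 GB Q4 8B model (32 layers) on an 8 GB card: all layers fit
print(max_gpu_layers(4.9, 32, 8.0))

# A ~40 GB Q4 70B model (80 layers) on a 24 GB card: partial offload
print(max_gpu_layers(40.0, 80, 24.0))
```

Pass the result as `--gpu-layers N`; the remaining layers run on the CPU at a correspondingly lower speed.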
Model Selection
Code assistance:
- `codellama:13b` — Meta's code-focused model, good for Python/JS/Go
- `deepseek-coder-v2:16b` — strong coding performance, 16GB VRAM required
- `qwen2.5-coder:7b` — surprisingly capable 7B code model
General chat/reasoning:
- `llama3.1:8b` — good balance of quality and speed for 8GB VRAM
- `llama3.1:70b` — much better reasoning, requires 40GB VRAM
- `mistral:7b` — fast, good for instruction following
- `gemma2:9b` — Google's 9B model, strong at reasoning
Embeddings (for RAG applications):
- `nomic-embed-text` — fast, 768-dim embeddings
- `mxbai-embed-large` — higher quality, 1024-dim
Pull via Ollama: `ollama pull modelname:tag`
Practical Homelab Use Cases
1. Code Assistant (Continue.dev + Ollama)
Continue.dev is a VS Code extension that connects to local Ollama models for code completion and chat:
```json
// ~/.continue/config.json
{
  "models": [{
    "title": "Llama 3.1 8B",
    "provider": "ollama",
    "model": "llama3.1:8b"
  }],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder",
    "provider": "ollama",
    "model": "deepseek-coder-v2:16b"
  }
}
```
2. Document Q&A (RAG with Ollama + Chroma)
Use a retrieval-augmented generation (RAG) pipeline to query your homelab documentation:
- Embedding model: `nomic-embed-text` (`ollama pull nomic-embed-text`)
- Vector store: Chroma (runs in Docker)
- Frontend: Open WebUI's document upload feature
Open WebUI natively supports RAG — upload PDFs, text files, or entire document libraries, then query them with natural language.
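The retrieval step of that pipeline can be sketched in plain Python. In a real deployment, `embed()` would call the `nomic-embed-text` model via Ollama's embeddings API and the vectors would live in Chroma; here it is replaced by a toy bag-of-words embedder so the ranking logic stands on its own:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real model call."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "WireGuard VPN setup on OpenWrt routers",
    "Configuring VLAN segmentation on managed switches",
    "Proxmox backup strategies with PBS",
]
print(retrieve("VLAN segmentation on switches", docs))
```

The retrieved chunks are then prepended to the chat prompt so the LLM answers from your documents instead of its training data.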
3. Automation with LangChain
```python
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.1:8b", base_url="http://your-server:11434")
response = llm.invoke("Summarize the key steps to configure WireGuard on OpenWrt")
print(response)
```
Performance Optimization
Context length tradeoffs: Longer context = more VRAM. For interactive chat, 4K context is sufficient. Only increase for long-document tasks.
Parallel requests: Ollama handles concurrent requests but each consumes additional VRAM. On 12GB, running two simultaneous 7B sessions may cause VRAM overflow.
Flash Attention: Ollama supports Flash Attention on recent NVIDIA GPUs; if it is not enabled by default in your version, set `OLLAMA_FLASH_ATTENTION=1`. It reduces VRAM use for long contexts.
Quantization selection:
- Q2_K: Smallest, lowest quality (avoid for important tasks)
- Q4_K_M: Best balance for most use cases
- Q6_K: Near-full quality, ~40% of full precision size
- Q8_0: Essentially identical to full precision, 50% size
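The size differences follow directly from bits per weight. The figures below are approximations (GGUF quants carry some per-block metadata, so real files run slightly larger), useful as a sanity check rather than exact file sizes:

```python
# Approximate effective bits per weight for common GGUF quantizations (assumed figures)
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def file_size_gb(params_billions: float, quant: str) -> float:
    """Approximate model file size: parameters times bits per weight."""
    return params_billions * 1e9 * BPW[quant] / 8 / 1e9

for q in BPW:
    print(f"8B {q}: {file_size_gb(8, q):.1f} GB")
```

Comparing each quant's estimate to your card's VRAM (minus a GB or two of headroom) tells you which variants are viable before downloading anything.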
Power Consumption
GPU inference draws significant power:
| GPU | Inference Power | Idle Power |
|---|---|---|
| RTX 3060 12GB | 130-170W | 10-15W |
| RTX 3090 | 250-320W | 15-20W |
| RTX 4090 | 350-450W | 15-25W |
| AMD RX 7900 XTX | 250-310W | 15-20W |
For a homelab server running inference occasionally, the power costs are manageable. For 24/7 inference, an always-on M-series Mac mini (20-40W total) may be more economical than an NVIDIA GPU workstation.
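To put numbers on "manageable," electricity cost is just draw × hours × rate. A quick sketch — the $0.15/kWh rate is an assumed default; substitute your local tariff:

```python
def monthly_cost_usd(watts: float, hours_per_day: float,
                     usd_per_kwh: float = 0.15) -> float:
    """Monthly electricity cost of running a GPU at a given average draw.
    The default $/kWh rate is an assumption; use your local rate."""
    kwh_per_month = watts / 1000 * hours_per_day * 30
    return kwh_per_month * usd_per_kwh

# RTX 3090 at ~300 W, 2 hours of inference per day: a few dollars per month
print(round(monthly_cost_usd(300, 2), 2))

# The same card running inference 24/7 is a different story
print(round(monthly_cost_usd(300, 24), 2))
```

Idle draw matters too: a card idling at 20 W around the clock costs about as much as an hour of daily inference.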
Getting Started
The fastest path:
- Check if your GPU is NVIDIA or AMD and its VRAM
- Install Ollama: `curl -fsSL https://ollama.com/install.sh | sh`
- Pull a model: `ollama pull llama3.1:8b`
- Test it: `ollama run llama3.1:8b`
- Deploy Open WebUI via Docker for a proper interface
- Explore model variants based on your VRAM
Local AI inference has crossed the threshold from interesting experiment to practical daily tool for many homelab operators. A ~$300 used RTX 3060 12GB and an afternoon of setup gets you a private, capable AI assistant with no recurring costs and no data leaving your network.
