Prometheus and Grafana: Metrics Monitoring for the Homelab
Uptime monitoring (is it up?) and metrics monitoring (how is it performing?) are different. Uptime Kuma handles uptime. Prometheus + Grafana handles everything else: CPU usage over time, memory trends, disk I/O, network throughput, container resource consumption. When something is slow or degrading, metrics tell you where.
Architecture
- Prometheus: Scrapes metrics from targets (exporters) at configurable intervals. Stores time-series data. Evaluates alert rules.
- Exporters: Translate system/service metrics into Prometheus format. Node Exporter for Linux system metrics; cAdvisor for Docker container metrics; SNMP Exporter for network devices.
- Grafana: Queries Prometheus and visualizes the results in dashboards. Pre-built dashboards exist for almost everything.
- Alertmanager: Routes alerts from Prometheus to email, Slack, PagerDuty, etc.
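To make the exporter role concrete, the sketch below shows roughly what an exporter serves on its /metrics endpoint and how to list the metric names in it. The sample values are made up for illustration, and the helper names sample_metrics and metric_names are hypothetical.

```shell
# Illustrative sample of the plain-text format an exporter serves (values are made up).
sample_metrics() {
  cat <<'EOF'
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.5
node_memory_MemAvailable_bytes 4.2e+09
EOF
}

# Drop the HELP/TYPE comment lines, then keep only the text before the first "{" or space.
metric_names() {
  grep -v '^#' | sed -E 's/[{ ].*//' | sort -u
}

sample_metrics | metric_names
```

Against a live exporter, the same filter should work on real output, for example `curl -s http://localhost:9100/metrics | metric_names`.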
Docker Compose Stack
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
      - --web.enable-lifecycle # Allow config reload via HTTP
    ports:
      - 9090:9090
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      GF_SECURITY_ADMIN_PASSWORD: change-this
      GF_USERS_ALLOW_SIGN_UP: "false"
    ports:
      - 3000:3000
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.rootfs=/rootfs
      - --path.sysfs=/host/sys
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
    network_mode: host # For accurate network metrics
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    ports:
      - 8080:8080
    privileged: true
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - 9093:9093
volumes:
  prometheus-data:
  grafana-data:
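Once the stack is up, a quick smoke test can confirm that each service answers. The sketch below assumes it runs on the Docker host and that the port mappings match the compose file; ENDPOINTS and check_stack are hypothetical names chosen for this example.

```shell
# name=url pairs; ports assume the compose file's mappings on the Docker host.
ENDPOINTS="
prometheus=http://localhost:9090/-/healthy
grafana=http://localhost:3000/api/health
alertmanager=http://localhost:9093/-/healthy
node-exporter=http://localhost:9100/metrics
cadvisor=http://localhost:8080/healthz
"

# Request each endpoint and report whether it returned a success status.
check_stack() {
  for entry in $ENDPOINTS; do
    name=${entry%%=*}
    url=${entry#*=}
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "ok    $name"
    else
      echo "FAIL  $name ($url)"
    fi
  done
}

# Usage: docker compose up -d && check_stack
```

A FAIL line right after startup is not necessarily a problem, since some containers take a few seconds to become ready.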
Prometheus Configuration
prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
rule_files:
  - "alert.rules.yml"
scrape_configs:
  # Prometheus self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  # Linux host metrics
  - job_name: node
    static_configs:
      - targets:
          # node-exporter runs with network_mode: host, so the name "node-exporter"
          # does not resolve from the Prometheus container; use the Docker host's address.
          - 192.168.1.50:9100 # Docker host (placeholder IP, adjust to yours)
          - 192.168.1.51:9100 # Second host
          - 192.168.1.52:9100 # Third host
  # Docker container metrics
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
  # Proxmox (via PVE exporter)
  - job_name: proxmox
    static_configs:
      - targets: ["pve-exporter:9221"]
  # Additional exporters...
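After editing prometheus.yml, it helps to validate the file before asking Prometheus to load it. The sketch below assumes the container name prometheus and the 9090 port mapping from the compose file; reload_prometheus and PROM_URL are hypothetical names for this example.

```shell
# Base URL assumes the 9090:9090 port mapping on the Docker host.
PROM_URL="${PROM_URL:-http://localhost:9090}"

reload_prometheus() {
  # promtool ships in the prom/prometheus image; a failed check stops before the reload.
  docker exec prometheus promtool check config /etc/prometheus/prometheus.yml || return 1
  # The reload endpoint is available only when Prometheus runs with --web.enable-lifecycle.
  curl -fsS -X POST "$PROM_URL/-/reload" && echo "reloaded"
}

# Usage: reload_prometheus
```

The config check also loads the files listed under rule_files, so a syntax error in alert.rules.yml should surface here as well.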
For hosts not running Docker, install node_exporter as a systemd service:
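# On each monitored host
# Release assets carry the version in their file name, so pin one explicitly.
VERSION=1.8.2 # example release; check the node_exporter releases page for the current one
wget https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-amd64.tar.gz
tar xzf node_exporter-${VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${VERSION}.linux-amd64/node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter
# Create systemd service (via sudo tee, because a plain redirect would run without root)
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << EOF
[Unit]
Description=Node Exporter
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter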
Alert Rules
alert.rules.yml:
groups:
  - name: homelab
    rules:
      - alert: HighCPUUsage
        # rate() rather than irate(): brief dips in irate() can reset the "for" timer
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "{{ $labels.mountpoint }} has {{ $value }}% free"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
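To sanity-check the DiskSpaceLow threshold against a real machine, the same available/size ratio can be computed locally from df output. This is a rough sketch: pct_free is a hypothetical helper, and df's notion of available space may differ slightly from what node_exporter reports.

```shell
# Reads `df -P` style output on stdin and prints "<mountpoint> <percent free>" per filesystem.
# Columns in df -P: filesystem, size, used, available, capacity, mount point.
pct_free() {
  awk 'NR > 1 && $2 > 0 { printf "%s %.1f\n", $6, $4 / $2 * 100 }'
}

# Usage: df -P | pct_free
```

Any mount point printed with a value below 10 would satisfy the alert expression once it has held for the 5 minute "for" window.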
Alertmanager Configuration
alertmanager.yml:
global:
  smtp_from: alertmanager@example.com # placeholder address
  smtp_smarthost: smtp.example.com:587
  smtp_auth_username: alertmanager@example.com # placeholder address
  smtp_auth_password: smtp-password
route:
  receiver: email-alerts
  group_by: [alertname] # Needed so .GroupLabels.alertname is populated
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: email-alerts
    email_configs:
      - to: you@example.com # placeholder address
        # The subject is set as a mail header; email_configs has no "subject" key
        headers:
          Subject: '[Homelab Alert] {{ .GroupLabels.alertname }}'
  # Slack alternative
  - name: slack-alerts
    slack_configs:
      - api_url: https://hooks.slack.com/services/...
        channel: '#homelab-alerts'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
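Rather than waiting for a real incident, routing can be exercised by posting a hand-made alert to Alertmanager. The sketch below assumes the container name alertmanager and the 9093 port mapping from the compose file; the alert labels and the names test_alert_payload, send_test_alert and AM_URL are made up for this example.

```shell
# Base URL assumes the 9093:9093 port mapping on the Docker host.
AM_URL="${AM_URL:-http://localhost:9093}"

# JSON body for the v2 alerts API: a single alert with illustrative labels.
test_alert_payload() {
  cat <<'EOF'
[{"labels": {"alertname": "TestAlert", "severity": "warning", "instance": "manual-test"},
  "annotations": {"description": "Manual test alert, safe to ignore"}}]
EOF
}

send_test_alert() {
  # amtool ships in the prom/alertmanager image; stop if the config does not parse.
  docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml || return 1
  test_alert_payload | curl -fsS -X POST -H 'Content-Type: application/json' \
    --data-binary @- "$AM_URL/api/v2/alerts"
}

# Usage: send_test_alert
```

With the route settings above, the notification should arrive after the 30s group_wait. The test alert carries no end time, so Alertmanager treats it as resolved once its resolve timeout passes.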
Grafana Setup
- Open http://your-server:3000 and log in with admin/your-password
- Add Prometheus data source: Connections → Data Sources → Add → Prometheus → URL: http://prometheus:9090
Pre-built dashboards (import by ID in Grafana → Dashboards → Import):
- 1860: Node Exporter Full — comprehensive Linux metrics
- 14282: Proxmox summary dashboard
- 193: Docker and system monitoring
- 11074: Node Exporter for Prometheus Dashboard EN
Import a dashboard:
- Grafana → Dashboards → New → Import
- Enter dashboard ID
- Select Prometheus data source
- Import
Proxmox Metrics Integration
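Proxmox does not serve Prometheus metrics on its own. The usual route is prometheus-pve-exporter, which queries the Proxmox API and exposes the results in Prometheus format: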
# On Proxmox host
apt install prometheus-pve-exporter
# Configure /etc/pve-exporter/config.yml
default:
  user: prometheus@pve
  password: monitoring-password
  verify_ssl: false
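Proxmox also has a built-in push mechanism (Datacenter → Metric Server → Add), but as of Proxmox VE 8 it offers Graphite and InfluxDB targets rather than Prometheus. It is an alternative only if you add one of those as a separate Grafana data source.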
Retention and Storage
Default Prometheus retention is 15 days. For homelab use, 30-90 days is more useful:
command:
  - --storage.tsdb.retention.time=90d
Expect approximately 5-15 MB per monitored host per day at a 15s scrape interval. A homelab with 5 hosts therefore uses roughly 0.75-2.25 GB per month, or about 2-7 GB at 90 days of retention.
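The arithmetic behind a storage estimate is simple enough to script. The sketch below takes the 5-15 MB per host per day figure at face value; estimate_mb is a hypothetical helper, and real usage depends on how many series each exporter produces.

```shell
# estimate_mb HOSTS MB_PER_HOST_PER_DAY DAYS -> total megabytes
estimate_mb() {
  echo $(( $1 * $2 * $3 ))
}

echo "5 hosts, 30 days: $(estimate_mb 5 5 30)-$(estimate_mb 5 15 30) MB"
echo "5 hosts, 90 days: $(estimate_mb 5 5 90)-$(estimate_mb 5 15 90) MB"
```

This prints a range of 750-2250 MB for 30 days and 2250-6750 MB for 90 days, which is a useful lower bound when sizing the prometheus-data volume.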
PromQL Basics
Prometheus uses PromQL for queries:
# Current CPU usage percentage per host (100 minus the idle share)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage by mount point
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
# Network receive rate (bytes/sec)
irate(node_network_receive_bytes_total{device="eth0"}[5m])
# Docker container CPU usage
rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100
These form the basis of dashboard panels. Grafana's panel editor shows the resulting graph as you type.
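The same expressions can be run outside Grafana through Prometheus's HTTP API, which is handy for scripts or quick checks. The sketch below assumes Prometheus is reachable on localhost:9090 as in the compose file; prom_query and PROM_URL are hypothetical names for this example.

```shell
# Base URL assumes the 9090:9090 port mapping on the Docker host.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant query; -G sends the url-encoded data as a GET query string.
prom_query() {
  curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage (the response is JSON; pipe it through jq if installed):
#   prom_query 'up'
#   prom_query '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100'
```

Letting curl url-encode the expression avoids having to escape braces, quotes and brackets by hand.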
