Homelab Prometheus and Node Exporter Setup
Prometheus is the backbone of modern infrastructure monitoring. While plenty of homelab guides show you how to run it in a Docker container with a basic config, this guide takes the approach most production environments use: native installation of Prometheus on a dedicated monitoring host, node_exporter deployed as a systemd service on every machine, and proper configuration of recording rules, alerting rules, and retention policies.
If you've already got Prometheus running in Docker and want to understand the deeper mechanics -- PromQL, recording rules, alert expressions, and how to manage storage as your homelab grows -- this is the guide for you.
Architecture Decisions
Before installing anything, make a few decisions about your monitoring setup:
Where to run Prometheus. Prometheus should run on a dedicated VM or host with local SSD storage. It's a time-series database that does constant writes, and it's CPU-intensive during rule evaluation. Don't co-locate it with your main application workloads if you can avoid it. A VM with 2 CPU cores, 4GB RAM, and 100GB SSD is a good starting point for monitoring 10-20 hosts.
How long to retain data. Prometheus stores data locally in its TSDB (Time-Series Database). The default retention is 15 days. For a homelab, 90 days is a practical balance between historical visibility and disk usage. A typical homelab with 10 hosts and standard exporters generates roughly 1-2GB per day of uncompressed TSDB data.
Scrape interval. The global scrape interval determines how often Prometheus pulls metrics from your targets. 15 seconds is the standard default and works well for most homelabs. Don't set it below 10 seconds unless you have a specific reason -- it increases storage consumption and puts more load on your exporters.
Installing Prometheus Natively
Download and install Prometheus from the official releases. This approach gives you more control than Docker and makes it easier to manage storage paths.
# Create a prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
# Create directories
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
# Download latest release (check github.com/prometheus/prometheus/releases)
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
tar xzf prometheus-2.53.0.linux-amd64.tar.gz
# Install binaries
sudo cp prometheus-2.53.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.53.0.linux-amd64/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
# Install console templates
sudo cp -r prometheus-2.53.0.linux-amd64/consoles /etc/prometheus/
sudo cp -r prometheus-2.53.0.linux-amd64/console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
Create the main configuration file at /etc/prometheus/prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_timeout: 10s
# Rule files for recording and alerting
rule_files:
- /etc/prometheus/rules/*.yml
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
# Scrape configurations
scrape_configs:
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
- job_name: "node"
static_configs:
- targets:
- "10.0.20.10:9100" # proxmox-host
- "10.0.20.11:9100" # docker-host
- "10.0.20.12:9100" # nas
- "10.0.20.13:9100" # monitoring (self)
- "10.0.20.14:9100" # backup-server
labels:
env: "homelab"
relabel_configs:
- source_labels: [__address__]
regex: "10.0.20.10:.*"
target_label: hostname
replacement: "proxmox"
- source_labels: [__address__]
regex: "10.0.20.11:.*"
target_label: hostname
replacement: "docker-host"
- source_labels: [__address__]
regex: "10.0.20.12:.*"
target_label: hostname
replacement: "nas"
- source_labels: [__address__]
regex: "10.0.20.13:.*"
target_label: hostname
replacement: "monitoring"
- source_labels: [__address__]
regex: "10.0.20.14:.*"
target_label: hostname
replacement: "backup-server"
The relabel_configs section adds human-readable hostnames as labels, so your queries and dashboards show "proxmox" instead of "10.0.20.10:9100".
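Once Prometheus is up and scraping (a few steps below), you can confirm the relabeling worked by querying the up series for the node job -- every result should carry a hostname label:
# Each node target should report up (value 1) and carry its hostname label
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="node"}' \
  | python3 -c '
import json, sys
for r in json.load(sys.stdin)["data"]["result"]:
    print(r["metric"].get("hostname", "<missing hostname>"), r["value"][1])
'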
Create the systemd service at /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=90d \
--storage.tsdb.retention.size=50GB \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.enable-lifecycle \
--web.enable-admin-api
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
The key flags:
- --storage.tsdb.retention.time=90d keeps 90 days of data
- --storage.tsdb.retention.size=50GB caps storage at 50GB (whichever limit hits first wins)
- --web.enable-lifecycle allows config reloads via HTTP POST
- --web.enable-admin-api enables the snapshot and delete APIs
Start Prometheus:
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus
Verify it's running at http://monitoring-host:9090.
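Beyond the web UI, two endpoint groups are worth knowing at this point: the plain health checks, and -- since the unit enables --web.enable-admin-api -- TSDB snapshots for consistent backups (the snapshot name in the response will differ on your system):
# Liveness and readiness checks, useful for scripts and external monitors
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/-/ready
# Take a point-in-time TSDB snapshot; it appears under /var/lib/prometheus/snapshots/
curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot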
Deploying Node Exporter on All Hosts
Node exporter should run on every machine you want to monitor. Install it the same way on each host:
# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter
# Download and install
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
sudo cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
Create the systemd service at /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes \
--collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($$|/)" \
--web.listen-address=:9100
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Notable flags:
- --collector.systemd exposes systemd unit states as metrics (which services are running, failed, etc.)
- --collector.processes exposes aggregate process counts and states
- The filesystem exclusion prevents metrics for virtual filesystems that clutter your dashboards
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
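Once it's up, confirm locally that the extra collectors are actually producing data (the grep patterns assume the flags above):
# Should print systemd unit-state series if the systemd collector is working
curl -s http://localhost:9100/metrics | grep '^node_systemd_unit_state' | head -5
# Aggregate process counts from the processes collector
curl -s http://localhost:9100/metrics | grep '^node_processes_'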
Verify from the Prometheus host:
curl -s http://10.0.20.10:9100/metrics | head -20
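Or sweep every host from the example config in one pass (substitute your own IPs):
# Expect a 200 from each node_exporter endpoint
for ip in 10.0.20.10 10.0.20.11 10.0.20.12 10.0.20.13 10.0.20.14; do
  printf '%s: ' "$ip"
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 3 "http://${ip}:9100/metrics"
done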
After adding targets to your prometheus.yml, reload the config:
curl -X POST http://localhost:9090/-/reload
Check the targets page at http://monitoring-host:9090/targets to verify all endpoints show as "UP".
PromQL Fundamentals
PromQL is Prometheus's query language. Understanding it is the difference between having metrics you can look at and having metrics you can actually use. Here are the patterns that matter most for homelab monitoring.
Instant Vectors vs Range Vectors
An instant vector returns the most recent value for each time series:
node_cpu_seconds_total
A range vector returns all values within a time window:
node_cpu_seconds_total[5m]
Range vectors can't be graphed directly -- you need to apply a function like rate() to convert them to instant vectors.
CPU Usage
CPU metrics are counters (they only go up). To get usage as a percentage, use rate() to calculate per-second change, then subtract idle from 100%:
# CPU usage per host (all cores averaged)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Per-core CPU usage for a specific host
100 - (rate(node_cpu_seconds_total{hostname="proxmox", mode="idle"}[5m]) * 100)
Memory Usage
Memory metrics are gauges (they go up and down). No rate() needed:
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Absolute memory used in GB
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024^3
Use MemAvailable rather than MemFree. Available includes buffers and cache that the kernel will release under memory pressure. Free is just unused pages.
Disk Usage
# Disk usage percentage per mount point
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes * 100)
# Disk I/O rate (bytes per second)
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Disk I/O latency (average seconds per operation)
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])
Network
# Network throughput (bytes per second)
rate(node_network_receive_bytes_total{device!~"lo|veth.*|br.*|docker.*"}[5m])
rate(node_network_transmit_bytes_total{device!~"lo|veth.*|br.*|docker.*"}[5m])
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
The device filter excludes loopback, veth pairs (Docker), and bridge interfaces that generate noise in dashboards.
Aggregation
# Total memory across all hosts
sum(node_memory_MemTotal_bytes) / 1024^3
# Average CPU across all hosts
avg(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
# Top 5 hosts by CPU usage
topk(5, 100 - (avg by (hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
Recording Rules
Recording rules pre-compute frequently used expressions and store the results as new time series. This reduces query load on Prometheus and makes dashboards load faster.
Create /etc/prometheus/rules/recording.yml:
groups:
- name: node_recording_rules
interval: 30s
rules:
# CPU usage per host
- record: instance:node_cpu_utilization:ratio
expr: >
1 - avg by (instance, hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Memory usage per host
- record: instance:node_memory_utilization:ratio
expr: >
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Disk usage per mount
- record: instance:node_filesystem_utilization:ratio
expr: >
1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes)
# Network receive rate per host
- record: instance:node_network_receive_bytes:rate5m
expr: >
sum by (instance, hostname) (rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m]))
# Network transmit rate per host
- record: instance:node_network_transmit_bytes:rate5m
expr: >
sum by (instance, hostname) (rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m]))
The naming convention level:metric:operations is the Prometheus standard. Use it consistently.
Validate the rules before reloading:
promtool check rules /etc/prometheus/rules/recording.yml
Then reload:
curl -X POST http://localhost:9090/-/reload
Now your Grafana dashboards can query instance:node_cpu_utilization:ratio instead of recalculating the full rate expression every time.
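You can spot-check that the recorded series exist and are up to date before pointing dashboards at them:
# Each host should return a recent CPU utilization ratio between 0 and 1
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=instance:node_cpu_utilization:ratio' \
  | python3 -m json.tool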
Alerting Rules
Alerting rules define conditions that trigger alerts. When an alert fires, Prometheus sends it to Alertmanager, which handles deduplication, grouping, routing, and notification delivery.
Create /etc/prometheus/rules/alerts.yml:
groups:
- name: node_alerts
rules:
# Host is down
- alert: HostDown
expr: up{job="node"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Host {{ $labels.hostname }} is down"
description: "Node exporter on {{ $labels.hostname }} ({{ $labels.instance }}) has been unreachable for 2 minutes."
# High CPU usage
- alert: HighCPU
expr: instance:node_cpu_utilization:ratio > 0.90
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU on {{ $labels.hostname }}"
description: "CPU usage on {{ $labels.hostname }} has been above 90% for 10 minutes. Current: {{ $value | humanizePercentage }}"
# High memory usage
- alert: HighMemory
expr: instance:node_memory_utilization:ratio > 0.90
for: 5m
labels:
severity: warning
annotations:
summary: "High memory on {{ $labels.hostname }}"
description: "Memory usage on {{ $labels.hostname }} has been above 90% for 5 minutes. Current: {{ $value | humanizePercentage }}"
# Disk space critical
- alert: DiskSpaceCritical
expr: instance:node_filesystem_utilization:ratio > 0.90
for: 5m
labels:
severity: critical
annotations:
summary: "Disk almost full on {{ $labels.hostname }}"
description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.hostname }} is {{ $value | humanizePercentage }} full."
# Disk space warning
- alert: DiskSpaceWarning
expr: instance:node_filesystem_utilization:ratio > 0.80
for: 30m
labels:
severity: warning
annotations:
summary: "Disk space low on {{ $labels.hostname }}"
description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.hostname }} is {{ $value | humanizePercentage }} full."
# Systemd service failed
- alert: SystemdServiceFailed
expr: node_systemd_unit_state{state="failed"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Systemd service failed on {{ $labels.hostname }}"
description: "Service {{ $labels.name }} has been in failed state for 5 minutes."
# High disk I/O latency
- alert: HighDiskLatency
expr: >
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) > 0.1
or
rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m]) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "High disk latency on {{ $labels.hostname }}"
description: "Disk {{ $labels.device }} on {{ $labels.hostname }} has >100ms average I/O latency."
# Predictive disk fill (linear extrapolation)
- alert: DiskWillFillIn24h
expr: >
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0
for: 1h
labels:
severity: warning
annotations:
summary: "Disk filling rapidly on {{ $labels.hostname }}"
description: "At current growth rate, {{ $labels.mountpoint }} on {{ $labels.hostname }} will be full within 24 hours."
The for duration is important. It prevents flapping -- the alert only fires if the condition has been true continuously for the specified duration. For a host being down, 2 minutes avoids alerting on brief network blips. For CPU, 10 minutes avoids alerting on legitimate load spikes.
Validate and reload:
promtool check rules /etc/prometheus/rules/alerts.yml
curl -X POST http://localhost:9090/-/reload
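After the reload, confirm both rule groups were loaded and check the current state of each alert (inactive, pending, or firing):
# Lists every loaded rule group with per-rule health and alert state
curl -s http://localhost:9090/api/v1/rules | python3 -m json.tool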
Retention Management
As your homelab grows, Prometheus storage grows with it. Here's how to manage it.
Understanding Storage Consumption
Check current TSDB stats via the Prometheus API:
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool
This shows head series and chunk counts, plus the highest-cardinality metric and label names. The key metrics to watch over time:
# TSDB block storage size on disk (excludes the WAL)
prometheus_tsdb_storage_blocks_bytes
# Number of active time series
prometheus_tsdb_head_series
# Samples ingested per second
rate(prometheus_tsdb_head_samples_appended_total[5m])
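Together these support the usual back-of-the-envelope sizing: needed disk is roughly retention_seconds * ingested_samples_per_second * bytes_per_sample, with compressed samples typically averaging 1-2 bytes. A quick sketch assuming 1.5 bytes per sample (an assumption -- measure your own ratio after a few days of data):
# Estimate disk needed for 90 days at the current ingestion rate
SAMPLES_PER_SEC=$(curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[1h])' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["data"]["result"][0]["value"][1])')
# 90 days * 86400 s/day * samples/s * 1.5 bytes, converted to GiB
python3 -c "print(f'~{90 * 86400 * ${SAMPLES_PER_SEC} * 1.5 / 1024**3:.1f} GiB for 90 days')"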
Controlling Growth
Reduce cardinality. The number of unique time series is the biggest factor in storage consumption. Each unique combination of metric name and label values is a separate series. If you have 10 hosts, each with 8 CPU cores, node_cpu_seconds_total creates 10 * 8 * 8 (modes) = 640 time series just for CPU. Multiply that by every metric and you can easily have 100,000+ series.
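Before deciding what to drop, find the biggest offenders -- this counts active series per metric name (the status/tsdb output above includes a similar top-10):
# Top 10 metric names by number of active series
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))' \
  | python3 -c '
import json, sys
for r in json.load(sys.stdin)["data"]["result"]:
    print(r["value"][1], r["metric"]["__name__"])
'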
To reduce cardinality, drop metrics you don't use with metric_relabel_configs:
scrape_configs:
- job_name: "node"
static_configs:
- targets: ["10.0.20.10:9100"]
metric_relabel_configs:
# Drop Go runtime metrics from node_exporter
- source_labels: [__name__]
regex: "go_.*"
action: drop
# Drop unused filesystem metrics
- source_labels: [__name__]
regex: "node_filesystem_(device_error|readonly|files|files_free)"
action: drop
Set retention limits. The two retention flags work as an OR -- whichever limit is hit first triggers deletion of the oldest blocks:
--storage.tsdb.retention.time=90d # Delete data older than 90 days
--storage.tsdb.retention.size=50GB # Delete oldest data when TSDB exceeds 50GB
Clean up after deletions. Prometheus compacts its TSDB automatically. If you delete series through the admin API, the data is only marked with tombstones at first; reclaim the disk space with:
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
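For example, if you retire a host and want its history gone, mark its series for deletion and then run the clean_tombstones call above. The matcher here is illustrative; deletions are permanent, so test the selector with a normal query first:
# Tombstone every series carrying the retired host's label
curl -s -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={hostname="backup-server"}'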
Long-Term Storage
If you need metrics beyond your retention window, consider remote write to a long-term storage backend like Thanos, Mimir, or VictoriaMetrics. For most homelabs, 90 days of local retention is sufficient and the complexity of a long-term storage backend isn't worth it.
Verifying Your Setup
After everything is configured, verify the full stack:
# Check Prometheus config
promtool check config /etc/prometheus/prometheus.yml
# Check all rules
promtool check rules /etc/prometheus/rules/*.yml
# Verify targets are up
curl -s http://localhost:9090/api/v1/targets | python3 -c "
import json, sys
data = json.load(sys.stdin)
for target in data['data']['activeTargets']:
print(f\"{target['labels'].get('hostname', target['scrapeUrl'])}: {target['health']}\")
"
# Check for firing alerts
curl -s http://localhost:9090/api/v1/alerts | python3 -m json.tool
All targets should show health: up. If any show down, check network connectivity (can the Prometheus host reach the target on port 9100?) and that node_exporter is running on the target.
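A few commands cover most of the failure modes -- run the first three on the target, the last from the Prometheus host, and substitute the relevant IP:
# On the target: is node_exporter running and listening on 9100?
sudo systemctl status node_exporter
sudo ss -tlnp | grep 9100
sudo journalctl -u node_exporter --since "15 minutes ago"
# From the Prometheus host: can it reach the target?
curl -sv --max-time 3 -o /dev/null http://10.0.20.14:9100/metrics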
With this setup -- native Prometheus, node_exporter on every host, recording rules for common queries, alerting rules for critical conditions, and sensible retention -- you have a production-grade monitoring foundation that will scale with your homelab. The next step is connecting Grafana and building dashboards, but that's a topic for another guide.