Homelab Prometheus and Node Exporter Setup
Prometheus is the backbone of modern infrastructure monitoring. While plenty of homelab guides show you how to run it in a Docker container with a basic config, this guide takes the approach most production environments use: native installation of Prometheus on a dedicated monitoring host, node_exporter deployed as a systemd service on every machine, and proper configuration of recording rules, alerting rules, and retention policies.
If you've already got Prometheus running in Docker and want to understand the deeper mechanics -- PromQL, recording rules, alert expressions, and how to manage storage as your homelab grows -- this is the guide for you.
Architecture Decisions
Before installing anything, make a few decisions about your monitoring setup:
Where to run Prometheus. Prometheus should run on a dedicated VM or host with local SSD storage. It's a time-series database that does constant writes, and it's CPU-intensive during rule evaluation. Don't co-locate it with your main application workloads if you can avoid it. A VM with 2 CPU cores, 4GB RAM, and 100GB SSD is a good starting point for monitoring 10-20 hosts.
How long to retain data. Prometheus stores data locally in its TSDB (Time-Series Database). The default retention is 15 days. For a homelab, 90 days is a practical balance between historical visibility and disk usage. A typical homelab with 10 hosts and standard exporters generates roughly 1-2GB per day of uncompressed TSDB data.
Scrape interval. The global scrape interval determines how often Prometheus pulls metrics from your targets. 15 seconds is the standard default and works well for most homelabs. Don't set it below 10 seconds unless you have a specific reason -- it increases storage consumption and puts more load on your exporters.
Installing Prometheus Natively
Download and install Prometheus from the official releases. This approach gives you more control than Docker and makes it easier to manage storage paths.
# Create a prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
# Create directories
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
# Download latest release (check github.com/prometheus/prometheus/releases)
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
tar xzf prometheus-2.53.0.linux-amd64.tar.gz
# Install binaries
sudo cp prometheus-2.53.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.53.0.linux-amd64/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
# Install console templates
sudo cp -r prometheus-2.53.0.linux-amd64/consoles /etc/prometheus/
sudo cp -r prometheus-2.53.0.linux-amd64/console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
Create the main configuration file at /etc/prometheus/prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_timeout: 10s
# Rule files for recording and alerting
rule_files:
- /etc/prometheus/rules/*.yml
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
# Scrape configurations
scrape_configs:
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
- job_name: "node"
static_configs:
- targets:
- "10.0.20.10:9100" # proxmox-host
- "10.0.20.11:9100" # docker-host
- "10.0.20.12:9100" # nas
- "10.0.20.13:9100" # monitoring (self)
- "10.0.20.14:9100" # backup-server
labels:
env: "homelab"
relabel_configs:
- source_labels: [__address__]
regex: "10.0.20.10:.*"
target_label: hostname
replacement: "proxmox"
- source_labels: [__address__]
regex: "10.0.20.11:.*"
target_label: hostname
replacement: "docker-host"
- source_labels: [__address__]
regex: "10.0.20.12:.*"
target_label: hostname
replacement: "nas"
- source_labels: [__address__]
regex: "10.0.20.13:.*"
target_label: hostname
replacement: "monitoring"
- source_labels: [__address__]
regex: "10.0.20.14:.*"
target_label: hostname
replacement: "backup-server"
The relabel_configs section adds human-readable hostnames as labels, so your queries and dashboards show "proxmox" instead of "10.0.20.10:9100".
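Once Prometheus is up and scraping (a few steps below), you can confirm the relabeling worked by querying the up series for the node job -- every result should carry a hostname label:
# Each node target should report up (value 1) and carry its hostname label
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="node"}' \
  | python3 -c '
import json, sys
for r in json.load(sys.stdin)["data"]["result"]:
    print(r["metric"].get("hostname", "<missing hostname>"), r["value"][1])
'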
Create the systemd service at /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=90d \
--storage.tsdb.retention.size=50GB \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.enable-lifecycle \
--web.enable-admin-api
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
The key flags:
- --storage.tsdb.retention.time=90d keeps 90 days of data
- --storage.tsdb.retention.size=50GB caps storage at 50GB (whichever limit hits first wins)
- --web.enable-lifecycle allows config reloads via HTTP POST
- --web.enable-admin-api enables the snapshot and delete APIs
Start Prometheus:
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus
Verify it's running at http://monitoring-host:9090.
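Beyond the web UI, two endpoint groups are worth knowing at this point: the plain health checks, and -- since the unit enables --web.enable-admin-api -- TSDB snapshots for consistent backups (the snapshot name in the response will differ on your system):
# Liveness and readiness checks, useful for scripts and external monitors
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/-/ready
# Take a point-in-time TSDB snapshot; it appears under /var/lib/prometheus/snapshots/
curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot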
Deploying Node Exporter on All Hosts
Node exporter should run on every machine you want to monitor. Install it the same way on each host:
# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter
# Download and install
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
sudo cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
Create the systemd service at /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes \
--collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($$|/)" \
--web.listen-address=:9100
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Notable flags:
- --collector.systemd exposes systemd unit states as metrics (which services are running, failed, etc.)
- --collector.processes exposes aggregate process counts and states
- The filesystem exclusion prevents metrics for virtual filesystems that clutter your dashboards
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
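Once it's up, confirm locally that the extra collectors are actually producing data (the grep patterns assume the flags above):
# Should print systemd unit-state series if the systemd collector is working
curl -s http://localhost:9100/metrics | grep '^node_systemd_unit_state' | head -5
# Aggregate process counts from the processes collector
curl -s http://localhost:9100/metrics | grep '^node_processes_'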
Verify from the Prometheus host:
curl -s http://10.0.20.10:9100/metrics | head -20
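Or sweep every host from the example config in one pass (substitute your own IPs):
# Expect a 200 from each node_exporter endpoint
for ip in 10.0.20.10 10.0.20.11 10.0.20.12 10.0.20.13 10.0.20.14; do
  printf '%s: ' "$ip"
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 3 "http://${ip}:9100/metrics"
done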
After adding targets to your prometheus.yml, reload the config:
curl -X POST http://localhost:9090/-/reload
Check the targets page at http://monitoring-host:9090/targets to verify all endpoints show as "UP".
PromQL Fundamentals
PromQL is Prometheus's query language. Understanding it is the difference between having metrics you can look at and having metrics you can actually use. Here are the patterns that matter most for homelab monitoring.
Instant Vectors vs Range Vectors
An instant vector returns the most recent value for each time series:
node_cpu_seconds_total
A range vector returns all values within a time window:
node_cpu_seconds_total[5m]
Range vectors can't be graphed directly -- you need to apply a function like rate() to convert them to instant vectors.
CPU Usage
CPU metrics are counters (they only go up). To get usage as a percentage, use rate() to calculate per-second change, then subtract idle from 100%:
# CPU usage per host (all cores averaged)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Per-core CPU usage for a specific host
100 - (rate(node_cpu_seconds_total{hostname="proxmox", mode="idle"}[5m]) * 100)
Memory Usage
Memory metrics are gauges (they go up and down). No rate() needed:
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Absolute memory used in GB
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024^3
Use MemAvailable rather than MemFree. Available includes buffers and cache that the kernel will release under memory pressure. Free is just unused pages.
Disk Usage
# Disk usage percentage per mount point
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes * 100)
# Disk I/O rate (bytes per second)
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Disk I/O latency (average seconds per operation)
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])
Network
# Network throughput (bytes per second)
rate(node_network_receive_bytes_total{device!~"lo|veth.*|br.*|docker.*"}[5m])
rate(node_network_transmit_bytes_total{device!~"lo|veth.*|br.*|docker.*"}[5m])
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
The device filter excludes loopback, veth pairs (Docker), and bridge interfaces that generate noise in dashboards.
Aggregation
# Total memory across all hosts
sum(node_memory_MemTotal_bytes) / 1024^3
# Average CPU across all hosts
avg(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
# Top 5 hosts by CPU usage
topk(5, 100 - (avg by (hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
Recording Rules
Recording rules pre-compute frequently used expressions and store the results as new time series. This reduces query load on Prometheus and makes dashboards load faster.
Create /etc/prometheus/rules/recording.yml:
groups:
- name: node_recording_rules
interval: 30s
rules:
# CPU usage per host
- record: instance:node_cpu_utilization:ratio
expr: >
1 - avg by (instance, hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Memory usage per host
- record: instance:node_memory_utilization:ratio
expr: >
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Disk usage per mount
- record: instance:node_filesystem_utilization:ratio
expr: >
1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes)
# Network receive rate per host
- record: instance:node_network_receive_bytes:rate5m
expr: >
sum by (instance, hostname) (rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m]))
# Network transmit rate per host
- record: instance:node_network_transmit_bytes:rate5m
expr: >
sum by (instance, hostname) (rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m]))
The naming convention level:metric:operations is the Prometheus standard. Use it consistently.
Validate the rules before reloading:
promtool check rules /etc/prometheus/rules/recording.yml
Then reload:
curl -X POST http://localhost:9090/-/reload
Now your Grafana dashboards can query instance:node_cpu_utilization:ratio instead of recalculating the full rate expression every time.
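You can spot-check that the recorded series exist and are up to date before pointing dashboards at them:
# Each host should return a recent CPU utilization ratio between 0 and 1
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=instance:node_cpu_utilization:ratio' \
  | python3 -m json.tool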
Alerting Rules
Alerting rules define conditions that trigger alerts. When an alert fires, Prometheus sends it to Alertmanager, which handles deduplication, grouping, routing, and notification delivery.
Create /etc/prometheus/rules/alerts.yml:
groups:
- name: node_alerts
rules:
# Host is down
- alert: HostDown
expr: up{job="node"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Host {{ $labels.hostname }} is down"
description: "Node exporter on {{ $labels.hostname }} ({{ $labels.instance }}) has been unreachable for 2 minutes."
# High CPU usage
- alert: HighCPU
expr: instance:node_cpu_utilization:ratio > 0.90
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU on {{ $labels.hostname }}"
description: "CPU usage on {{ $labels.hostname }} has been above 90% for 10 minutes. Current: {{ $value | humanizePercentage }}"
# High memory usage
- alert: HighMemory
expr: instance:node_memory_utilization:ratio > 0.90
for: 5m
labels:
severity: warning
annotations:
summary: "High memory on {{ $labels.hostname }}"
description: "Memory usage on {{ $labels.hostname }} has been above 90% for 5 minutes. Current: {{ $value | humanizePercentage }}"
# Disk space critical
- alert: DiskSpaceCritical
expr: instance:node_filesystem_utilization:ratio > 0.90
for: 5m
labels:
severity: critical
annotations:
summary: "Disk almost full on {{ $labels.hostname }}"
description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.hostname }} is {{ $value | humanizePercentage }} full."
# Disk space warning
- alert: DiskSpaceWarning
expr: instance:node_filesystem_utilization:ratio > 0.80
for: 30m
labels:
severity: warning
annotations:
summary: "Disk space low on {{ $labels.hostname }}"
description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.hostname }} is {{ $value | humanizePercentage }} full."
# Systemd service failed
- alert: SystemdServiceFailed
expr: node_systemd_unit_state{state="failed"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Systemd service failed on {{ $labels.hostname }}"
description: "Service {{ $labels.name }} has been in failed state for 5 minutes."
# High disk I/O latency
- alert: HighDiskLatency
expr: >
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) > 0.1
or
rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m]) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "High disk latency on {{ $labels.hostname }}"
description: "Disk {{ $labels.device }} on {{ $labels.hostname }} has >100ms average I/O latency."
# Predictive disk fill (linear extrapolation)
- alert: DiskWillFillIn24h
expr: >
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0
for: 1h
labels:
severity: warning
annotations:
summary: "Disk filling rapidly on {{ $labels.hostname }}"
description: "At current growth rate, {{ $labels.mountpoint }} on {{ $labels.hostname }} will be full within 24 hours."
The for duration is important. It prevents flapping -- the alert only fires if the condition has been true continuously for the specified duration. For a host being down, 2 minutes avoids alerting on brief network blips. For CPU, 10 minutes avoids alerting on legitimate load spikes.
Validate and reload:
promtool check rules /etc/prometheus/rules/alerts.yml
curl -X POST http://localhost:9090/-/reload
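After the reload, confirm both rule groups were loaded and check the current state of each alert (inactive, pending, or firing):
# Lists every loaded rule group with per-rule health and alert state
curl -s http://localhost:9090/api/v1/rules | python3 -m json.tool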
Retention Management
As your homelab grows, Prometheus storage grows with it. Here's how to manage it.
Understanding Storage Consumption
Check current TSDB stats via the Prometheus API:
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool
This shows head series and chunk counts, plus the highest-cardinality metric and label names. The key metrics to watch over time:
# TSDB block storage size on disk (excludes the WAL)
prometheus_tsdb_storage_blocks_bytes
# Number of active time series
prometheus_tsdb_head_series
# Samples ingested per second
rate(prometheus_tsdb_head_samples_appended_total[5m])
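Together these support the usual back-of-the-envelope sizing: needed disk is roughly retention_seconds * ingested_samples_per_second * bytes_per_sample, with compressed samples typically averaging 1-2 bytes. A quick sketch assuming 1.5 bytes per sample (an assumption -- measure your own ratio after a few days of data):
# Estimate disk needed for 90 days at the current ingestion rate
SAMPLES_PER_SEC=$(curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[1h])' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["data"]["result"][0]["value"][1])')
# 90 days * 86400 s/day * samples/s * 1.5 bytes, converted to GiB
python3 -c "print(f'~{90 * 86400 * ${SAMPLES_PER_SEC} * 1.5 / 1024**3:.1f} GiB for 90 days')"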
Controlling Growth
Reduce cardinality. The number of unique time series is the biggest factor in storage consumption. Each unique combination of metric name and label values is a separate series. If you have 10 hosts, each with 8 CPU cores, node_cpu_seconds_total creates 10 * 8 * 8 (modes) = 640 time series just for CPU. Multiply that by every metric and you can easily have 100,000+ series.
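Before deciding what to drop, find the biggest offenders -- this counts active series per metric name (the status/tsdb output above includes a similar top-10):
# Top 10 metric names by number of active series
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))' \
  | python3 -c '
import json, sys
for r in json.load(sys.stdin)["data"]["result"]:
    print(r["value"][1], r["metric"]["__name__"])
'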
To reduce cardinality, drop metrics you don't use with metric_relabel_configs:
scrape_configs:
- job_name: "node"
static_configs:
- targets: ["10.0.20.10:9100"]
metric_relabel_configs:
# Drop Go runtime metrics from node_exporter
- source_labels: [__name__]
regex: "go_.*"
action: drop
# Drop unused filesystem metrics
- source_labels: [__name__]
regex: "node_filesystem_(device_error|readonly|files|files_free)"
action: drop
Set retention limits. The two retention flags work as an OR -- whichever limit is hit first triggers deletion of the oldest blocks:
--storage.tsdb.retention.time=90d # Delete data older than 90 days
--storage.tsdb.retention.size=50GB # Delete oldest data when TSDB exceeds 50GB
Clean up after deletions. Prometheus compacts its TSDB automatically. If you delete series through the admin API, the data is only marked with tombstones at first; reclaim the disk space with:
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
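For example, if you retire a host and want its history gone, mark its series for deletion and then run the clean_tombstones call above. The matcher here is illustrative; deletions are permanent, so test the selector with a normal query first:
# Tombstone every series carrying the retired host's label
curl -s -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={hostname="backup-server"}'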
Long-Term Storage
If you need metrics beyond your retention window, consider remote write to a long-term storage backend like Thanos, Mimir, or VictoriaMetrics. For most homelabs, 90 days of local retention is sufficient and the complexity of a long-term storage backend isn't worth it.
Verifying Your Setup
After everything is configured, verify the full stack:
# Check Prometheus config
promtool check config /etc/prometheus/prometheus.yml
# Check all rules
promtool check rules /etc/prometheus/rules/*.yml
# Verify targets are up
curl -s http://localhost:9090/api/v1/targets | python3 -c "
import json, sys
data = json.load(sys.stdin)
for target in data['data']['activeTargets']:
print(f\"{target['labels'].get('hostname', target['scrapeUrl'])}: {target['health']}\")
"
# Check for firing alerts
curl -s http://localhost:9090/api/v1/alerts | python3 -m json.tool
All targets should show health: up. If any show down, check network connectivity (can the Prometheus host reach the target on port 9100?) and that node_exporter is running on the target.
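A few commands cover most of the failure modes -- run the first three on the target, the last from the Prometheus host, and substitute the relevant IP:
# On the target: is node_exporter running and listening on 9100?
sudo systemctl status node_exporter
sudo ss -tlnp | grep 9100
sudo journalctl -u node_exporter --since "15 minutes ago"
# From the Prometheus host: can it reach the target?
curl -sv --max-time 3 -o /dev/null http://10.0.20.14:9100/metrics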
With this setup -- native Prometheus, node_exporter on every host, recording rules for common queries, alerting rules for critical conditions, and sensible retention -- you have a production-grade monitoring foundation that will scale with your homelab. The next step is connecting Grafana and building dashboards, but that's a topic for another guide.