Configuring Prometheus Alertmanager for Home Lab Monitoring
Prometheus collects your metrics. Grafana visualizes them. But neither of those wakes you up at 2 AM when your NAS runs out of disk space. That's Alertmanager's job.
Grafana has its own alerting system, and for simple setups it works fine. But if you're running Prometheus as your metrics backbone, Alertmanager is the purpose-built tool for handling alerts: deduplication, grouping, routing to different receivers, silencing during maintenance, and inhibition rules that prevent alert storms. It's a dedicated piece of infrastructure that does one thing well.
This guide walks through installing Alertmanager, writing alert rules in Prometheus, configuring notification channels, and setting up the routing logic that makes alerts useful instead of noisy.
Architecture
The alert flow works like this:
- Prometheus evaluates alert rules against your metrics on a regular interval
- When a rule's condition is met, Prometheus fires the alert to Alertmanager
- Alertmanager groups, deduplicates, and routes the alert to the appropriate receiver (Discord, email, Slack, webhook, etc.)
- If the condition clears, Prometheus sends a resolved notification
Prometheus decides what to alert on. Alertmanager decides who gets notified and how.
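Once both pieces are running, you can watch that handoff directly: Alertmanager's v2 API lists every alert Prometheus has pushed to it. The command below assumes Alertmanager is reachable on localhost:9093 and that jq is installed.
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels'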
Installing Alertmanager
Docker Compose
If you're already running Prometheus in Docker (as covered in the Grafana monitoring guide), add Alertmanager to the same compose file:
services:
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
volumes:
  alertmanager-data:
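A quick sketch of bringing the container up and confirming it is listening, assuming the compose file sits in the current directory; the /-/ready endpoint returns a 200 once Alertmanager has loaded its configuration:
docker compose up -d alertmanager
docker compose logs alertmanager
curl -s http://localhost:9093/-/ready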
Binary Install
Download from the Prometheus downloads page:
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz
tar xzf alertmanager-0.28.1.linux-amd64.tar.gz
sudo cp alertmanager-0.28.1.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-0.28.1.linux-amd64/amtool /usr/local/bin/
Create a systemd service at /etc/systemd/system/alertmanager.service:
[Unit]
Description=Prometheus Alertmanager
After=network.target
[Service]
User=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager
Restart=always
[Install]
WantedBy=multi-user.target
Then create the user and directories, and enable the service:
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
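The unit won't start cleanly until the config file from the next section exists at /etc/alertmanager/alertmanager.yml. Once it does, a quick check (assuming the default port) confirms the service is up:
curl -s http://localhost:9093/-/healthy
sudo journalctl -u alertmanager -n 20 --no-pager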
Configuring Alertmanager
Create /etc/alertmanager/alertmanager.yml (or ./alertmanager.yml for Docker):
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'discord'
  routes:
    - matchers:
        - severity = critical
      receiver: 'discord'
      repeat_interval: 1h
    - matchers:
        - severity = warning
      receiver: 'discord'
      repeat_interval: 12h
receivers:
  - name: 'discord'
    discord_configs:
      - webhook_url: 'https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN'
        title: '{{ .GroupLabels.alertname }}'
        message: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
The key concepts:
- route — The top-level route defines default behavior. Child routes override it for matching alerts.
- group_by — Alerts with the same alertname and instance labels are grouped into a single notification.
- group_wait — Wait 30 seconds after the first alert in a group before sending, to catch any related alerts.
- group_interval — After the first notification, wait 5 minutes before sending updates about new alerts in the same group.
- repeat_interval — Re-send the same alert every 4 hours by default (1 hour for critical, 12 hours for warning) if it's still firing.
- receivers — Define where notifications go.
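Before (re)starting Alertmanager, you can validate the file and dry-run the routing tree with amtool. The routes test subcommand prints which receiver a given label set would be routed to; the labels here are just examples:
amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical alertname=HostDown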
Connecting Prometheus to Alertmanager
Update your prometheus.yml to point at Alertmanager and load alert rules:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'   # Docker service name
            # or '192.168.1.X:9093' # Direct IP
rule_files:
  - 'alert_rules.yml'
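promtool, which ships alongside Prometheus, can check the edited file before you reload. The HTTP reload below only works if Prometheus was started with --web.enable-lifecycle; otherwise restart the container or service instead:
promtool check config prometheus.yml
curl -X POST http://localhost:9090/-/reload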
Writing Alert Rules
Create alert_rules.yml alongside your prometheus.yml:
groups:
  - name: homelab
    rules:
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."
      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 85% on {{ $labels.instance }}"
          description: "Root partition on {{ $labels.instance }} is at {{ $value | printf \"%.1f\" }}%."
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
          description: "{{ $labels.instance }} memory is at {{ $value | printf \"%.1f\" }}%."
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% on {{ $labels.instance }}"
          description: "{{ $labels.instance }} CPU has been above 90% for 10 minutes."
      - alert: DiskWillFillIn24Hours
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 24 hours"
          description: "Based on the last 6 hours of data, the root partition on {{ $labels.instance }} will run out of space within 24 hours."
Each rule has:
- expr — The PromQL expression that must be true to fire
- for — How long the condition must be true before firing (prevents flapping)
- labels — Added to the alert, used for routing in Alertmanager
- annotations — Human-readable descriptions used in notifications
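promtool can validate the rule file, and promtool test rules can unit-test a rule against synthetic data before anything fires for real. The test below is a hypothetical example for HostDown; the job and instance labels are made up and only need to match the input series:
promtool check rules alert_rules.yml
A minimal test file, say alert_rules_test.yml, run with promtool test rules alert_rules_test.yml:
rule_files:
  - alert_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node",instance="nas:9100"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 4m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: nas:9100
            exp_annotations:
              summary: "Host nas:9100 is down"
              description: "nas:9100 has been unreachable for more than 2 minutes."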
The DiskWillFillIn24Hours rule is worth calling out — it uses predict_linear to project disk usage forward. This catches gradual fills (like growing log files) hours before they become a crisis.
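You can preview that projection in the Prometheus query UI before trusting the alert; dividing by 1e9 just makes the result readable as roughly the gigabytes expected to remain in 24 hours:
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) / 1e9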
Notification Receivers
Discord
Create a webhook in your Discord server (Server Settings > Integrations > Webhooks), then use the config shown above. Discord is the most popular notification channel for homelabs.
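If you want to confirm the webhook itself works independently of Alertmanager, a plain curl with a content field posts a message to the channel (substitute your own webhook URL):
curl -H "Content-Type: application/json" \
  -d '{"content": "Alertmanager webhook test"}' \
  "https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"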
Email (SMTP)
receivers:
  - name: 'email'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '[email protected]'
        auth_password: 'your-app-password'
        require_tls: true
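If you end up with several email receivers, the SMTP settings can live once in the global block instead of being repeated per receiver; a sketch of the equivalent globals:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true
Each email receiver then only needs its to address.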
Telegram
receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        message: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
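If you don't know your chat_id, one way to find it (assuming you have already sent your bot a message) is the Bot API's getUpdates call:
curl -s "https://api.telegram.org/botYOUR_BOT_TOKEN/getUpdates" | jq '.result[].message.chat.id'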
Generic Webhook
For anything else — Home Assistant, custom scripts, Ntfy:
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://192.168.1.100:8080/alertmanager'
        send_resolved: true
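For reference, Alertmanager POSTs a JSON body containing an alerts array to the configured URL. Below is a minimal sketch of a receiver using only the Python standard library; the port matches the config above, and the handler ignores the request path:
# alert_webhook.py: print every alert Alertmanager delivers to this endpoint
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        for alert in payload.get("alerts", []):
            # Each alert carries status ("firing" or "resolved"), labels, and annotations
            print(alert["status"], alert["labels"].get("alertname"),
                  alert["annotations"].get("summary", ""))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()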
Silencing Alerts
When you're doing maintenance — rebooting a server, migrating VMs, replacing a disk — you don't want alerts firing for expected downtime. Alertmanager supports silences.
Via the Web UI
Open http://alertmanager-ip:9093 and click New Silence. Set matchers (e.g., instance = 192.168.1.50:9100), a duration, and a comment explaining why.
Via amtool
# Create a silence for 2 hours
amtool silence add --alertmanager.url=http://localhost:9093 \
  instance="192.168.1.50:9100" \
  --duration=2h \
  --comment="NAS disk replacement"
# List active silences
amtool silence query --alertmanager.url=http://localhost:9093
# Expire a silence early
amtool silence expire --alertmanager.url=http://localhost:9093 SILENCE_ID
Inhibition Rules
Inhibition prevents redundant alerts. If a host is completely down, you don't need separate alerts for high CPU, high disk, and high memory on that same host — the "host down" alert is sufficient.
inhibit_rules:
  - source_matchers:
      - alertname = HostDown
    target_matchers:
      - alertname =~ "High.*"
    equal: ['instance']
This says: if HostDown is firing for an instance, suppress any alert matching High.* for the same instance. Clean and logical.
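Another pattern worth considering, sketched here as an optional addition rather than something this guide requires: let a critical alert suppress the warning-level version of the same alert on the same host.
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname', 'instance']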
Testing Your Setup
After configuring everything, verify the pipeline works:
- Check Prometheus is loading rules: visit http://prometheus:9090/rules
- Check Prometheus can reach Alertmanager: visit http://prometheus:9090/targets and look for the alertmanager target
- Force-fire a test alert with amtool:
amtool alert add --alertmanager.url=http://localhost:9093 \
  alertname=TestAlert severity=warning instance=test \
  --annotation="summary=This is a test alert"
- Check it arrives in your notification channel
- Expire the test: it will auto-resolve, or you can silence it
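You can also confirm from the command line what Alertmanager currently holds, which is handy for checking that the test alert showed up and later resolved:
amtool alert query --alertmanager.url=http://localhost:9093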
Practical Tips
Start with few alerts and add more over time. Alert fatigue is real. If you're getting pinged every day for things that don't matter, you'll start ignoring everything. Begin with disk space, host down, and memory. Add more only when you've identified a real need.
Use for: durations generously. A CPU spike to 95% for 30 seconds during a backup is normal. Setting for: 10m on CPU alerts prevents noise from transient spikes.
Keep repeat intervals reasonable. Getting the same "disk is full" alert every 30 minutes is exhausting. Once every 4 hours is enough for warnings. Critical alerts can repeat more frequently.
Set up resolved notifications. It's reassuring to get a "resolved" message after an alert clears. Most receivers support this with send_resolved: true.
Alertmanager is not glamorous infrastructure. It doesn't have pretty dashboards or satisfying visualizations. But it's the piece that turns your monitoring from passive observation into active awareness. The first time your phone buzzes with a disk space warning while you're away from your desk, you'll appreciate having it.