Configuring Prometheus Alertmanager for Home Lab Monitoring
Prometheus collects your metrics. Grafana visualizes them. But neither of those wakes you up at 2 AM when your NAS runs out of disk space. That's Alertmanager's job.
Grafana has its own alerting system, and for simple setups it works fine. But if you're running Prometheus as your metrics backbone, Alertmanager is the purpose-built tool for handling alerts: deduplication, grouping, routing to different receivers, silencing during maintenance, and inhibition rules that prevent alert storms. It's a dedicated piece of infrastructure that does one thing well.
This guide walks through installing Alertmanager, writing alert rules in Prometheus, configuring notification channels, and setting up the routing logic that makes alerts useful instead of noisy.
Architecture
The alert flow works like this:
- Prometheus evaluates alert rules against your metrics on a regular interval
- When a rule's condition is met, Prometheus fires the alert to Alertmanager
- Alertmanager groups, deduplicates, and routes the alert to the appropriate receiver (Discord, email, Slack, webhook, etc.)
- If the condition clears, Prometheus sends a resolved notification
Prometheus decides what to alert on. Alertmanager decides who gets notified and how.
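Once both pieces are running, you can watch that handoff directly: Alertmanager's v2 API lists every alert Prometheus has pushed to it. The command below assumes Alertmanager is reachable on localhost:9093 and that jq is installed.
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels'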
Installing Alertmanager
Docker Compose
If you're already running Prometheus in Docker (as covered in the Grafana monitoring guide), add Alertmanager to the same compose file:
services:
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
volumes:
  alertmanager-data:
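A quick sketch of bringing the container up and confirming it is listening, assuming the compose file sits in the current directory; the /-/ready endpoint returns a 200 once Alertmanager has loaded its configuration:
docker compose up -d alertmanager
docker compose logs alertmanager
curl -s http://localhost:9093/-/ready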
Binary Install
Download from the Prometheus downloads page:
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz
tar xzf alertmanager-0.28.1.linux-amd64.tar.gz
sudo cp alertmanager-0.28.1.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-0.28.1.linux-amd64/amtool /usr/local/bin/
Create a systemd service at /etc/systemd/system/alertmanager.service:
[Unit]
Description=Prometheus Alertmanager
After=network.target
[Service]
User=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager
Restart=always
[Install]
WantedBy=multi-user.target
Then create the user and directories, and enable the service:
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
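The unit won't start cleanly until the config file from the next section exists at /etc/alertmanager/alertmanager.yml. Once it does, a quick check (assuming the default port) confirms the service is up:
curl -s http://localhost:9093/-/healthy
sudo journalctl -u alertmanager -n 20 --no-pager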
Configuring Alertmanager
Create /etc/alertmanager/alertmanager.yml (or ./alertmanager.yml for Docker):
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'discord'
  routes:
    - matchers:
        - severity = critical
      receiver: 'discord'
      repeat_interval: 1h
    - matchers:
        - severity = warning
      receiver: 'discord'
      repeat_interval: 12h
receivers:
  - name: 'discord'
    discord_configs:
      - webhook_url: 'https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN'
        title: '{{ .GroupLabels.alertname }}'
        message: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
The key concepts:
- route — The top-level route defines default behavior. Child routes override it for matching alerts.
- group_by — Alerts with the same alertname and instance labels are grouped into a single notification.
- group_wait — Wait 30 seconds after the first alert in a group before sending, to catch any related alerts.
- group_interval — After the first notification, wait 5 minutes before sending updates about new alerts in the same group.
- repeat_interval — Re-send the same alert every 4 hours by default (1 hour for critical, 12 hours for warning) if it's still firing.
- receivers — Define where notifications go.
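Before (re)starting Alertmanager, you can validate the file and dry-run the routing tree with amtool. The routes test subcommand prints which receiver a given label set would be routed to; the labels here are just examples:
amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical alertname=HostDown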
Connecting Prometheus to Alertmanager
Update your prometheus.yml to point at Alertmanager and load alert rules:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'   # Docker service name
            # or '192.168.1.X:9093' # Direct IP
rule_files:
  - 'alert_rules.yml'
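promtool, which ships alongside Prometheus, can check the edited file before you reload. The HTTP reload below only works if Prometheus was started with --web.enable-lifecycle; otherwise restart the container or service instead:
promtool check config prometheus.yml
curl -X POST http://localhost:9090/-/reload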
Writing Alert Rules
Create alert_rules.yml alongside your prometheus.yml:
groups:
  - name: homelab
    rules:
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."
      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 85% on {{ $labels.instance }}"
          description: "Root partition on {{ $labels.instance }} is at {{ $value | printf \"%.1f\" }}%."
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
          description: "{{ $labels.instance }} memory is at {{ $value | printf \"%.1f\" }}%."
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% on {{ $labels.instance }}"
          description: "{{ $labels.instance }} CPU has been above 90% for 10 minutes."
      - alert: DiskWillFillIn24Hours
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 24 hours"
          description: "Based on the last 6 hours of data, the root partition on {{ $labels.instance }} will run out of space within 24 hours."
Each rule has:
- expr — The PromQL expression that must be true to fire
- for — How long the condition must be true before firing (prevents flapping)
- labels — Added to the alert, used for routing in Alertmanager
- annotations — Human-readable descriptions used in notifications
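promtool can validate the rule file, and promtool test rules can unit-test a rule against synthetic data before anything fires for real. The test below is a hypothetical example for HostDown; the job and instance labels are made up and only need to match the input series:
promtool check rules alert_rules.yml
A minimal test file, say alert_rules_test.yml, run with promtool test rules alert_rules_test.yml:
rule_files:
  - alert_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node",instance="nas:9100"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 4m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: nas:9100
            exp_annotations:
              summary: "Host nas:9100 is down"
              description: "nas:9100 has been unreachable for more than 2 minutes."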
The DiskWillFillIn24Hours rule is worth calling out — it uses predict_linear to project disk usage forward. This catches gradual fills (like growing log files) hours before they become a crisis.
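You can preview that projection in the Prometheus query UI before trusting the alert; dividing by 1e9 just makes the result readable as roughly the gigabytes expected to remain in 24 hours:
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) / 1e9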
Notification Receivers
Discord
Create a webhook in your Discord server (Server Settings > Integrations > Webhooks), then use the config shown above. Discord is the most popular notification channel for homelabs.
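If you want to confirm the webhook itself works independently of Alertmanager, a plain curl with a content field posts a message to the channel (substitute your own webhook URL):
curl -H "Content-Type: application/json" \
  -d '{"content": "Alertmanager webhook test"}' \
  "https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"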
Email (SMTP)
receivers:
  - name: 'email'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '[email protected]'
        auth_password: 'your-app-password'
        require_tls: true
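If you end up with several email receivers, the SMTP settings can live once in the global block instead of being repeated per receiver; a sketch of the equivalent globals:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true
Each email receiver then only needs its to address.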
Telegram
receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        message: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
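If you don't know your chat_id, one way to find it (assuming you have already sent your bot a message) is the Bot API's getUpdates call:
curl -s "https://api.telegram.org/botYOUR_BOT_TOKEN/getUpdates" | jq '.result[].message.chat.id'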
Generic Webhook
For anything else — Home Assistant, custom scripts, Ntfy:
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://192.168.1.100:8080/alertmanager'
        send_resolved: true
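For reference, Alertmanager POSTs a JSON body containing an alerts array to the configured URL. Below is a minimal sketch of a receiver using only the Python standard library; the port matches the config above, and the handler ignores the request path:
# alert_webhook.py: print every alert Alertmanager delivers to this endpoint
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        for alert in payload.get("alerts", []):
            # Each alert carries status ("firing" or "resolved"), labels, and annotations
            print(alert["status"], alert["labels"].get("alertname"),
                  alert["annotations"].get("summary", ""))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()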
Silencing Alerts
When you're doing maintenance — rebooting a server, migrating VMs, replacing a disk — you don't want alerts firing for expected downtime. Alertmanager supports silences.
Via the Web UI
Open http://alertmanager-ip:9093 and click New Silence. Set matchers (e.g., instance = 192.168.1.50:9100), a duration, and a comment explaining why.
Via amtool
# Create a silence for 2 hours
amtool silence add --alertmanager.url=http://localhost:9093 \
  instance="192.168.1.50:9100" \
  --duration=2h \
  --comment="NAS disk replacement"
# List active silences
amtool silence query --alertmanager.url=http://localhost:9093
# Expire a silence early
amtool silence expire --alertmanager.url=http://localhost:9093 SILENCE_ID
Inhibition Rules
Inhibition prevents redundant alerts. If a host is completely down, you don't need separate alerts for high CPU, high disk, and high memory on that same host — the "host down" alert is sufficient.
inhibit_rules:
  - source_matchers:
      - alertname = HostDown
    target_matchers:
      - alertname =~ "High.*"
    equal: ['instance']
This says: if HostDown is firing for an instance, suppress any alert matching High.* for the same instance. Clean and logical.
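Another pattern worth considering, sketched here as an optional addition rather than something this guide requires: let a critical alert suppress the warning-level version of the same alert on the same host.
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname', 'instance']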
Testing Your Setup
After configuring everything, verify the pipeline works:
- Check Prometheus is loading rules: visit http://prometheus:9090/rules
- Check Prometheus can reach Alertmanager: visit http://prometheus:9090/targets and look for the alertmanager target
- Force-fire a test alert with amtool:
amtool alert add --alertmanager.url=http://localhost:9093 \
  alertname=TestAlert severity=warning instance=test \
  --annotation="summary=This is a test alert"
- Check it arrives in your notification channel
- Expire the test: it will auto-resolve, or you can silence it
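You can also confirm from the command line what Alertmanager currently holds, which is handy for checking that the test alert showed up and later resolved:
amtool alert query --alertmanager.url=http://localhost:9093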
Practical Tips
Start with few alerts and add more over time. Alert fatigue is real. If you're getting pinged every day for things that don't matter, you'll start ignoring everything. Begin with disk space, host down, and memory. Add more only when you've identified a real need.
Use for: durations generously. A CPU spike to 95% for 30 seconds during a backup is normal. Setting for: 10m on CPU alerts prevents noise from transient spikes.
Keep repeat intervals reasonable. Getting the same "disk is full" alert every 30 minutes is exhausting. Once every 4 hours is enough for warnings. Critical alerts can repeat more frequently.
Set up resolved notifications. It's reassuring to get a "resolved" message after an alert clears. Most receivers support this with send_resolved: true.
Alertmanager is not glamorous infrastructure. It doesn't have pretty dashboards or satisfying visualizations. But it's the piece that turns your monitoring from passive observation into active awareness. The first time your phone buzzes with a disk space warning while you're away from your desk, you'll appreciate having it.