Homelab Observability with Grafana LGTM Stack
Individual monitoring tools get you partway there. Prometheus shows you that CPU spiked at 3 AM. Loki shows you the error log that happened around the same time. But connecting metrics to logs to traces across your entire homelab requires all three signal types flowing into a unified platform where you can correlate them. That platform is the LGTM stack: Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics.
This guide walks through deploying the complete LGTM stack on your homelab using Docker Compose, with Grafana Alloy as the unified collection agent. By the end, you will have a single observability platform where clicking on a metric spike takes you to the relevant logs and traces without switching tools or guessing at timestamps.

Why the Full LGTM Stack
If you already run Prometheus and Grafana, you might wonder why you need Mimir, Loki, and Tempo on top. The short answer is signal correlation. The longer answer involves understanding what each component brings that standalone tools cannot.
Mimir is a horizontally-scalable, long-term metrics store that is fully compatible with the Prometheus remote write API. You can keep running Prometheus as a scraper, but Mimir gives you multi-tenant isolation, cheaper long-term retention via object storage, and global query views across multiple Prometheus instances. For a homelab, the practical benefit is months of metric retention without Prometheus eating all your SSD space.
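Because Mimir speaks the Prometheus remote write API, keeping an existing Prometheus as the scraper is a one-block change. A minimal sketch for prometheus.yml, assuming Prometheus can reach the Mimir container at mimir:9009:

```yaml
# prometheus.yml — keep scraping locally, ship samples to Mimir for long-term storage
remote_write:
  - url: http://mimir:9009/api/v1/push
```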
Loki stores logs indexed only by labels, not by full-text content. This makes it dramatically cheaper to run than Elasticsearch. You query logs using LogQL, which intentionally mirrors PromQL syntax, so you do not need to learn a completely different query language.
Tempo stores distributed traces using the same label-based approach as Loki. It is the cheapest trace backend to operate because it does not index trace content. Instead, it relies on trace IDs and service graph generation for discovery. For a homelab running microservices or multi-container applications, Tempo shows you exactly where requests spend their time.
Grafana ties them together. Its Explore view lets you jump from a metric panel to correlated logs to related traces in a single click. The data sources share label conventions, so job="nginx" in Mimir corresponds to {job="nginx"} in Loki.
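As a sketch of how the shared label conventions line up across signal types (the metric name here is hypothetical):

```promql
# PromQL against Mimir: request rate for the nginx job
rate(nginx_http_requests_total{job="nginx"}[5m])

# LogQL against Loki: error-line rate for the same job — same selector shape
sum(rate({job="nginx"} |= "error" [5m]))
```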
LGTM vs. Standalone Tools Comparison
| Aspect | Standalone (Prometheus + ELK + Jaeger) | LGTM Stack |
|---|---|---|
| Query languages | PromQL + KQL + Jaeger UI | PromQL + LogQL (similar syntax) |
| Storage backend | Each has its own | Unified object storage for all |
| Collection agent | Multiple (node_exporter, Filebeat, OTEL) | Single (Grafana Alloy) |
| Correlation | Manual timestamp matching | Native exemplars + TraceQL links |
| Memory footprint | High (Elasticsearch alone needs 4-8 GB) | Moderate (Loki and Tempo are lightweight) |
| Configuration | Three different config formats | Consistent YAML + River config |
| Multi-tenancy | Varies by tool | Built-in across all components |
Prerequisites
Before deploying, make sure you have:
- A Linux host with at least 8 GB RAM (16 GB recommended for comfortable operation)
- Docker and Docker Compose v2 installed
- At least 50 GB of free disk space for data retention
- Basic familiarity with Prometheus concepts (scraping, labels, PromQL)
For production homelabs with multiple hosts, you will also want Alloy running on each machine, but we will start with a single-node deployment.
Architecture Decisions
Before writing any configuration, there are a few decisions to make.
Monolithic vs. Microservice Mode
Loki, Mimir, and Tempo each support two deployment modes. Monolithic mode runs all components in a single process. Microservice mode splits read and write paths into separate containers that scale independently.
For a homelab, use monolithic mode. Microservice mode is designed for multi-terabyte-per-day ingestion rates that no homelab will reach. Monolithic mode uses less memory, requires fewer containers, and is simpler to configure.
Storage Backend
All three backends can store data on the local filesystem or in object storage (S3-compatible). For a homelab:
- Local filesystem is fine if you have a single node and can tolerate data loss in a disk failure.
- MinIO (self-hosted S3) is better if you want to separate compute from storage or already run MinIO for other purposes.
This guide uses local filesystem storage for simplicity. Switching to MinIO later requires only changing the storage configuration blocks.
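For Mimir, for example, the switch would look roughly like this (a sketch — the MinIO endpoint, bucket, and credentials are placeholders you would substitute):

```yaml
# mimir.yaml — replacing the filesystem backend with MinIO (S3-compatible)
common:
  storage:
    backend: s3
    s3:
      endpoint: minio:9000        # hypothetical MinIO address
      bucket_name: mimir
      access_key_id: minio
      secret_access_key: changeme
      insecure: true              # plain HTTP inside the homelab network
```

Loki and Tempo accept analogous S3 blocks in their own storage sections.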
Retention Policy
Set retention based on your available disk space. A reasonable starting point:
- Metrics (Mimir): 90 days
- Logs (Loki): 30 days
- Traces (Tempo): 14 days
Traces are high-volume and low-value for historical analysis, so short retention is standard. Logs stay around longer for debugging. Metrics keep the longest because they compress well and trend analysis benefits from history.
Docker Compose Deployment
Create a directory for your LGTM stack:
mkdir -p ~/docker/lgtm-stack/{config,data/{mimir,loki,tempo,grafana}}
cd ~/docker/lgtm-stack
The Compose File
# ~/docker/lgtm-stack/docker-compose.yml
services:
mimir:
image: grafana/mimir:latest
container_name: mimir
restart: unless-stopped
command:
- -config.file=/etc/mimir/config.yaml
- -target=all
volumes:
- ./config/mimir.yaml:/etc/mimir/config.yaml:ro
- ./data/mimir:/data
ports:
- "9009:9009"
networks:
- lgtm
loki:
image: grafana/loki:latest
container_name: loki
restart: unless-stopped
command:
- -config.file=/etc/loki/config.yaml
- -target=all
volumes:
- ./config/loki.yaml:/etc/loki/config.yaml:ro
- ./data/loki:/loki
ports:
- "3100:3100"
networks:
- lgtm
tempo:
image: grafana/tempo:latest
container_name: tempo
restart: unless-stopped
command:
- -config.file=/etc/tempo/config.yaml
- -target=all
volumes:
- ./config/tempo.yaml:/etc/tempo/config.yaml:ro
- ./data/tempo:/var/tempo
ports:
- "3200:3200" # Tempo HTTP API
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
networks:
- lgtm
grafana:
image: grafana/grafana:latest
container_name: grafana
restart: unless-stopped
environment:
- GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor,tempoSearch,tempoBackendSearch
volumes:
- ./data/grafana:/var/lib/grafana
- ./config/grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml:ro
ports:
- "3000:3000"
depends_on:
- mimir
- loki
- tempo
networks:
- lgtm
alloy:
image: grafana/alloy:latest
container_name: alloy
restart: unless-stopped
command:
- run
- /etc/alloy/config.alloy
- --server.http.listen-addr=0.0.0.0:12345
- --storage.path=/var/lib/alloy/data
volumes:
- ./config/alloy.river:/etc/alloy/config.alloy:ro
- /var/log:/var/log:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
ports:
- "12345:12345"
depends_on:
- mimir
- loki
- tempo
networks:
- lgtm
pid: host
networks:
lgtm:
driver: bridge
Mimir Configuration
# ~/docker/lgtm-stack/config/mimir.yaml
target: all
multitenancy_enabled: false
server:
http_listen_port: 9009
log_level: warn
common:
storage:
backend: filesystem
filesystem:
dir: /data
blocks_storage:
storage_prefix: blocks
  tsdb:
    dir: /data/tsdb
compactor:
data_dir: /data/compactor
sharding_ring:
kvstore:
store: memberlist
distributor:
ring:
kvstore:
store: memberlist
ingester:
ring:
kvstore:
store: memberlist
replication_factor: 1
store_gateway:
sharding_ring:
kvstore:
store: memberlist
limits:
  # Block retention is a limits setting in Mimir, not a blocks_storage one
  compactor_blocks_retention_period: 90d
  max_global_series_per_user: 500000
  ingestion_rate: 50000
  ingestion_burst_size: 100000
ruler_storage:
backend: filesystem
filesystem:
dir: /data/rules
The key settings here: multitenancy_enabled: false simplifies authentication for a single-user homelab, replication_factor: 1 is correct for a single node, and the limits are generous enough for a homelab but prevent runaway cardinality.
Loki Configuration
# ~/docker/lgtm-stack/config/loki.yaml
auth_enabled: false
server:
http_listen_port: 3100
log_level: warn
common:
ring:
kvstore:
store: inmemory
replication_factor: 1
path_prefix: /loki
schema_config:
configs:
- from: "2024-01-01"
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: loki_index_
period: 24h
storage_config:
filesystem:
directory: /loki/chunks
limits_config:
retention_period: 720h # 30 days
max_query_series: 5000
max_query_parallelism: 4
compactor:
working_directory: /loki/compactor
retention_enabled: true
delete_request_store: filesystem
Loki v3 uses the TSDB store by default, which is significantly faster at query time than the older BoltDB index. The v13 schema enables the latest optimizations.
Tempo Configuration
# ~/docker/lgtm-stack/config/tempo.yaml
server:
http_listen_port: 3200
log_level: warn
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
storage:
trace:
backend: local
local:
path: /var/tempo/traces
wal:
path: /var/tempo/wal
compactor:
compaction:
block_retention: 336h # 14 days
metrics_generator:
registry:
external_labels:
source: tempo
cluster: homelab
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://mimir:9009/api/v1/push
send_exemplars: true
overrides:
defaults:
metrics_generator:
processors:
- service-graphs
- span-metrics
The metrics_generator section is critical for LGTM integration. Tempo generates RED metrics (rate, errors, duration) from traces and pushes them to Mimir. This means your trace data automatically creates metrics you can alert on, without instrumenting anything extra.
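Once the generator is running, the generated series can be queried from Mimir like any other metric. For example, a per-service request rate derived purely from traces (metric and label names follow the span-metrics processor defaults):

```promql
# Requests per second by service, with no application-side metrics instrumentation
sum by (service) (rate(traces_spanmetrics_calls_total[5m]))
```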
Grafana Data Source Provisioning
# ~/docker/lgtm-stack/config/grafana-datasources.yaml
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    uid: mimir
    access: proxy
    url: http://mimir:9009/prometheus
    isDefault: true
jsonData:
httpMethod: POST
exemplarTraceIdDestinations:
- name: traceID
datasourceUid: tempo
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
jsonData:
derivedFields:
- name: TraceID
datasourceUid: tempo
matcherRegex: "traceID=(\\w+)"
          url: "${__value.raw}"
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3200
uid: tempo
jsonData:
tracesToLogs:
datasourceUid: loki
filterByTraceID: true
filterBySpanID: true
tracesToMetrics:
datasourceUid: mimir
serviceMap:
datasourceUid: mimir
nodeGraph:
enabled: true
This provisioning file is where the correlation happens. The exemplarTraceIdDestinations in Mimir links metric exemplars to Tempo traces. The derivedFields in Loki extract trace IDs from log lines and link to Tempo. The tracesToLogs and tracesToMetrics in Tempo link back to Loki and Mimir. Every signal type can navigate to the others.
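For the Loki derived field to fire, a log line only needs a traceID=<hex> token. A quick sketch of what the matcherRegex extracts, using a hypothetical log line (the `\\w+` in the YAML is `\w+` once parsed):

```javascript
// Same pattern as the matcherRegex in the Loki data source above.
const matcher = /traceID=(\w+)/;

// A hypothetical application log line carrying trace context.
const line = 'level=error traceID=4bf92f3577b34da6a3ce929d0e0e4736 msg="payment failed"';

// Grafana turns the captured group into a clickable link to the Tempo trace.
const traceId = line.match(matcher)[1];
console.log(traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```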
Alloy Collection Configuration
// ~/docker/lgtm-stack/config/alloy.river
// ============================================
// Metrics Collection
// ============================================
// Scrape Alloy's own metrics
prometheus.scrape "alloy_self" {
targets = [{
__address__ = "localhost:12345",
job = "alloy",
}]
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Discover and scrape Docker containers with prometheus labels
discovery.docker "containers" {
host = "unix:///var/run/docker.sock"
}
discovery.relabel "docker_metrics" {
targets = discovery.docker.containers.targets
rule {
source_labels = ["__meta_docker_container_label_prometheus_scrape"]
regex = "true"
action = "keep"
}
rule {
source_labels = ["__meta_docker_container_label_prometheus_port"]
target_label = "__address__"
regex = "(.*)"
replacement = "${1}"
}
rule {
source_labels = ["__meta_docker_container_name"]
target_label = "container"
}
}
prometheus.scrape "docker_containers" {
targets = discovery.relabel.docker_metrics.output
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Node-level metrics (host PID namespace required)
prometheus.exporter.unix "node" {}
prometheus.scrape "node_metrics" {
targets = prometheus.exporter.unix.node.targets
forward_to = [prometheus.remote_write.mimir.receiver]
}
prometheus.remote_write "mimir" {
endpoint {
url = "http://mimir:9009/api/v1/push"
}
}
// ============================================
// Log Collection
// ============================================
// System logs (syslog/journal)
local.file_match "syslog" {
path_targets = [{
__address__ = "localhost",
__path__ = "/var/log/syslog",
job = "syslog",
host = env("HOSTNAME"),
}]
}
loki.source.file "syslog" {
targets = local.file_match.syslog.targets
forward_to = [loki.process.pipeline.receiver]
}
// Docker container logs
loki.source.docker "containers" {
host = "unix:///var/run/docker.sock"
targets = discovery.docker.containers.targets
forward_to = [loki.process.pipeline.receiver]
}
// Log processing pipeline
loki.process "pipeline" {
// Extract log level
stage.regex {
expression = "(?i)(?P<level>error|warn|info|debug)"
}
stage.labels {
values = { level = "" }
}
forward_to = [loki.write.loki.receiver]
}
loki.write "loki" {
endpoint {
url = "http://loki:3100/loki/api/v1/push"
}
}
// ============================================
// Trace Collection (OTLP receiver)
// ============================================
otelcol.receiver.otlp "default" {
grpc {
endpoint = "0.0.0.0:4327"
}
http {
endpoint = "0.0.0.0:4328"
}
output {
traces = [otelcol.processor.batch.default.input]
}
}
otelcol.processor.batch "default" {
output {
traces = [otelcol.exporter.otlp.tempo.input]
}
}
otelcol.exporter.otlp "tempo" {
client {
endpoint = "tempo:4317"
tls {
insecure = true
}
}
}
This Alloy configuration handles all three signal types in a single config file. Metrics are scraped from Docker containers and the host, logs are collected from Docker and syslog, and traces are received via OTLP and forwarded to Tempo.
Deploying the Stack
cd ~/docker/lgtm-stack
docker compose up -d
Verify all containers are healthy:
docker compose ps
You should see all five containers running. Check individual logs if anything fails:
docker compose logs mimir --tail=50
docker compose logs loki --tail=50
docker compose logs tempo --tail=50
Common startup issues:
- Mimir OOM: Reduce max_global_series_per_user to 100000
- Loki permissions: Ensure ./data/loki is writable by UID 10001 (Loki's container user)
- Tempo WAL errors: Ensure ./data/tempo exists and is writable
Building Dashboards
Once the stack is running, access Grafana at http://your-host:3000 and log in with the admin password you set.
Node Overview Dashboard
Create a new dashboard and add these panels:
CPU Usage (Mimir data source):
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory Usage:
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
Disk I/O:
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])
Log Explorer
Navigate to Explore, select the Loki data source, and run:
{job="syslog"} |= "error" | logfmt | line_format "{{.msg}}"
This filters syslog entries containing "error", parses structured fields, and formats the output.
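LogQL also supports metric queries over log streams, using the same rate and sum functions as PromQL. For example, using the level label that the Alloy pipeline extracts:

```logql
# Log lines per second by extracted level, over 5-minute windows
sum by (level) (rate({job="syslog"}[5m]))
```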
Trace Exploration
If your applications send OTLP traces, navigate to Explore with the Tempo data source. Use the Search tab to find traces by service name, duration, or status code. Click any trace to see the full span waterfall.
Correlating Signals
The real power of the LGTM stack shows up when you correlate. From a metric panel showing elevated error rates:
- Click the exemplar dots on the metric graph (small diamonds on the time series)
- Grafana jumps to the Tempo trace for that specific request
- From the trace view, click "Logs for this span" to see logs from that exact time window and service
This correlation requires your applications to include trace IDs in log output. Most OpenTelemetry SDKs do this automatically. For applications that log without trace context, the timestamp-based correlation in Grafana still works reasonably well.
Adding Applications to the Pipeline
Instrumenting with OpenTelemetry
For applications you control, add the OpenTelemetry SDK. Here is a Node.js example:
// tracing.js — load before your application
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'http://alloy:4327', // Alloy's OTLP gRPC endpoint
}),
instrumentations: [getNodeAutoInstrumentations()],
serviceName: 'my-homelab-app',
});
sdk.start();
Docker Labels for Auto-Discovery
Add labels to your Docker Compose services so Alloy automatically discovers and scrapes them. Note that the prometheus.port label must hold the full host:port value, because the relabel rule copies it directly into __address__:
services:
my-app:
image: my-app:latest
labels:
prometheus.scrape: "true"
prometheus.port: "my-app:8080"
Performance Tuning for Homelab Hardware
The LGTM stack is designed for large-scale deployments, so the defaults are often too aggressive for homelab hardware. Here are the adjustments that matter most.
Memory Limits
Set container memory limits to prevent any single component from consuming all available RAM:
services:
mimir:
deploy:
resources:
limits:
memory: 2g
loki:
deploy:
resources:
limits:
memory: 1g
tempo:
deploy:
resources:
limits:
memory: 1g
grafana:
deploy:
resources:
limits:
memory: 512m
Reducing Mimir Resource Usage
Add these to your Mimir config for lower memory consumption:
ingester:
  ring:
    kvstore:
      store: memberlist
    replication_factor: 1
compactor:
  compaction_interval: 30m # less frequent than default
querier:
  max_concurrent: 4 # limit parallel queries
Loki Chunk Tuning
ingester:
chunk_idle_period: 30m
chunk_retain_period: 1m
max_chunk_age: 2h
Larger chunks mean fewer index entries and better compression, at the cost of slightly higher memory usage during ingestion.
Alerting
Mimir supports Prometheus-compatible alerting rules. Create a rules file:
# ~/docker/lgtm-stack/config/rules/homelab-alerts.yaml
groups:
- name: homelab
interval: 1m
rules:
- alert: HighCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "Disk space below 10% on {{ $labels.instance }}"
One caveat: a log-based rule like HighErrorRate below uses LogQL, which Mimir's ruler cannot evaluate. It belongs in a separate rules file loaded by Loki's ruler, not in the Mimir rules file above:
      - alert: HighErrorRate
        expr: sum by (job) (rate({job=~".+"} |= "error" [5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in {{ $labels.job }} logs"
Mount the rules directory into the Mimir container and add the ruler configuration:
# Add to mimir.yaml
ruler:
rule_path: /data/rules-temp
alertmanager_url: http://alertmanager:9093
ring:
kvstore:
store: memberlist
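With multitenancy disabled, Mimir stores and reads rules under the anonymous tenant. Given the /data mount from the Compose file, the ruler's filesystem storage would expect a layout like this (sketch):

```text
data/mimir/rules/
└── anonymous/
    └── homelab-alerts.yaml
```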
Maintenance and Upgrades
Backup Strategy
The critical data to back up:
- Grafana dashboards: Export as JSON or use provisioning files (version-controlled)
- Configuration files: Already in your config/ directory (version-control this)
- Alert rules: Version-control alongside configurations
- Data directories: Optional. Metrics and logs can be re-collected, but historical data is nice to keep
Upgrading Components
The LGTM components follow a regular release cadence. To upgrade:
cd ~/docker/lgtm-stack
docker compose pull
docker compose up -d
Check the Grafana Labs changelog before upgrading. Mimir and Loki occasionally introduce breaking config changes between minor versions. Pin to specific versions in production:
services:
mimir:
image: grafana/mimir:2.14.0 # pin version
Monitoring the Monitors
Alloy's built-in UI at port 12345 shows the health of every pipeline component. Check it when something seems wrong. Additionally, Mimir, Loki, and Tempo all expose /ready and /metrics endpoints. Add these to your Alloy scrape config for meta-monitoring.
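A minimal sketch of such a scrape block for alloy.river, assuming the container names from the Compose file above:

```river
// Meta-monitoring: scrape the LGTM components' own /metrics endpoints
prometheus.scrape "lgtm_components" {
  targets = [
    { __address__ = "mimir:9009", job = "mimir" },
    { __address__ = "loki:3100",  job = "loki" },
    { __address__ = "tempo:3200", job = "tempo" },
  ]
  forward_to = [prometheus.remote_write.mimir.receiver]
}
```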
Next Steps
Once the base LGTM stack is running, consider:
- Adding more Alloy instances on other homelab machines to collect metrics and logs remotely
- Setting up Alertmanager for routing alerts to ntfy, Gotify, or email
- Creating SLO dashboards using Tempo's service graph metrics
- Enabling exemplars in your application metrics for direct metric-to-trace linking
- Adding synthetic monitoring with Grafana's k6 or Blackbox Exporter
The LGTM stack is a serious observability platform running on homelab hardware. It gives you the same tooling that companies use to monitor production systems at scale, and once configured, it mostly runs itself. The initial setup investment pays off every time you need to debug an issue and can correlate metrics, logs, and traces in a single pane of glass.
