Creating a Disaster Recovery Plan for Your Homelab
You have backups. Good. But if your server room floods tonight, can you actually recover? Do you know which services depend on which VMs? Do you have your encryption keys stored somewhere other than the machine they are protecting? Can you rebuild your entire lab from scratch and get critical services running within a day?
If you hesitated on any of those questions, you need a disaster recovery plan. Not a vague intention to "restore from backups" -- an actual written plan that you have tested and know works.

Most homelabbers have some form of backup. Far fewer have tested those backups. And almost nobody has a documented, end-to-end recovery procedure. This guide walks you through building a real DR plan: documenting what you have, verifying your backups actually work, writing step-by-step recovery procedures, handling off-site storage, and -- critically -- testing the whole thing.
Why Homelabbers Need a DR Plan
Enterprise IT departments have formal DR plans because downtime costs money. Your homelab probably does not cost money when it is down, but it might hold things that are irreplaceable:
- Family photos and videos that exist nowhere else
- Personal documents (tax records, legal files, medical records)
- Self-hosted services your household depends on (Home Assistant, Pi-hole, Nextcloud)
- Configuration and automation that took months to build
- Credentials and secrets needed to access other services
The threat model for a homelab includes:
| Scenario | Likelihood | Impact |
|---|---|---|
| Accidental deletion (rm -rf, dropped database) | High | Low-Medium |
| Single disk failure | Medium | Low (if RAID/ZFS) |
| Server hardware failure (PSU, motherboard) | Medium | Medium-High |
| Failed update corrupts a service | Medium | Medium |
| Ransomware encrypts everything on the network | Low-Medium | Critical |
| Power surge / lightning strike | Low | High |
| Fire, flood, theft | Very Low | Critical |
A good DR plan covers all of these, from the mundane "oops I deleted a container" to the worst case "everything is gone."
Step 1: Document Your Infrastructure
You cannot recover what you have not documented. Start by creating an inventory of everything in your lab.
Hardware Inventory
Create a file (a simple text file, a wiki page, or a spreadsheet) listing every piece of hardware:
# Hardware Inventory
Updated: 2026-02-15
## Server: pve-01 (Primary Proxmox Host)
- Hardware: Dell OptiPlex 7050 Micro
- CPU: Intel i7-7700T (4C/8T)
- RAM: 32 GB DDR4
- Storage: 512 GB NVMe (boot/VMs), 2 TB SATA SSD (data)
- Network: 1x 1GbE onboard, 1x USB 2.5GbE adapter
- IP: 10.0.20.10 (VLAN 20), IPMI: N/A
- Serial/Asset: SVC-TAG-HERE
- Purchase date: 2024-06-15
- Power draw: ~35W idle
## Server: nas-01 (TrueNAS)
- Hardware: Custom build, Fractal Node 304
- CPU: Intel i3-12100 (4C/8T)
- RAM: 32 GB DDR4 ECC
- Storage: 4x 4 TB HDD (ZFS RAIDZ1), 256 GB NVMe (boot)
- Network: 2x 2.5GbE (LACP bond)
- IP: 10.0.20.11 (VLAN 20)
- Usable storage: ~10.9 TB
- Power draw: ~45W idle
## Networking
- Router: Protectli FW4B running OPNsense, IP: 10.0.1.1
- Switch: TP-Link TL-SG116E (16-port managed), IP: 10.0.50.2
- AP: UniFi U6-Lite, IP: 10.0.50.3
- UPS: CyberPower CP1500PFCLCD (900W/1500VA)
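Inventories drift as hardware changes. A small script can capture a point-in-time snapshot to diff against the written inventory; a minimal sketch using standard Linux tools (the output filename is illustrative, and commands may need tweaks per distro):

```shell
#!/bin/bash
# Sketch: capture a hardware snapshot to compare against the inventory file.
out="/tmp/hardware-snapshot-$(date +%F).txt"
{
  echo "# Hardware snapshot: $(hostname) $(date -Is)"
  lscpu | grep -E 'Model name|^CPU\(s\)'            # CPU model and core count
  free -h | awk '/^Mem:/ {print "RAM: " $2}'        # installed memory
  lsblk -d -o NAME,SIZE,MODEL 2>/dev/null           # physical disks
} > "$out"
echo "Wrote $out"
```

Run it on each host and diff the output against the inventory during your quarterly review.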
Service Inventory
List every VM, container, and service, along with its dependencies:
# Service Inventory
## Proxmox VMs/CTs on pve-01
### CT 100: docker-host (Ubuntu 24.04 LXC)
- Purpose: Runs all Docker services
- IP: 10.0.20.20
- Resources: 4 cores, 8 GB RAM, 100 GB disk
- Services running:
- Jellyfin (port 8096)
- Radarr (port 7878)
- Sonarr (port 8989)
- Homepage dashboard (port 3000)
- Uptime Kuma (port 3001)
- Depends on: nas-01 for media storage (NFS mount)
- Backup: Daily to nas-01 via Proxmox Backup Server
### CT 101: pihole (Alpine LXC)
- Purpose: DNS ad blocking
- IP: 10.0.20.21
- Resources: 1 core, 512 MB RAM, 8 GB disk
- Critical: Yes (DNS resolution for entire network)
- Backup: Daily, plus gravity.db export weekly
### VM 200: homeassistant (HAOS)
- Purpose: Home automation
- IP: 10.0.30.10
- Resources: 2 cores, 4 GB RAM, 64 GB disk
- Depends on: IoT VLAN access, MQTT broker
- Backup: Built-in HA backups to NAS + weekly snapshot
- Critical: Yes (automations, family depends on it)
Credentials and Secrets Inventory
Document where every important credential is stored -- without writing the actual credentials in your DR plan:
# Credentials Location Map
- Proxmox root password: Bitwarden vault, "Homelab" folder
- TrueNAS admin password: Bitwarden vault
- OPNsense admin: Bitwarden vault
- SSH keys: ~/.ssh/ on laptop + backup in Bitwarden
- Borg backup passphrase: Bitwarden vault, "Backup Encryption" entry
- ZFS encryption key: /root/zfs.key on nas-01 + Bitwarden vault
- Cloudflare API token: Bitwarden vault
- DDNS credentials: Bitwarden vault
- Wi-Fi passwords: Bitwarden vault, shared with family members
The critical rule: never store your only copy of an encryption key on the system it encrypts. If your NAS dies and the ZFS encryption key was only on the NAS, your backup is unrecoverable.
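One way to satisfy that rule is to escrow an encrypted copy of the key off the box and verify the round-trip before trusting it. A sketch using openssl (the /tmp paths and demo passphrase are placeholders; the real passphrase belongs in your password manager):

```shell
#!/bin/bash
# Sketch: escrow an encryption key off-box. Paths and the passphrase are
# placeholders -- in practice the passphrase lives in Bitwarden.
echo "example-key-material" > /tmp/zfs.key   # stand-in for the real key file
# Encrypt a copy destined for off-site storage
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -in /tmp/zfs.key -out /tmp/zfs.key.enc -pass pass:demo-passphrase
# Always verify the round-trip before relying on the escrowed copy
openssl enc -d -aes-256-cbc -pbkdf2 \
  -in /tmp/zfs.key.enc -pass pass:demo-passphrase | cmp - /tmp/zfs.key \
  && echo "encrypted copy verified"
```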
Step 2: Verify Your Backups
Having backups is meaningless if they do not work. Verification is the most neglected part of homelab backup strategies.
Test Restores Monthly
Schedule a monthly test where you actually restore something:
# Example: Test restoring a Borg backup
# 1. List available archives
borg list /mnt/backup/borg-repo
# 2. Restore the latest archive to a temporary directory
#    (borg extract writes into the current working directory;
#    there is no --destination flag)
mkdir /tmp/restore-test
cd /tmp/restore-test
borg extract /mnt/backup/borg-repo::archive-name-2026-02-14
# 3. Verify file integrity
diff -r /tmp/restore-test/etc/important-config /etc/important-config
# 4. Clean up
cd / && rm -rf /tmp/restore-test
Automate Backup Verification
Do not rely on remembering to test backups manually. Automate it:
#!/bin/bash
# /usr/local/bin/verify-backups.sh
# Run weekly via cron: 0 3 * * 0 /usr/local/bin/verify-backups.sh
LOG="/var/log/backup-verify.log"
ALERT_EMAIL="[email protected]"
ERRORS=0
echo "=== Backup verification $(date) ===" >> "$LOG"
# Check Borg repository integrity
if ! borg check /mnt/backup/borg-repo >> "$LOG" 2>&1; then
    echo "FAIL: Borg repo check failed" >> "$LOG"
    ERRORS=$((ERRORS + 1))
fi
# Check that the latest archive is recent (within 48 hours)
LATEST=$(borg list --sort-by timestamp --last 1 /mnt/backup/borg-repo \
    --format '{time}' 2>/dev/null)
if [ -z "$LATEST" ]; then
    echo "FAIL: No Borg archives found" >> "$LOG"
    ERRORS=$((ERRORS + 1))
else
    LATEST_TS=$(date -d "$LATEST" +%s 2>/dev/null)
    NOW=$(date +%s)
    AGE_HOURS=$(( (NOW - LATEST_TS) / 3600 ))
    if [ "$AGE_HOURS" -gt 48 ]; then
        echo "FAIL: Latest backup is ${AGE_HOURS}h old (threshold: 48h)" >> "$LOG"
        ERRORS=$((ERRORS + 1))
    else
        echo "OK: Latest backup is ${AGE_HOURS}h old" >> "$LOG"
    fi
fi
# Check Proxmox Backup Server status
if ! curl -sk https://pbs.local:8007/api2/json/status \
    -H "Authorization: PBSAPIToken=verify@pbs!verify:token" \
    | jq -e '.data' > /dev/null 2>&1; then
    echo "FAIL: Cannot reach Proxmox Backup Server" >> "$LOG"
    ERRORS=$((ERRORS + 1))
else
    echo "OK: Proxmox Backup Server reachable" >> "$LOG"
fi
# Report results
if [ "$ERRORS" -gt 0 ]; then
    echo "ALERT: $ERRORS backup verification failures" >> "$LOG"
    mail -s "Backup Verification FAILED ($ERRORS errors)" "$ALERT_EMAIL" < "$LOG"
else
    echo "All backup checks passed" >> "$LOG"
fi
Check Off-Site Backup Freshness
If you push backups to a remote location (cloud storage, a friend's server, a safe deposit box), verify that the remote copy is current:
# Check Backblaze B2 latest backup timestamp
b2 ls --long your-bucket-name | tail -5
# Check rclone remote
rclone lsf --max-depth 1 --format tp remote:backups/ | sort | tail -5
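The `tail -5` eyeball check can be turned into a pass/fail test. A sketch of the age logic (the timestamp would come from `rclone lsf --format t ... | sort | tail -1`; a fixed past date stands in here so the logic is visible):

```shell
#!/bin/bash
# Sketch: fail if the newest remote object is older than a threshold.
is_stale() {  # usage: is_stale "<timestamp>" <max_age_hours>
  local ts_epoch now age_hours
  ts_epoch=$(date -d "$1" +%s) || return 2
  now=$(date +%s)
  age_hours=$(( (now - ts_epoch) / 3600 ))
  [ "$age_hours" -gt "$2" ]
}
# Fixed past date as a stand-in for the parsed rclone/b2 timestamp
if is_stale "2024-01-01 03:00:00" 48; then
  echo "STALE: off-site backup older than 48h"
else
  echo "OK: off-site backup is current"
fi
```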
Step 3: Write Recovery Procedures
This is the core of your DR plan. For each disaster scenario, write step-by-step instructions that someone (or a stressed, panicked version of yourself) can follow.
Scenario: Single Service Failure
## Recovery: Single Container/VM Down
Time to recover: 15-30 minutes
1. SSH into Proxmox host: `ssh [email protected]`
2. Check container/VM status: `pct list` or `qm list`
3. Try starting it: `pct start <id>` or `qm start <id>`
4. If it won't start, check logs: `journalctl -u pve-container@<id>`
5. If corrupted, restore from backup:
a. Open Proxmox web UI > Datacenter > Storage > PBS
b. Select the latest backup for the affected CT/VM
c. Click Restore, select target storage
d. Start the restored CT/VM
6. Verify the service is working (check its web UI, run a health check)
7. Investigate the root cause so it does not happen again
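Step 6's "verify the service is working" can be scripted as a quick health sweep. A sketch using the IPs and ports from the Service Inventory above (the Home Assistant port 8123 is its standard default, an assumption here; adjust the map to your own services):

```shell
#!/bin/bash
# Sketch: post-restore health sweep over the Service Inventory.
# Hosts/ports mirror this article's example inventory.
declare -A services=(
  [jellyfin]="10.0.20.20:8096"
  [pihole]="10.0.20.21:80"
  [homeassistant]="10.0.30.10:8123"
)
for name in "${!services[@]}"; do
  if curl -fsS --max-time 2 "http://${services[$name]}/" > /dev/null 2>&1; then
    echo "OK:   $name"
  else
    echo "DOWN: $name"
  fi
done
```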
Scenario: Full Server Failure
## Recovery: Primary Proxmox Host Down
Time to recover: 2-4 hours
### Immediate (First 15 minutes)
1. Determine the failure (PSU? Motherboard? Disk?)
2. If disk failure only, replace disk and reinstall Proxmox
3. If hardware failure, obtain replacement hardware (spare, new purchase)
### Reinstall Proxmox (30-60 minutes)
1. Download latest Proxmox ISO from proxmox.com
2. Write to USB (triple-check the target device first): `dd if=proxmox.iso of=/dev/sdX bs=1M status=progress conv=fsync`
3. Install Proxmox on new/repaired hardware
4. Post-install: disable enterprise repo, add no-subscription repo
5. Configure network interfaces to match original (see Hardware Inventory)
6. Configure storage pools (ZFS pool import or recreation)
### Restore VMs and Containers (30-90 minutes)
1. Add Proxmox Backup Server storage in web UI
2. For each CT/VM in Service Inventory:
a. Restore from most recent PBS backup
b. Verify network configuration (IP, VLAN, bridge)
c. Start and verify service health
3. Priority order for restoration:
- CT 101 (Pi-hole) -- DNS is critical for everything else
- CT 100 (Docker host) -- most services live here
- VM 200 (Home Assistant) -- family depends on automations
### Verify (15-30 minutes)
1. Check all services from the Service Inventory
2. Verify NFS mounts to NAS are working
3. Check backup jobs are re-scheduled
4. Run a DNS query through Pi-hole to verify
5. Open Home Assistant, verify automations are running
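The restore priority in step 3 can be encoded as data so the order survives a panicked 2 a.m. recovery. A bash sketch (IDs and names mirror the Service Inventory; the actual restore commands are left as a comment):

```shell
#!/bin/bash
# Sketch: restore order as data, so it is never decided ad hoc.
restore_order=(
  "101:pihole:DNS is critical for everything else"
  "100:docker-host:most services live here"
  "200:homeassistant:family depends on automations"
)
for entry in "${restore_order[@]}"; do
  IFS=: read -r id name why <<< "$entry"
  echo "Restoring $id ($name) -- $why"
  # On a real host: pct restore / qmrestore from the PBS backup here,
  # then a health check before moving to the next entry.
done
```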
Scenario: Total Loss (Fire, Flood, Theft)
## Recovery: Total Loss of All Equipment
Time to recover: 1-3 days (depends on hardware acquisition)
### Phase 1: Hardware Acquisition (Day 1)
1. Order replacement hardware (see Hardware Inventory for specs)
2. Minimum viable: one server + one switch + one router
3. Budget option: any modern mini PC with 16+ GB RAM
### Phase 2: Base Infrastructure (Day 1-2)
1. Install OPNsense on router hardware
2. Restore OPNsense config from off-site backup:
- Location: Backblaze B2 bucket "homelab-dr"
- Path: /opnsense/config-latest.xml
- Decrypt with age: `age -d -i ~/.secrets/age-key.txt config.xml.age`
3. Configure switch with VLAN layout (see VLAN documentation)
4. Install Proxmox on server
### Phase 3: Critical Services (Day 2)
1. Restore from off-site backups (Backblaze B2):
- Download Borg repository: `rclone sync b2:homelab-borg /mnt/restore/borg`
- Extract critical containers
2. Priority: DNS (Pi-hole) > VPN (WireGuard) > NAS > Docker host
### Phase 4: Full Restoration (Day 2-3)
1. Restore remaining VMs and containers from off-site backups
2. Verify all services against Service Inventory checklist
3. Re-establish backup schedules
4. Update documentation with any changes made during recovery
### Credentials Needed (all in Bitwarden):
- Proxmox root password
- Borg backup passphrase
- Backblaze B2 application key
- OPNsense admin password
- Age decryption key (also at physical location: safe deposit box)
Step 4: Off-Site Backup Strategy
Local backups protect against accidental deletion and hardware failure. They do not protect against disasters that destroy all equipment in your home. You need at least one off-site copy.
Cloud Storage
The most practical option for most homelabbers. Push encrypted backups to a cloud provider:
# Borg backup with rclone to Backblaze B2
# Daily cron job
borg create /mnt/backup/borg-repo::'{hostname}-{now}' \
/etc /home /var/lib/docker/volumes \
--exclude '*.tmp' --exclude 'node_modules'
# Sync to B2 (only uploads changes)
rclone sync /mnt/backup/borg-repo b2:homelab-borg \
--transfers 4 --bwlimit 10M
# Also sync critical configs separately for quick access
rclone copy /etc/pve b2:homelab-dr/proxmox-config/
rclone copy /root/opnsense-backup.xml b2:homelab-dr/opnsense/
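Because the sync mirrors whatever is in the local repo, the off-site copy only stays affordable if you prune old archives locally. A cron sketch (the schedule and retention numbers are illustrative; tune them to your needs):

```
# /etc/cron.d/borg-prune -- sketch; retention numbers are illustrative
# Prune old archives, then compact; the next rclone sync shrinks the B2 copy
15 4 * * * root borg prune /mnt/backup/borg-repo --keep-daily 7 --keep-weekly 4 --keep-monthly 6 && borg compact /mnt/backup/borg-repo
```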
Cost comparison for 500 GB of off-site backup:
| Provider | Monthly Cost | Egress Cost |
|---|---|---|
| Backblaze B2 | $2.50 | $0.01/GB |
| Wasabi | $3.50 | Free |
| AWS S3 Glacier Instant | $2.00 | $0.03/GB |
| Hetzner Storage Box | $3.81 (1 TB) | Free |
Friend/Family Exchange
If you have a friend with a homelab, trade off-site backup space. You each host an encrypted backup for the other:
# Push encrypted Borg repo to friend's server via SSH
borg create ssh://friend@remote-server:22/backup/my-repo::'{now}' \
/important-data
# The data is encrypted with your Borg passphrase -- your friend cannot read it
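To cover the ransomware row in the threat model, the friend-hosted repo can also be made append-only, so a compromised machine on your side cannot delete or overwrite remote archives. A sketch of the entry in the friend's `~/.ssh/authorized_keys` (the key material and repo path are placeholders):

```
# Restrict your key to an append-only borg serve on the friend's server
command="borg serve --append-only --restrict-to-path /backup/my-repo",restrict ssh-ed25519 AAAA...your-key... you@laptop
```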
Physical Off-Site
For the most critical data (irreplaceable photos, legal documents), consider a physical off-site copy:
- Keep an encrypted USB drive or portable SSD at a family member's house
- Update it quarterly
- Store the encryption key separately (in your password manager, which itself syncs to the cloud)
# Create an encrypted backup to a USB drive
sudo cryptsetup luksFormat /dev/sdc1
sudo cryptsetup open /dev/sdc1 offsite-backup
sudo mkfs.ext4 /dev/mapper/offsite-backup
sudo mount /dev/mapper/offsite-backup /mnt/offsite
# Copy critical data
rsync -av --delete /mnt/important-data/ /mnt/offsite/
# Unmount and close
sudo umount /mnt/offsite
sudo cryptsetup close offsite-backup
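A quarterly copy is only trustworthy if you verify it, and comparing checksum manifests of the source and offsite trees is one way to do that. A sketch using temp directories as stand-ins for /mnt/important-data and /mnt/offsite:

```shell
#!/bin/bash
# Sketch: verify an offsite copy by checksum manifest. Temp dirs stand in
# for /mnt/important-data and /mnt/offsite here.
src=$(mktemp -d); dst=$(mktemp -d)
echo "irreplaceable" > "$src/photo.jpg"   # sample data
cp -a "$src/." "$dst/"                    # the quarterly copy step
( cd "$src" && find . -type f -exec sha256sum {} + | sort ) > /tmp/src.sums
( cd "$dst" && find . -type f -exec sha256sum {} + | sort ) > /tmp/dst.sums
if diff -q /tmp/src.sums /tmp/dst.sums > /dev/null; then
  echo "offsite copy verified"
else
  echo "MISMATCH -- do not trust this copy"
fi
```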
Step 5: Test Your DR Plan
A DR plan that has never been tested is just a wish list. Testing proves that your backups work, your procedures are correct, and you can actually execute the recovery.
Tabletop Exercise (Monthly)
Walk through your DR plan mentally. Pick a scenario and trace through the steps:
- "My Proxmox host's SSD died. Can I recover? Do I have a spare SSD? Where is the Proxmox ISO? Do I have the PBS credentials handy?"
- "A docker volume got corrupted. Can I restore just that volume from Borg? Have I ever done a selective Borg restore?"
Partial Restore Test (Quarterly)
Actually restore something non-critical to verify the process:
# Restore a non-critical container to a different ID
# This proves the backup works without disrupting production
# (pct restore is for LXC containers; qmrestore is the VM equivalent)
pct restore 999 /var/lib/vz/dump/vzdump-lxc-101-latest.tar.zst \
--storage local-zfs
# Start it, verify it works
pct start 999
pct exec 999 -- pihole status
# Delete the test restore
pct stop 999
pct destroy 999
Full DR Test (Annually)
Once a year, do a full recovery test. If you have spare hardware (even a laptop), attempt to rebuild your critical services from backups alone:
- Pretend your main hardware is gone
- Install Proxmox on the spare hardware
- Restore from your off-site backups only (not local backups)
- See how far you get and how long it takes
- Document every gap, missing credential, and unclear step
- Update your DR plan with what you learned
This is uncomfortable and time-consuming. It is also the only way to know whether your DR plan actually works.
Post-Test Review
After every test, update your DR plan:
## DR Test Log
### 2026-02-15 - Quarterly partial restore test
- Restored CT 101 (Pi-hole) from PBS backup: SUCCESS
- Time: 4 minutes to restore, 1 minute to verify
- Issue found: DNS upstream servers were not in the backup
(configured in Pi-hole web UI, not in a config file)
- Action: Add Pi-hole setupVars.conf to Borg backup paths
Maintaining Your DR Plan
A DR plan is a living document. Update it whenever you:
- Add or remove hardware
- Create or decommission a service
- Change your backup strategy
- Change credentials or encryption keys
- Move to a new off-site backup provider
- Learn something from a test or an actual incident
Set a calendar reminder to review the plan quarterly. Check that the hardware inventory is current, the service list matches reality, and the recovery procedures still make sense.
The best DR plan is one you hope you never need but know will work if you do. The 4-6 hours you spend building and testing it will save you days of panic and data loss when something inevitably goes wrong.
