Proxmox VE Clustering for High Availability
A single Proxmox server is a great foundation for a homelab. Two or three Proxmox servers in a cluster unlock the features that make virtualization genuinely powerful: live migration (move running VMs between hosts without downtime), high availability (VMs restart automatically on a surviving node when a host fails), and centralized management of all your virtualization infrastructure from one web interface.
Proxmox VE clustering is built on mature Linux technologies — Corosync for cluster communication, a distributed configuration filesystem (pmxcfs), and the Proxmox HA manager for failover. It works on commodity hardware, doesn't require matching configurations across nodes, and the setup takes about 15 minutes once you understand the requirements.

Prerequisites
Before creating a cluster, make sure your environment meets these requirements:
Network:
- All nodes must be able to reach each other on a dedicated cluster network (ideally a separate NIC or VLAN)
- Low latency between nodes (<2ms round trip). Nodes across a WAN link won't work reliably
- The cluster network carries Corosync heartbeats. If it goes down, the cluster assumes a node is dead
Hostnames and DNS:
- Each node needs a unique hostname
- Hostnames must resolve correctly. Check /etc/hosts on each node: the hostname must NOT resolve to 127.0.0.1. It must point to the node's real cluster network IP
Time:
- All nodes must have synchronized time. Install and configure chrony or NTP on every node
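A minimal way to cover this on Debian-based Proxmox nodes (a sketch; chrony is usually already present on recent Proxmox versions):
# On every node: install chrony and confirm the clock is actually synchronized
apt install chrony
chronyc tracking
timedatectl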
Fresh or compatible installations:
- Joining a node to a cluster replaces its existing Proxmox configuration (/etc/pve) with the cluster's. The joining node must not have any VMs or containers configured; migrate or back them up and remove them before joining. Plan accordingly.
Example Setup
For this guide:
- pve1: 192.168.1.101 (cluster creation node)
- pve2: 192.168.1.102
- pve3: 192.168.1.103
- Cluster network: same subnet, dedicated VLAN preferred
Verify /etc/hosts on each node:
192.168.1.101 pve1
192.168.1.102 pve2
192.168.1.103 pve3
Do NOT have entries like 127.0.1.1 pve1 — this causes Corosync binding issues.
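A quick way to confirm resolution is correct on each node (hostnames and IPs here are from the example setup above):
# Should print the node's cluster IP (e.g. 192.168.1.101), not 127.x.x.x
hostname --ip-address
getent hosts pve1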
Creating the Cluster
On the first node (pve1), create the cluster:
pvecm create homelab-cluster
That's it. One command. Verify it:
pvecm status
You should see a cluster with one node. The web UI (https://192.168.1.101:8006) now shows the cluster name.
Specifying the Cluster Network
If you have a dedicated cluster network interface, specify it during creation:
pvecm create homelab-cluster --link0 192.168.10.101
This binds Corosync to the dedicated interface. For redundancy, add a second link:
pvecm create homelab-cluster --link0 192.168.10.101 --link1 192.168.1.101
Dual links mean the cluster survives the failure of one network path.
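Once the cluster exists, you can check which links Corosync is actually using; corosync-cfgtool ships with Corosync and reports per-link status:
# Show the local node ID and the status of each Corosync link
corosync-cfgtool -s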
Joining Nodes
On pve2 and pve3, join the cluster by pointing at any existing cluster member:
# On pve2
pvecm add 192.168.1.101
# On pve3
pvecm add 192.168.1.101
You'll be prompted for the root password of the target node. After joining, verify:
pvecm status
All three nodes should appear in the output with their quorum votes. The web UI on any node now shows all three nodes and their VMs/containers.
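pvecm also has a nodes subcommand that prints a compact membership list, which is handy for a quick check after each join:
# List cluster members, their node IDs, and their votes
pvecm nodes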
If Joining Fails
Common issues:
- SSH key conflicts: Clear /root/.ssh/known_hosts entries for the cluster nodes and retry (see the sketch after this list)
- Hostname resolution: Double-check /etc/hosts on all nodes
- Firewall: Ports 8006 (web UI), 5405-5412/udp (Corosync), 22 (SSH), and 60000-60050 (live migration) must be open between nodes
- Time skew: Large clock differences between nodes can break certificate validation and cluster communication, so fix time sync before retrying
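For the SSH key case, stale entries can be removed per host; a minimal sketch using the hostnames and IPs from the example setup:
# On the joining node, drop any old host keys for existing cluster members
ssh-keygen -R 192.168.1.101 -f /root/.ssh/known_hosts
ssh-keygen -R pve1 -f /root/.ssh/known_hosts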
Quorum
A cluster with three nodes requires at least two nodes to be online to have quorum (a majority). Without quorum, the cluster becomes read-only to prevent split-brain scenarios.
- 3 nodes: Can tolerate 1 node failure
- 2 nodes: No fault tolerance (losing either node loses quorum). This is problematic — see below
- 5 nodes: Can tolerate 2 failures
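If you ever lose quorum and need to administer the last surviving node, you can temporarily lower the expected vote count. This is a runtime change and should only be used when you are certain the missing nodes are really down:
# Tell Corosync to treat a single vote as quorate so the node is writable again
pvecm expected 1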
The Two-Node Problem
Two Proxmox nodes can form a cluster, but if either goes down, the survivor doesn't have quorum and HA won't function. Solutions:
- QDevice (Corosync QNetd): Add a lightweight third vote from a Raspberry Pi or any Linux machine. It doesn't run Proxmox; it just provides the tie-breaking vote.
# On the QDevice host (any small Linux box)
sudo apt install corosync-qnetd
# On every Proxmox node
apt install corosync-qdevice
# On one Proxmox node, register the QDevice
pvecm qdevice setup 192.168.1.200
- Three nodes: Even a modest third node (a mini PC or old laptop running Proxmox) provides genuine three-way quorum.
For homelabs, the QDevice approach is popular because it doesn't require a third full server.
Shared Storage for Live Migration and HA
Live migration and HA require that VM disk images are accessible from all nodes simultaneously. A VM can only move to another node if that node can access the same disk.
Options for shared storage:
NFS
The simplest option. Export a directory from your NAS and add it as storage on all Proxmox nodes:
Datacenter > Storage > Add > NFS
Server: 192.168.1.50
Export: /mnt/pool/proxmox
Content: Disk image, ISO image, Container template
NFS works well for homelab clusters. Performance is adequate for most workloads, and setup is trivial.
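The same storage can be added from the shell with pvesm instead of the web UI; a sketch using the example NAS above (the storage ID nas-vmstore is arbitrary):
# Add the NFS export as storage available to the whole cluster
pvesm add nfs nas-vmstore --server 192.168.1.50 --export /mnt/pool/proxmox --content images,iso,vztmpl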
Ceph (Built Into Proxmox)
Proxmox includes Ceph integration. Each node contributes local disks to a distributed storage pool. No external NAS needed. This is the most "proper" solution but requires:
- At least 3 nodes (for replication)
- Dedicated SSDs or disks on each node for Ceph OSDs
- A dedicated network for Ceph traffic (10 GbE recommended)
For a three-node homelab cluster, Ceph with SSDs provides excellent performance and redundancy. Setup is done through the Proxmox web UI under Datacenter > Ceph.
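There is also a pveceph CLI if you prefer the shell; the flow looks roughly like this (a sketch: the 10.10.10.0/24 network and /dev/sdb device are placeholders for your own Ceph network and disks):
# On every node: install the Ceph packages
pveceph install
# On one node: initialize Ceph with the dedicated storage network
pveceph init --network 10.10.10.0/24
# On each node: create a monitor and an OSD on a dedicated disk
pveceph mon create
pveceph osd create /dev/sdb
# Create a replicated pool to hold VM disks
pveceph pool create vm-ceph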
iSCSI
Presents a block device from your NAS to all Proxmox nodes. Better raw performance than NFS for I/O-intensive VMs. More complex to set up but well-supported by Proxmox.
ZFS over iSCSI
If your NAS runs ZFS, you can expose ZFS volumes as iSCSI targets. Proxmox has a dedicated storage plugin for this.
Enabling High Availability
With shared storage in place, enabling HA for a VM or container is straightforward.
Via the Web UI
- Select a VM or container
- Go to More > Manage HA
- Set the HA group and priority
- Choose the requested state (started, stopped, disabled)
Via the Command Line
# Add VM 100 to HA with max_restart of 3
ha-manager add vm:100 --state started --max_restart 3 --max_relocate 1
# Check HA status
ha-manager status
HA Groups
HA groups define which nodes a VM can run on and their priority:
# Create a group
ha-manager groupadd preferred-nodes --nodes pve1,pve2 --nofailback 0
- nodes — Which cluster nodes are eligible to host VMs in this group
- nofailback — If set to 0, VMs migrate back to the preferred node when it recovers. If 1, they stay where they are after failover
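To attach an existing HA resource to the group (VM 100 was added to HA earlier), update it with the group option:
# Restrict the HA resource for VM 100 to the preferred-nodes group
ha-manager set vm:100 --group preferred-nodes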
What Happens During a Node Failure
- Corosync detects the node is unreachable (after ~30 seconds of missed heartbeats)
- The HA manager on a surviving node takes over management responsibility
- The failed node is fenced (more on this below)
- HA-managed VMs from the failed node are restarted on surviving nodes
- Restart happens in priority order with configurable delays
Total failover time is typically 1-3 minutes, depending on fencing method and VM boot time.
Fencing
Fencing ensures that a failed node is truly stopped before its VMs are started elsewhere. Without fencing, you risk two copies of the same VM running simultaneously, which corrupts data.
Proxmox supports several fencing methods:
Hardware Watchdog (Recommended for Homelabs)
Most server hardware has an IPMI/iDRAC/iLO watchdog timer. If the node stops refreshing the watchdog, the hardware forces a reboot.
# Check if a hardware watchdog is available
ls /dev/watchdog*
# Proxmox uses the softdog module as a fallback
Proxmox configures the HA manager to use a watchdog by default. If the HA manager on a node loses cluster communication, the watchdog triggers a reboot after a timeout, ensuring the node doesn't continue running VMs that are being started elsewhere.
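If your board does expose a hardware watchdog, you can tell the HA stack to load that driver instead of softdog. The module name below (iTCO_wdt, common on Intel boards) is an example; check which driver matches your hardware, and reboot the node afterwards:
# /etc/default/pve-ha-manager
WATCHDOG_MODULE=iTCO_wdt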
IPMI Fencing
For more reliable fencing, configure IPMI so surviving nodes can force-power-off the failed node:
# Test IPMI connectivity
ipmitool -I lanplus -H 192.168.1.201 -U admin -P password power status
Configure in /etc/pve/ha/fence.cfg:
device ipmi pve1 {
cmd "ipmitool -I lanplus -H 192.168.1.201 -U admin -P password power off"
}
Live Migration
With shared storage, you can move running VMs between nodes with zero downtime:
Via the Web UI
Right-click a VM > Migrate > Select target node > Migrate
Via the Command Line
# Live migrate VM 100 to pve2
qm migrate 100 pve2 --online
Live migration copies the VM's RAM contents to the target node while it continues running, then switches over in the final milliseconds. The VM experiences a brief pause (typically under 100ms) during the switchover.
Requirements for live migration:
- Shared storage for the VM's disk
- Sufficient RAM on the target node
- Same CPU vendor (Intel to Intel, AMD to AMD). Mixed CPU generations work with the VM's CPU type set to a common baseline (e.g., x86-64-v2-AES); see the example after this list
- Network connectivity between nodes on the migration network (ports 60000-60050)
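Setting the baseline CPU type is a one-line change per VM (the VM needs to be powered off and on again for it to take effect):
# Use a generic CPU model so VM 100 can migrate between different CPU generations
qm set 100 --cpu x86-64-v2-AES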
Cluster Network Best Practices
Separate cluster traffic from VM traffic. Corosync heartbeats are small but latency-sensitive. If your cluster network shares bandwidth with a large VM backup or migration, missed heartbeats can trigger false failovers.
Use a dedicated VLAN or physical NIC for Corosync. Even a separate 1 GbE link dedicated to cluster traffic is better than sharing a 10 GbE link with everything else.
Use link bonding for redundancy. A single network cable failure shouldn't partition your cluster. Bond two interfaces or use dual Corosync links.
Set up a dedicated migration network. Under Datacenter > Options > Migration Settings, specify a network for live migration traffic. This prevents large migrations from saturating your cluster or production network.
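The migration network can also be set by editing /etc/pve/datacenter.cfg directly; the 192.168.20.0/24 subnet below is a placeholder for whatever network you dedicate to migration traffic:
# /etc/pve/datacenter.cfg
migration: secure,network=192.168.20.0/24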
Maintenance
Removing a Node
If you need to permanently remove a node:
- Migrate all VMs and containers off the node
- Remove HA resources from that node
- Shut down the node being removed (it must not come back online with its old cluster configuration)
- On a remaining node, remove it from the cluster:
pvecm delnode NODENAME
Updating the Cluster
Update nodes one at a time. Migrate VMs off a node, update it, reboot, verify it rejoins the cluster, then move to the next node. This rolling update approach keeps your services available throughout.
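In practice, a rolling update of one node looks roughly like this (a sketch; VM 100 and the node names come from the example setup):
# 1. Move guests off the node being updated (repeat per VM/CT)
qm migrate 100 pve2 --online
# 2. Update and reboot the node
apt update && apt dist-upgrade
reboot
# 3. Once it is back, confirm it rejoined the cluster before updating the next node
pvecm status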
A Proxmox cluster transforms your homelab from "a couple of servers" into genuine infrastructure. VMs survive hardware failures, maintenance doesn't require downtime, and you manage everything from a single interface. The setup is straightforward enough to complete in an afternoon, and the operational benefits are immediate.