Proxmox Clustering: High Availability for Your Self-Hosted Infrastructure
Turn a single Proxmox node into a resilient cluster. Corosync, quorum, live migration, shared storage with Ceph, and fencing — everything you need for self-hosted HA without VMware pricing.
Running a single Proxmox node works fine until it does not. A failed disk, a kernel panic, or a bad update can take down every virtual machine and container in one shot. For homelab setups that host anything important — a family photo server, a business application, a personal VPN — that kind of downtime is painful. Proxmox VE supports full multi-node clustering with built-in high availability, shared storage via Ceph, and live VM migration, bringing enterprise-grade resilience to self-hosted infrastructure at zero licensing cost.
Why Cluster Proxmox?
A single Proxmox host is a single point of failure. Clustering solves this by spreading workloads across multiple physical nodes so that when one node goes offline, its virtual machines can restart automatically on a surviving node.
Beyond redundancy, a Proxmox cluster gives you:
- Live migration: Move running VMs between nodes with zero downtime
- Centralized management: One web interface manages all nodes simultaneously
- Shared storage: Ceph or NFS allows any node to access any VM disk
- HA groups and priorities: Define exactly which VMs are critical and where they should land
The minimum viable cluster is three nodes. With only two, a lost link leaves each node unable to tell whether its peer is dead or merely unreachable, the classic split-brain problem. Three nodes solve this through quorum.
Understanding Corosync and Quorum
Corosync is the cluster communication layer underneath Proxmox. Every node sends heartbeat messages to every other node. When a node stops responding, the surviving nodes vote on whether to continue operating.
Quorum requires more than half the cluster nodes to agree before taking any action. In a three-node cluster, two nodes constitute a quorum.
# Check quorum and membership from any node
pvecm status
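The majority rule is plain integer arithmetic; as a quick illustration (quorum here is a hypothetical helper, assuming one vote per node):

```shell
# Votes needed for quorum among n one-vote nodes: floor(n/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }
quorum 3   # → 2: survives one node loss
quorum 4   # → 3: still only survives one loss
quorum 5   # → 3: survives two
```

Note that even node counts add hardware without adding failure tolerance, which is one more reason three (or five) nodes is the usual shape.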
The Corosync Ring
Corosync supports multiple network rings for redundancy. Ring 0 is the primary cluster communication path. Ring 1 is optional backup. The cluster network should be isolated from VM traffic and management traffic.
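As a sketch, each node's entry in /etc/pve/corosync.conf carries one address per ring (names and addresses here are examples):

```
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.0.1   # ring 0: dedicated cluster network
    ring1_addr: 10.20.0.1   # ring 1: independent fallback path
  }
  # ...one node block per cluster member
}
```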
Step-by-Step Cluster Creation
Prerequisites:
- Same Proxmox VE version on all nodes
- All nodes reachable by hostname
- SSH root access between all nodes
- Dedicated network interface for cluster traffic
- NTP synchronized on all nodes
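Hostname resolution is easiest to make reliable with identical /etc/hosts entries on every node (names and addresses are examples):

```
# /etc/hosts — keep identical on all nodes
10.10.0.1  node1
10.10.0.2  node2
10.10.0.3  node3
```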
Step 1: Create the Cluster
# Run on the first node; --ring0_addr is the legacy flag name, newer releases use --link0
pvecm create my-cluster --ring0_addr 10.10.0.1
# Confirm the new cluster is up and quorate
pvecm status
Step 2: Add the Second Node
# Run on the second node, pointing at the first node's cluster address
pvecm add 10.10.0.1 --ring0_addr 10.10.0.2
# List the cluster membership
pvecm nodes
Step 3: Add the Third Node
# Run on the third node
pvecm add 10.10.0.1 --ring0_addr 10.10.0.3
# All three nodes should now show as quorate members
pvecm status
Shared Storage with Ceph
Ceph is a distributed storage system that Proxmox supports natively. Unlike NFS or iSCSI, Ceph has no single point of failure — storage is distributed across all nodes.
Ceph Components
- MON (Monitor): Tracks cluster state and quorum. One per node in a three-node cluster; three monitors are enough even as the cluster grows.
- OSD (Object Storage Daemon): Manages one physical disk each. Three or more are required for the default three-way replication.
- MGR (Manager): Provides the dashboard and metrics. One active instance with standbys; running one per node is common.
Installing and Configuring Ceph
# Install on each node
pveceph install
# Initialize on first node
pveceph init --network 10.10.0.0/24
# Add monitors on each node
pveceph mon create
# Add OSDs (one per disk, per node)
pveceph osd create /dev/sdb
# Create storage pool
pveceph pool create vm-pool --add_storages true
Live Migration
With shared Ceph storage, migrate running VMs without downtime:
qm migrate <vmid> <target-node> --online
Live migration is invaluable for maintenance. Before rebooting a node for a kernel update, migrate all VMs to other nodes, perform the update, and migrate them back.
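That drain step can be scripted. The sketch below defines drain_node, a hypothetical helper (not a built-in Proxmox command) that live-migrates every running VM to a target node; it relies on the column layout of qm list and assumes shared storage:

```shell
# Hypothetical helper: live-migrate all running VMs on this node to the node named in $1.
# Assumes shared storage so --online migration is possible.
drain_node() {
  local target="$1"
  # qm list prints: VMID NAME STATUS ...; keep only running VMIDs
  qm list | awk 'NR > 1 && $3 == "running" { print $1 }' | while read -r vmid; do
    qm migrate "$vmid" "$target" --online
  done
}
# Usage (run on the node being drained): drain_node node2
```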
HA Groups and Fencing
What is Fencing?
Fencing guarantees that a failed or unreachable node is truly out of action before its VMs are restarted elsewhere. Without fencing, two nodes could write to the same VM disk simultaneously, causing data corruption. Proxmox implements self-fencing: a node that loses quorum is reset by a watchdog timer, by default the Linux softdog module.
Enable the watchdog:
# Load the software watchdog (Proxmox's watchdog-mux uses softdog by default)
modprobe softdog
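If the hardware has a real watchdog (IPMI, for example), Proxmox can be told to use it instead of softdog via /etc/default/pve-ha-manager; the module name below is an example, check which driver matches your board:

```
# /etc/default/pve-ha-manager
WATCHDOG_MODULE=ipmi_watchdog
```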
Enabling HA for a VM
# Keep the VM running; retry a failed start up to 3 times before relocating it
ha-manager add vm:<vmid> --state started --max_restart 3
Creating HA Groups
# node1 gets priority 2 (higher is preferred), so the VM lands on node1 when it is available
ha-manager groupadd primary-group --nodes node1:2,node2:1,node3:1
ha-manager set vm:<vmid> --group primary-group
Monitoring HA Status
ha-manager status
pvesh get /cluster/ha/resources
Networking Considerations
Three-Network Architecture
- Management network: Web interface, SSH, API calls
- Cluster/Corosync network: Heartbeats, cluster state. Must be low-latency and reliable.
- Storage/migration network: Ceph replication, live migration. High-bandwidth, ideally 10GbE.
This separation ensures that a VM migration saturating the storage network does not cause Corosync heartbeat timeouts.
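One way to express the three-network split on each node, as a sketch in /etc/network/interfaces (interface names and addresses are examples):

```
# Management: bridged so VMs can also use it
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11/24
    gateway 192.168.1.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

# Corosync: plain interface, no bridge needed
auto eno2
iface eno2 inet static
    address 10.10.0.1/24

# Storage/migration: 10GbE with jumbo frames
auto eno3
iface eno3 inet static
    address 10.10.1.1/24
    mtu 9000
```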
Corosync Tuning
On busy or slightly lossy networks, raise the totem token timeout so brief stalls do not trigger membership changes. On Proxmox, edit the cluster-wide copy at /etc/pve/corosync.conf (and increment config_version so the change propagates), not the per-node file:
# totem section
totem {
  token: 3000                              # ms to wait before declaring the token lost
  token_retransmits_before_loss_const: 10  # retransmissions before a member is declared gone
}
MTU and Jumbo Frames
Enable jumbo frames (MTU 9000) on storage and cluster networks for significantly improved Ceph throughput. Ensure switches support it on the relevant ports.
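The usual end-to-end check is a non-fragmenting ping with a payload sized exactly to the MTU; the arithmetic, with the ping itself left commented (peer address is an example):

```shell
# ICMP payload that exactly fills a 9000-byte MTU:
# 9000 - 20 (IPv4 header) - 8 (ICMP header) = 8972
mtu=9000
payload=$(( mtu - 20 - 8 ))
echo "$payload"   # → 8972
# Verify against a peer on the storage network:
# ping -M do -s "$payload" -c 3 10.10.1.2
```

If the ping fails while a standard ping succeeds, some device in the path is still at MTU 1500.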
Monitoring the Cluster
# Cluster membership and quorum
pvecm status
pvecm nodes
# Ceph health, OSD, and placement-group states
ceph status
# HA resource states and the current HA master
ha-manager status
Prometheus Integration
The pve-exporter project provides a dedicated Prometheus exporter for Proxmox that exposes VM-level metrics, storage usage, and node health in a Grafana-ready format.
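A minimal scrape job might look like the following, assuming the exporter listens on its default port 9221 with the /pve metrics path (hostname is an example; the exporter's README documents the authoritative config, including multi-node relabeling):

```yaml
scrape_configs:
  - job_name: 'proxmox'
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets: ['pve-exporter.local:9221']
```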
Log Monitoring
journalctl -u corosync -f
Ceph writes its logs under /var/log/ceph/ on each node. Forward all cluster node logs to Loki or Graylog so that visibility survives a full cluster outage.
Summary
A three-node Proxmox cluster with Ceph shared storage and HA groups transforms a homelab from a collection of individual machines into a resilient platform. Corosync maintains cluster consensus, Ceph ensures VM disks are accessible from any node, and the HA manager handles automatic failover. The operational benefits go beyond redundancy — live migration makes hardware maintenance non-disruptive and Ceph scales horizontally. A single node failure becomes a minor event rather than an emergency.