Proxmox Clustering: High Availability for Your Self-Hosted Infrastructure
Turn a single Proxmox node into a resilient cluster. Corosync, quorum, live migration, shared storage with Ceph, and fencing — everything you need for self-hosted HA without VMware pricing.
Running a single Proxmox node works fine until it does not. A failed disk, a kernel panic, or a bad update can take down every virtual machine and container in one shot. For homelab setups that host anything important — a family photo server, a business application, a personal VPN — that kind of downtime is painful. Proxmox VE supports full multi-node clustering with built-in high availability, shared storage via Ceph, and live VM migration, bringing enterprise-grade resilience to self-hosted infrastructure at zero licensing cost.
Why Cluster Proxmox?
A single Proxmox host is a single point of failure. Clustering solves this by spreading workloads across multiple physical nodes so that when one node goes offline, its virtual machines can restart automatically on a surviving node.
Beyond redundancy, a Proxmox cluster gives you:
- Live migration: Move running VMs between nodes with zero downtime
- Centralized management: One web interface manages all nodes simultaneously
- Shared storage: Ceph or NFS allows any node to access any VM disk
- HA groups and priorities: Define exactly which VMs are critical and where they should land
The minimum viable cluster is three nodes. With only two, a lost link leaves each node unable to tell whether its peer is dead or merely unreachable, the classic split-brain problem. Three nodes solve this through quorum.
Understanding Corosync and Quorum
Corosync is the cluster communication layer underneath Proxmox. Every node sends heartbeat messages to every other node. When a node stops responding, the surviving nodes vote on whether to continue operating.
Quorum requires more than half the cluster nodes to agree before taking any action. In a three-node cluster, two nodes constitute a quorum.
# Check quorum and membership from any node
pvecm status
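The majority rule is plain integer arithmetic; as a quick illustration (quorum here is a hypothetical helper, assuming one vote per node):

```shell
# Votes needed for quorum among n one-vote nodes: floor(n/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }
quorum 3   # → 2: survives one node loss
quorum 4   # → 3: still only survives one loss
quorum 5   # → 3: survives two
```

Note that even node counts add hardware without adding failure tolerance, which is one more reason three (or five) nodes is the usual shape.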
The Corosync Ring
Corosync supports multiple network rings for redundancy. Ring 0 is the primary cluster communication path. Ring 1 is optional backup. The cluster network should be isolated from VM traffic and management traffic.
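As a sketch, each node's entry in /etc/pve/corosync.conf carries one address per ring (names and addresses here are examples):

```
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.0.1   # ring 0: dedicated cluster network
    ring1_addr: 10.20.0.1   # ring 1: independent fallback path
  }
  # ...one node block per cluster member
}
```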
Step-by-Step Cluster Creation
Prerequisites:
- Same Proxmox VE version on all nodes
- All nodes reachable by hostname
- SSH root access between all nodes
- Dedicated network interface for cluster traffic
- NTP synchronized on all nodes
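Hostname resolution is easiest to make reliable with identical /etc/hosts entries on every node (names and addresses are examples):

```
# /etc/hosts — keep identical on all nodes
10.10.0.1  node1
10.10.0.2  node2
10.10.0.3  node3
```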
Step 1: Create the Cluster
# Run on the first node; --ring0_addr is the legacy flag name, newer releases use --link0
pvecm create my-cluster --ring0_addr 10.10.0.1
# Confirm the new cluster is up and quorate
pvecm status
Step 2: Add the Second Node
# Run on the second node, pointing at the first node's cluster address
pvecm add 10.10.0.1 --ring0_addr 10.10.0.2
# List the cluster membership
pvecm nodes
Step 3: Add the Third Node
# Run on the third node
pvecm add 10.10.0.1 --ring0_addr 10.10.0.3
# All three nodes should now show as quorate members
pvecm status
Shared Storage with Ceph
Ceph is a distributed storage system that Proxmox supports natively. Unlike NFS or iSCSI, Ceph has no single point of failure — storage is distributed across all nodes.
Ceph Components
- MON (Monitor): Tracks cluster state and quorum. One per node in a three-node cluster; three monitors are enough even as the cluster grows.
- OSD (Object Storage Daemon): Manages one physical disk each. Three or more are required for the default three-way replication.
- MGR (Manager): Provides the dashboard and metrics. One active instance with standbys; running one per node is common.
Installing and Configuring Ceph
# Install on each node
pveceph install
# Initialize on first node
pveceph init --network 10.10.0.0/24
# Add monitors on each node
pveceph mon create
# Add OSDs (one per disk, per node)
pveceph osd create /dev/sdb
# Create storage pool
pveceph pool create vm-pool --add_storages true
Live Migration
With shared Ceph storage, migrate running VMs without downtime:
qm migrate <vmid> <target-node> --online
Live migration is invaluable for maintenance. Before rebooting a node for a kernel update, migrate all VMs to other nodes, perform the update, and migrate them back.
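That drain step can be scripted. The sketch below defines drain_node, a hypothetical helper (not a built-in Proxmox command) that live-migrates every running VM to a target node; it relies on the column layout of qm list and assumes shared storage:

```shell
# Hypothetical helper: live-migrate all running VMs on this node to the node named in $1.
# Assumes shared storage so --online migration is possible.
drain_node() {
  local target="$1"
  # qm list prints: VMID NAME STATUS ...; keep only running VMIDs
  qm list | awk 'NR > 1 && $3 == "running" { print $1 }' | while read -r vmid; do
    qm migrate "$vmid" "$target" --online
  done
}
# Usage (run on the node being drained): drain_node node2
```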
HA Groups and Fencing
What is Fencing?
Fencing guarantees that a failed or unreachable node is truly out of action before its VMs are restarted elsewhere. Without fencing, two nodes could write to the same VM disk simultaneously, causing data corruption. Proxmox implements self-fencing: a node that loses quorum is reset by a watchdog timer, by default the Linux softdog module.
Enable the watchdog:
# Load the software watchdog (Proxmox's watchdog-mux uses softdog by default)
modprobe softdog
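If the hardware has a real watchdog (IPMI, for example), Proxmox can be told to use it instead of softdog via /etc/default/pve-ha-manager; the module name below is an example, check which driver matches your board:

```
# /etc/default/pve-ha-manager
WATCHDOG_MODULE=ipmi_watchdog
```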
Enabling HA for a VM
# Keep the VM running; retry a failed start up to 3 times before relocating it
ha-manager add vm:<vmid> --state started --max_restart 3
Creating HA Groups
# node1 gets priority 2 (higher is preferred), so the VM lands on node1 when it is available
ha-manager groupadd primary-group --nodes node1:2,node2:1,node3:1
ha-manager set vm:<vmid> --group primary-group
Monitoring HA Status
ha-manager status
pvesh get /cluster/ha/resources
Networking Considerations
Three-Network Architecture
- Management network: Web interface, SSH, API calls
- Cluster/Corosync network: Heartbeats, cluster state. Must be low-latency and reliable.
- Storage/migration network: Ceph replication, live migration. High-bandwidth, ideally 10GbE.
This separation ensures that a VM migration saturating the storage network does not cause Corosync heartbeat timeouts.
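One way to express the three-network split on each node, as a sketch in /etc/network/interfaces (interface names and addresses are examples):

```
# Management: bridged so VMs can also use it
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11/24
    gateway 192.168.1.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

# Corosync: plain interface, no bridge needed
auto eno2
iface eno2 inet static
    address 10.10.0.1/24

# Storage/migration: 10GbE with jumbo frames
auto eno3
iface eno3 inet static
    address 10.10.1.1/24
    mtu 9000
```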
Corosync Tuning
On busy or slightly lossy networks, raise the totem token timeout so brief stalls do not trigger membership changes. On Proxmox, edit the cluster-wide copy at /etc/pve/corosync.conf (and increment config_version so the change propagates), not the per-node file:
# totem section
totem {
  token: 3000                              # ms to wait before declaring the token lost
  token_retransmits_before_loss_const: 10  # retransmissions before a member is declared gone
}
MTU and Jumbo Frames
Enable jumbo frames (MTU 9000) on storage and cluster networks for significantly improved Ceph throughput. Ensure switches support it on the relevant ports.
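The usual end-to-end check is a non-fragmenting ping with a payload sized exactly to the MTU; the arithmetic, with the ping itself left commented (peer address is an example):

```shell
# ICMP payload that exactly fills a 9000-byte MTU:
# 9000 - 20 (IPv4 header) - 8 (ICMP header) = 8972
mtu=9000
payload=$(( mtu - 20 - 8 ))
echo "$payload"   # → 8972
# Verify against a peer on the storage network:
# ping -M do -s "$payload" -c 3 10.10.1.2
```

If the ping fails while a standard ping succeeds, some device in the path is still at MTU 1500.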
Monitoring the Cluster
# Cluster membership and quorum
pvecm status
pvecm nodes
# Ceph health, OSD, and placement-group states
ceph status
# HA resource states and the current HA master
ha-manager status
Prometheus Integration
The pve-exporter project provides a dedicated Prometheus exporter for Proxmox that exposes VM-level metrics, storage usage, and node health in a Grafana-ready format.
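A minimal scrape job might look like the following, assuming the exporter listens on its default port 9221 with the /pve metrics path (hostname is an example; the exporter's README documents the authoritative config, including multi-node relabeling):

```yaml
scrape_configs:
  - job_name: 'proxmox'
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets: ['pve-exporter.local:9221']
```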
Log Monitoring
journalctl -u corosync -f
Ceph writes its logs under /var/log/ceph/ on each node. Forward all cluster node logs to Loki or Graylog so that visibility survives a full cluster outage.
Summary
A three-node Proxmox cluster with Ceph shared storage and HA groups transforms a homelab from a collection of individual machines into a resilient platform. Corosync maintains cluster consensus, Ceph ensures VM disks are accessible from any node, and the HA manager handles automatic failover. The operational benefits go beyond redundancy — live migration makes hardware maintenance non-disruptive and Ceph scales horizontally. A single node failure becomes a minor event rather than an emergency.