Layered Backup for a Kubernetes Homelab: snapper, btrbk, and a Raspberry Pi

Running Kubernetes on a single VPS is convenient — but it concentrates risk. Every application, every PVC, every bit of state lives on one machine. If that machine disappears, so does everything else. Backup is not optional here; it’s the difference between a recoverable incident and a weekend rebuild.

This post explains the backup strategy I settled on for my homelab and why the layers are the way they are. The full technical reference — including step-by-step restore commands — is in the homelab documentation.


The Problem with Kubernetes Backup

Kubernetes adds a layer of indirection that makes backup less obvious than on a traditional server. Application data lives in PVCs (/var/lib/pvc/), but a working system also depends on cluster state (/var/lib/microshift/), container images (/var/lib/containers/), and the operating system itself (/). A Kubernetes-native backup tool like Velero covers the application layer but not the host OS. A pure filesystem backup captures everything but doesn’t understand Kubernetes resources.

Since the entire stack runs on Btrfs subvolumes, the filesystem itself is the right abstraction. Every component — OS, cluster state, app data — is a separate subvolume, snapshottable independently, and sendable to a remote host with btrfs send/receive.
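As a rough sketch of that primitive, the function below prints the commands for replicating one subvolume with plain btrfs send/receive, either in full or incrementally against a parent snapshot. The host name, directories, and snapshot names are placeholders and not the actual setup; printing instead of executing keeps the sketch safe to try on any machine.

```shell
#!/bin/sh
# Sketch: plan a full or incremental transfer of one subvolume.
# All paths and the remote host are hypothetical placeholders.
plan_send() {
    subvol=$1    # live subvolume, e.g. /var/lib/pvc
    snapdir=$2   # directory that holds local snapshots
    name=$3      # name of the new snapshot
    parent=$4    # previous snapshot name, or "" for a full send
    remote=$5    # SSH host that receives the stream
    target=$6    # directory on the remote Btrfs filesystem

    # btrfs send only accepts read-only snapshots, hence -r
    echo "btrfs subvolume snapshot -r $subvol $snapdir/$name"
    if [ -n "$parent" ]; then
        # -p sends only the difference to the parent, which must exist on both sides
        echo "btrfs send -p $snapdir/$parent $snapdir/$name | ssh $remote btrfs receive $target"
    else
        echo "btrfs send $snapdir/$name | ssh $remote btrfs receive $target"
    fi
}
```

Calling plan_send /var/lib/pvc /snap pvc.2 pvc.1 pi /mnt/backup prints the snapshot command followed by an incremental send that uses pvc.1 as the parent.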


Three Layers, Three Jobs

Layer 1: snapper — instant local rollback

snapper creates timeline snapshots (hourly, daily, weekly) for all 8 relevant subvolumes. The snapshots are directly accessible as read-only directories under <mountpoint>/.snapshots/<nr>/snapshot/ — no mounting, no service interruption. When a Flux reconcile breaks something, a rollback is:

snapper -c root rollback 116
reboot

For app data, it’s even simpler — just copy the files back:

cp -a /var/lib/pvc/.snapshots/42/snapshot/<pvc-dir>/. /var/lib/pvc/<pvc-dir>/
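A slightly more defensive version of that copy could look like the sketch below. The helper names snapshot_path and restore_dir are made up for illustration; the only thing taken from the setup above is the <mountpoint>/.snapshots/<nr>/snapshot/ layout.

```shell
#!/bin/sh
# Sketch: restore one directory from a snapper snapshot.

# Path under which snapper exposes snapshot <nr> of the subvolume at <mountpoint>.
snapshot_path() {
    printf '%s/.snapshots/%s/snapshot\n' "${1%/}" "$2"
}

# Copy one directory back from the snapshot; fails if the source does not exist.
restore_dir() {
    mountpoint=$1
    nr=$2
    dir=$3
    src="$(snapshot_path "$mountpoint" "$nr")/$dir"
    [ -d "$src" ] || { echo "no such snapshot directory: $src" >&2; return 1; }
    mkdir -p "$mountpoint/$dir"
    # -a preserves ownership, modes and timestamps;
    # the trailing /. copies the contents, including dotfiles
    cp -a "$src/." "$mountpoint/$dir/"
}
```

For example, restore_dir /var/lib/pvc 42 <pvc-dir> mirrors the one-liner above. Note that cp only overwrites and adds files; anything created after the snapshot stays in place, and it is probably wise to scale the affected workload down first.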

snapper also integrates with GRUB via grub-btrfs, which adds every snapshot to the boot menu automatically. If the system no longer boots, you can start a snapshot directly from that menu; if it still runs, you can roll back from the running system without touching the boot menu at all.

Layer 2: btrbk — remote backup to the Raspberry Pi

btrbk takes Btrfs snapshots and sends them incrementally to the Raspberry Pi over SSH through a WireGuard tunnel. Every subvolume is transferred separately, so the Pi holds an independent, point-in-time-consistent copy of each one.

Since snapper handles local rollback, local btrbk snapshots are kept minimal: only the latest snapshot is retained as a parent reference for the next incremental send. Without a local parent snapshot, btrbk falls back to a full send — slow and bandwidth-heavy. With one, the daily delta is typically just a few hundred megabytes.
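To make the retention split concrete, here is a sketch that writes a btrbk.conf along these lines: only the latest snapshot locally, and the 24 hourly / 8 daily / 5 weekly policy on the Pi. The host name pi.example, the key path, the mount point /mnt/btrfs-root, and the directory names are assumptions for illustration, as is the idea that all eight subvolumes sit directly under one top-level volume.

```shell
#!/bin/sh
# Sketch: write an illustrative btrbk.conf to the path given as $1.
# Host, key path and directories are placeholders, not the real config.
write_btrbk_conf() {
    cat > "$1" <<'EOF'
timestamp_format        long
ssh_identity            /etc/btrbk/ssh/id_ed25519

# Local side: keep only the newest snapshot as parent for the next send
snapshot_preserve_min   latest
snapshot_preserve       no

# Pi side: 24 hourly, 8 daily, 5 weekly
target_preserve_min     no
target_preserve         24h 8d 5w

volume /mnt/btrfs-root
  snapshot_dir          btrbk_snapshots
  target                ssh://pi.example/mnt/backup/vps
  subvolume             root
  subvolume             var_lib_pvc
  subvolume             var_lib_microshift
  subvolume             data
  subvolume             home
  subvolume             var
  subvolume             var_log
  subvolume             var_lib_containers
EOF
}
```

A file written this way can be checked without touching any data via btrbk -c <file> dryrun, which shows what would be snapshotted, sent, and deleted.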

Layer 3: pre-update snapshots — automatic safety net before every change

A GitHub Action triggers before every Flux reconcile: it SSHes into the server, suspends Flux, runs snap-all "vor-flux-update-YYYYMMDD-HHMMSS", then resumes Flux. The reconcile only happens after the snapshot exists.
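The server-side part of that sequence might look roughly like the sketch below. It assumes Flux is suspended and resumed with the flux CLI on all kustomizations in the default namespace, which may not match the actual workflow. The FLUX and SNAP_ALL variables are only there so the commands can be swapped for echo in a dry run.

```shell
#!/bin/sh
# Sketch of the pre-update step as it might run on the server over SSH.
# FLUX and SNAP_ALL can be overridden (e.g. with "echo") for a dry run.
FLUX=${FLUX:-flux}
SNAP_ALL=${SNAP_ALL:-snap-all}

# Snapshot description in the form vor-flux-update-YYYYMMDD-HHMMSS
snapshot_name() {
    date +vor-flux-update-%Y%m%d-%H%M%S
}

pre_update_snapshot() {
    $FLUX suspend kustomization --all || return 1
    # If the snapshot fails, Flux stays suspended on purpose:
    # nothing gets reconciled without a snapshot to fall back on.
    $SNAP_ALL "$(snapshot_name)" || return 1
    $FLUX resume kustomization --all
}
```

Leaving Flux suspended on failure is a design choice in this sketch: it trades availability of deployments for the guarantee that every reconcile has a snapshot behind it.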

snap-all is a small script that creates a numbered snapshot across all 8 snapper configs in one call. Numbered snapshots use --cleanup-algorithm number — they count against NUMBER_LIMIT and are eventually pruned, but they’re not part of the timeline rotation, so they don’t interfere with the hourly/daily/weekly schedule.
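A minimal sketch of such a helper follows. It assumes the snapper config names match the subvolume names used in this post, which is a guess; the real script may differ. SNAPPER is overridable so the loop can be dry-run with echo.

```shell
#!/bin/sh
# Sketch of a snap-all style helper: one numbered snapshot per snapper config.
# Config names are assumed to equal the subvolume names.
CONFIGS="root var_lib_pvc var_lib_microshift data home var var_log var_lib_containers"
SNAPPER=${SNAPPER:-snapper}

snap_all() {
    desc=$1
    [ -n "$desc" ] || { echo "usage: snap_all <description>" >&2; return 2; }
    failed=0
    for cfg in $CONFIGS; do
        # --cleanup-algorithm number makes the snapshot subject to NUMBER_LIMIT pruning
        $SNAPPER -c "$cfg" create --cleanup-algorithm number --description "$desc" \
            || { echo "snapshot failed for config $cfg" >&2; failed=1; }
    done
    return "$failed"
}
```

The loop keeps going after a failure so that one broken config does not leave the remaining subvolumes without a snapshot, and the non-zero return value still lets the caller abort the update.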


What Gets Backed Up

Subvolume          | Contains                                                | Recoverable without backup?
-------------------+---------------------------------------------------------+----------------------------------------------------
root               | OS, packages, /etc                                      | Reinstall + reconfigure (painful)
var_lib_pvc        | All app PVCs (Nextcloud, mail, Vaultwarden, Paperless…) | No
var_lib_microshift | Cluster state, etcd, kubeconfig                         | No
data               | GitOps repo, scripts                                    | Git history on GitHub: yes, but local scripts lost
home               | User files                                              | Mostly reproducible, but annoying
var                | System state, boot backup                               | Mostly rebuildable
var_log            | Logs                                                    | Rebuildable (empty)
var_lib_containers | Container images                                        | Rebuildable (re-pull), but slow

The three subvolumes intentionally excluded — var_cache, var_tmp, var_lib_kubelet — are either ephemeral or trivially rebuildable.


Monitoring: the check script

A daily systemd timer runs check-backup-btrfs-subvolumes.sh. It does two things:

  1. Checks every Btrfs subvolume against btrbk.conf — alerts if a new subvolume was created but not added to the backup config
  2. Checks every subvolume against /etc/snapper/configs/ — alerts if snapper coverage is missing

If the check fails, the unit’s OnFailure=check-btrbk-notify.service directive starts a service that sends a Telegram message. The goal: no new subvolume silently goes unprotected.
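The core of such a check is a set difference between "subvolumes that exist" and "subvolumes that are covered". The sketch below shows only that comparison; the function names are invented, and the real check-backup-btrfs-subvolumes.sh may work differently. The inputs are newline-separated names.

```shell
#!/bin/sh
# Sketch of the coverage check: every subvolume must appear in both lists.

# Print every name from $1 that does not occur as a whole line in $2.
missing_from() {
    printf '%s\n' "$1" | while IFS= read -r name; do
        [ -n "$name" ] || continue
        printf '%s\n' "$2" | grep -qxF -- "$name" || printf '%s\n' "$name"
    done
}

# $1 = existing subvolumes, $2 = names in btrbk.conf, $3 = snapper config names.
# Returns non-zero if anything is uncovered, which is what lets systemd
# mark the unit as failed and trigger OnFailure=.
check_coverage() {
    rc=0
    m=$(missing_from "$1" "$2")
    [ -z "$m" ] || { echo "not in btrbk.conf:" $m; rc=1; }
    m=$(missing_from "$1" "$3")
    [ -z "$m" ] || { echo "no snapper config:" $m; rc=1; }
    return "$rc"
}

# One possible way to gather the inputs (assumptions, adjust to the real layout):
#   subvols=$(btrfs subvolume list / | awk '{print $NF}' | grep -v '\.snapshots')
#   btrbk=$(awk '$1 == "subvolume" {print $2}' /etc/btrbk/btrbk.conf)
#   snapper=$(ls /etc/snapper/configs)
```

The intentionally excluded subvolumes would need to be filtered out of the first list, for example with an explicit allow-list, so they do not raise an alert every day.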


What Full Recovery Looks Like

If the VM disappears completely:

  1. Provision a fresh Fedora 43 VM
  2. Restore the original partition table from the Pi backup (it’s stored in the var snapshot under /var/lib/boot-backup/)
  3. Create the filesystems with the original UUIDs — fstab and GRUB reference them by UUID, so they must match
  4. Run btrfs receive for all subvolumes from the Pi into the new filesystem
  5. Restore /boot and /boot/efi from the boot backup
  6. Run grub2-install and grub2-mkconfig inside a chroot
  7. Touch /.autorelabel and reboot — SELinux relabels everything on first boot

The Pi holds 24 hourly + 8 daily + 5 weekly snapshots per subvolume, giving a five-week recovery window. You can restore to any retained snapshot in that window by choosing an older one in step 4; the granularity drops from hourly to daily to weekly as snapshots age.
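For a feel of what step 4 involves per subvolume, the sketch below prints the two commands for pulling one snapshot back and turning it into the live subvolume. The host, paths, and snapshot name are placeholders, and the homelab docs remain the reference for the actual sequence.

```shell
#!/bin/sh
# Sketch of step 4 for a single subvolume; prints commands instead of running them.
plan_restore() {
    remote=$1      # SSH host of the Pi, e.g. pi.example (placeholder)
    backup_dir=$2  # directory on the Pi that holds the received snapshots
    mnt=$3         # mount point of the new, empty Btrfs top level
    subvol=$4      # subvolume name, e.g. var_lib_pvc
    snap=$5        # chosen snapshot on the Pi, e.g. var_lib_pvc.20250101T0300

    echo "ssh $remote btrfs send $backup_dir/$snap | btrfs receive $mnt/"
    # Received snapshots are read-only; a writable snapshot of it
    # becomes the subvolume the system actually mounts.
    echo "btrfs subvolume snapshot $mnt/$snap $mnt/$subvol"
}
```

Picking an older recovery point just means passing a different snapshot name as the last argument.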

The full command sequence is in the homelab docs.


What I’d Do Differently

The Raspberry Pi as a backup target is a single point of failure. It’s on the same home network, same power circuit, same physical location. For a true disaster (house fire, theft), it fails with the rest. A second backup copy on off-site storage (Backblaze B2, Hetzner Storage Box) would close that gap. I haven’t added it yet because piping btrfs send streams to third-party storage adds attack surface that is non-trivial to secure.

btrfs send doesn’t encrypt. The Pi backup is stored unencrypted on the Pi’s local filesystem. For a remote/cloud backup target, encryption would be mandatory. One option: pipe the send stream through age before storing it. Another: a LUKS-encrypted external volume on the Pi.
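The age variant could be sketched as follows. This is an untested idea and not part of the current setup; the recipient key, identity file, and file names are placeholders. The functions only print the pipelines.

```shell
#!/bin/sh
# Sketch: encrypt a send stream with age before it leaves the machine.
plan_encrypted_send() {
    snap=$1       # read-only snapshot to send
    recipient=$2  # age public key (age1...), placeholder
    out=$3        # destination file for the encrypted stream
    echo "btrfs send $snap | age -r $recipient -o $out"
}

# Reverse direction: decrypt with the private identity and feed btrfs receive.
plan_encrypted_restore() {
    file=$1       # encrypted stream file
    identity=$2   # age identity (private key) file
    target=$3     # directory on a Btrfs filesystem to receive into
    echo "age -d -i $identity $file | btrfs receive $target"
}
```

The trade-off is that the target then stores opaque stream files rather than received subvolumes, so it cannot apply or prune incrementals on its own; every restore has to replay the full chain of streams. The LUKS option avoids that because the Pi still runs btrfs receive as before.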

Recovery has never been tested end-to-end. The individual restore commands have been run in isolation. A full VM-loss drill — provision, receive, boot — has not. The documentation exists; the confidence that it works as written does not. That’s the next item on the list.