Self-Hosted Cloud on a VPS: Fedora Server + MicroShift + GitOps


A complete walkthrough of running a Kubernetes-based self-hosted stack on a single VPS — WireGuard VPN, GitOps with Flux, Btrfs snapshots, wildcard TLS, and a full application suite including Nextcloud, Vaultwarden, Paperless, a mail stack, and monitoring. Everything bootstrapped from scratch. Every step documented.

This is written for a Linux administrator who wants to replicate the setup. All personal identifiers have been replaced with generic placeholders.

Last updated: 2026-05-15


Changelog

What changed on 2026-05-15 (2):

  • GitHub repository renamed from gitops to homelab
  • Manifest folder renamed from gitops/ to configuration/ inside the repository; all Flux path references updated accordingly

What changed on 2026-05-15:

  • snapper — extended to all 8 btrbk subvolumes (added var, var_log, var_lib_containers); all configs use identical retention (24h/8d/5w); snap-all updated accordingly
  • btrbk — local retention reduced: snapshot_preserve_min latest, snapshot_preserve 2h 1d 0w; snapper handles local rollback, btrbk local snapshots only serve as parent reference for incremental send; remote retention (Pi) unchanged
  • check-backup-btrfs-subvolumes.sh — renamed from check-btrbk-subvolumes.sh; extended to also verify every subvolume has a snapper config (not just btrbk)
  • pre-flux-snapshot.sh — now uses snap-all (all 8 configs) instead of snapper -c root create
  • rspamd — memory limit raised from 512Mi to 768Mi; rspamd gets OOMKilled on every node reboot due to a startup memory spike exceeding the old limit
  • postfixadmin — lifecycle postStart hook removed; the hook patched config.local.php to inject encrypt = system, but POSTFIXADMIN_ENCRYPT=system is natively supported by library/postfixadmin:4.0.1-apache; the hook caused a restart on every node reboot (exit code 124, startup race: DB not yet ready → until loop hangs → Kubernetes timeout)
  • grafana-operator — leader election disabled (leaderElect: false); the operator was crashing ~3×/day (exit code 2) because the kube-apiserver lease renewal timed out during brief etcd compaction pauses; leader election is pointless on a single-node cluster

What changed on 2026-05-12:

  • kindnet CNI — removed the broken kindnet-fix.service workaround; MicroShift 4.21 ships its own internal kindnet manifests with the wrong CIDR; fix: kustomizePaths override excludes /usr/lib/microshift/manifests.d/000-microshift-kindnet/
  • snapper — extended to 5 subvolumes (root, home, data, var_lib_pvc, var_lib_microshift); NUMBER_LIMIT for root reduced to 20; snap-all convenience script
  • fail2ban — 3 new jails: sshd-unknown (maxretry=1, 7d), dovecot-unknown (maxretry=1, 7d), postfix-rcpt-unknown (maxretry=3, 1h findtime, 7d)
  • Postfix / Dovecot: hostPort replaced by hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet so fail2ban sees real client IPs; log-tailer sidecar for kubectl logs
  • acme-dns — port 53 now via ClusterIP Service with externalIPs instead of hostPort
  • Flux — version 2.8.3 → 2.8.6; all custom scripts centralised in /data/scripts/ (Btrfs-backed) with symlinks in /usr/local/bin/
  • MicroShift config: apiServer.logLevel: Warning + auditLog limits (200 MB / 3 files / 7 days)
  • podman-image-cleanup — new weekly script (Mon 03:00) with Telegram report

Why MicroShift?

Standard Kubernetes (kubeadm, k3s, k0s) on a single VPS works, but MicroShift brings some specific advantages for an edge/single-node scenario:

  • Minimal footprint: Ships without the heavy components (etcd cluster, controller-manager HA). Uses CRI-O and an embedded etcd.
  • OpenShift semantics: Security Context Constraints (SCC) instead of PodSecurityAdmission. More expressive, but requires explicit SCC bindings for every workload.
  • HAProxy ingress included: openshift-ingress/router-default handles TLS termination out of the box.
  • OVN-Kubernetes by default — but we replace it with kube-kindnet (simpler, single-node appropriate).

The tradeoff: SCCs require additional boilerplate for every service account, and kindnet’s default POD_SUBNET doesn’t match MicroShift’s pod CIDR — a one-line config override fixes this permanently.


Architecture Overview

graph TD
  Internet["Internet<br/>443/80 HTTPS/HTTP · 25/587/993 Mail · 51820/udp WireGuard"]

  subgraph VPS["Fedora Server 43 · 8 vCPU · 16 GiB · Btrfs pool"]
    subgraph K8S["MicroShift 4.21 · OKD/SCOS"]
      pihole["pihole — DNS server"]
      vault["vaultwarden — Password manager"]
      paper["paperless — Document management"]
      nc["nextcloud — File sync"]
      collab["collabora — Online Office"]
      mail["mailstack — Mail server"]
      mon["monitoring — Grafana"]
      hp["homepage — Static website"]
      cert["cert-manager — TLS automation"]
      acme["acme-dns — ACME DNS-01"]

      pihole ~~~ vault ~~~ paper ~~~ nc ~~~ collab ~~~ mail ~~~ mon ~~~ hp ~~~ cert ~~~ acme
    end
    WG["WireGuard wg0 · 10.0.0.1/24"]
    acme ~~~ WG
  end

  Pi["Raspberry Pi · 10.0.0.12<br/>btrbk remote backup"]

  Internet --> pihole
  WG --> Pi

GitOps: All Kubernetes manifests live in a GitHub repository (github.com/youruser/homelab). Flux CD watches the repo and applies changes automatically. A GitHub Action creates a Btrfs snapshot before every push.

TLS: One wildcard certificate from Let’s Encrypt covers all domains. cert-manager + acme-dns handle DNS-01 challenges. A systemd timer syncs the certificate to every app namespace weekly.

Backups: btrbk creates hourly snapshots of all Btrfs subvolumes and sends them via SSH to a Raspberry Pi. grub-btrfs registers every snapshot in the GRUB menu for easy rollback.


Prerequisites

Resource       Value
OS             Fedora Server 43 (fresh install)
vCPUs          8
RAM            16 GiB
Disk           ≥ 512 GiB (Btrfs)
Network        Public IPv4, ports 22/80/443/51820 reachable
Access         Root SSH with key auth
DNS            A domain you control at a registrar (for ACME and ingress hostnames)
GitHub         Account with a private homelab repository
Grafana Cloud  Free tier account (for monitoring)
Telegram       Bot token + chat ID (for notifications; optional but recommended)

The Raspberry Pi (remote backup target) is optional but strongly recommended for disaster recovery.


Step 1 — Base System

SSH Hardening

Move SSH to a non-standard port and disable password auth:

sed -i 's/^#Port 22/Port 222/' /etc/ssh/sshd_config
grep -q "^PermitRootLogin" /etc/ssh/sshd_config \
  && sed -i 's/^PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config \
  || echo "PermitRootLogin prohibit-password" >> /etc/ssh/sshd_config
grep -q "^PasswordAuthentication" /etc/ssh/sshd_config \
  && sed -i 's/^PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config \
  || echo "PasswordAuthentication no" >> /etc/ssh/sshd_config

# SELinux: allow sshd on port 222
dnf install -y policycoreutils-python-utils
semanage port -a -t ssh_port_t -p tcp 222

# Copy your authorized_keys, then:
systemctl restart sshd

Firewall exceptions for port 222 come in Step 3.
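
Before restarting sshd it can help to confirm that the edited file parses and that the effective settings are the intended ones. The following is a minimal sketch, not part of the original setup: check_sshd_effective is a hypothetical helper that reads the output of sshd -T (which prints lower-case keywords) on stdin.

```shell
# Hypothetical helper: reads `sshd -T` output on stdin and checks two settings.
check_sshd_effective() {
  local cfg
  cfg=$(cat)
  # sshd -T prints one "keyword value" pair per line, keywords in lower case
  grep -qx 'port 222' <<< "$cfg" || { echo "port is not 222"; return 1; }
  grep -qx 'passwordauthentication no' <<< "$cfg" || { echo "password auth still enabled"; return 1; }
  echo "ok"
}
```

Possible usage on the server: sshd -t && sshd -T | check_sshd_effective. Keeping the current SSH session open while testing a second login on port 222 avoids locking yourself out.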

etckeeper

Version-control /etc from day one:

dnf install -y etckeeper
etckeeper init
etckeeper commit "initial fedora server setup"

All subsequent /etc changes should be followed by etckeeper commit "<description>". The git log becomes your authoritative change history.

Base Packages

dnf install -y \
  vim-enhanced htop jq \
  podman btrfs-progs \
  fail2ban wireguard-tools \
  snapper btrbk \
  dnf5-automatic \
  policycoreutils-python-utils

Sysctl Tuning

cat > /etc/sysctl.d/99-microshift.conf <<'EOF'
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 16384
EOF

cat > /etc/sysctl.d/90-redis.conf <<'EOF'
vm.overcommit_memory = 1
EOF

cat > /etc/sysctl.d/90-wireguard.conf <<'EOF'
net.ipv4.ip_forward = 1
EOF

sysctl --system

vm.overcommit_memory=1 is recommended by Redis (used by Rspamd and Nextcloud); without it Redis logs a startup warning and background saves can fail under memory pressure. The inotify limits are needed by MicroShift/Flux at scale.

Telegram Notifications (optional)

Create a Telegram bot via @BotFather, start a chat and note your chat ID (use @userinfobot). Store credentials in /etc/telegramrc:

install -m 600 /dev/stdin /etc/telegramrc <<'EOF'
TOKEN=<YOUR_BOT_TOKEN>
CHATID=<YOUR_CHAT_ID>
EOF

sendtelegram.sh — reads /etc/telegramrc and sends a message via the Telegram Bot API:

#!/bin/bash
# Usage: sendtelegram.sh [-c configfile] [-t token] [-i chatid] [-p parsemode] [-m message] [-v]

while getopts ":c:t:i:p:m:v" opt; do
    case "$opt" in
        c) CONFIGFILE=$OPTARG ;;  t) TOKEN_ARG=$OPTARG ;;
        i) CHATID_ARG=$OPTARG ;;  m) TEXT=$OPTARG ;;
        p) PARSEMODE_ARG=$OPTARG ;; v) VERBOSE=1 ;;
    esac
done

# Load credentials: an explicit config file wins, otherwise /etc/telegramrc
if [ -n "$CONFIGFILE" ]; then . "$CONFIGFILE"
elif [ -f /etc/telegramrc ]; then . /etc/telegramrc; fi

# Command-line values override the config file
if [ -n "$TOKEN_ARG" ]; then TOKEN=$TOKEN_ARG; fi
if [ -n "$CHATID_ARG" ]; then CHATID=$CHATID_ARG; fi

URL="https://api.telegram.org/bot$TOKEN/sendMessage"
# --data-urlencode keeps '&', '+' and newlines in the message text intact
ARGS=(-d "chat_id=$CHATID" -d "disable_web_page_preview=1" --data-urlencode "text=$TEXT")
[ -n "${PARSEMODE_ARG:-}" ] && ARGS+=(-d "parse_mode=$PARSEMODE_ARG")
curl -s --max-time 10 "${ARGS[@]}" "$URL" > /dev/null

install -m 755 sendtelegram.sh /usr/local/bin/sendtelegram.sh
sendtelegram.sh -m "Base system ready"

All other scripts call sendtelegram.sh or source /etc/telegramrc directly.

Automatic Updates

cat > /etc/dnf/automatic.conf <<'EOF'
[commands]
upgrade_type = default
random_sleep = 0
download_updates = yes
apply_updates = yes
reboot = when-needed
reboot_command = "shutdown -r +5 'Reboot after automatic update'"

[emitters]
emit_via = stdio
EOF

systemctl enable --now dnf5-automatic.timer

Mask passim

passim is a P2P cache for fwupd — useless on a headless VPS:

systemctl mask passim

etckeeper commit "base system: ssh/222, packages, sysctl, telegram, dnf-automatic"
# Needs the snapper "root" config created in Step 2; run it once that config exists
snapper -c root create -d "base-system-ready"

Step 2 — Btrfs Storage Layout

Why Separate Subvolumes?

Granular subvolumes allow btrbk to snapshot and send only what changed. You can exclude volatile data (/var/cache, /var/tmp) from backups while still including critical data (/var/lib/microshift, PVCs, GitOps working tree).

Subvolume Creation

Fedora installs / as a root subvolume. Mount the top-level and add the rest:

mkdir -p /mnt/btrfs-top
mount -o subvolid=5 /dev/vda3 /mnt/btrfs-top

for sv in var home var_log var_cache var_tmp \
          var_lib_microshift var_lib_pvc var_lib_containers var_lib_kubelet \
          data btrbk_snapshots; do
  btrfs subvolume create /mnt/btrfs-top/$sv
done

umount /mnt/btrfs-top

/etc/fstab Extension

Get your Btrfs UUID with blkid /dev/vda3, then add to /etc/fstab:

UUID=<btrfs-uuid> /var                btrfs  subvol=var,compress=zstd:1  0 0
UUID=<btrfs-uuid> /home               btrfs  subvol=home,compress=zstd:1 0 0
UUID=<btrfs-uuid> /var/log            btrfs  subvol=var_log              0 0
UUID=<btrfs-uuid> /var/cache          btrfs  subvol=var_cache            0 0
UUID=<btrfs-uuid> /var/tmp            btrfs  subvol=var_tmp              0 0
UUID=<btrfs-uuid> /var/lib/microshift btrfs  subvol=var_lib_microshift   0 0
UUID=<btrfs-uuid> /var/lib/pvc        btrfs  subvol=var_lib_pvc          0 0
UUID=<btrfs-uuid> /var/lib/containers btrfs  subvol=var_lib_containers   0 0
UUID=<btrfs-uuid> /var/lib/kubelet    btrfs  subvol=var_lib_kubelet      0 0
UUID=<btrfs-uuid> /data               btrfs  subvol=data                 0 0
UUID=<btrfs-uuid> /mnt/btrfs-top      btrfs  subvolid=5,noauto           0 0

mkdir -p /var/lib/{microshift,pvc,containers,kubelet} /data /mnt/btrfs-top
systemctl daemon-reload && mount -a

SELinux Context for PVCs

The local-path-provisioner init container (busybox) cannot run chcon. Set the SELinux label from the host:

semanage fcontext -a -t container_file_t "/var/lib/pvc(/.*)?"
restorecon -Rv /var/lib/pvc

snapper — Timeline Snapshots

Eight subvolumes get timeline snapshots — all with identical retention (24h/8d/5w, numbered limit 20):

# Root
snapper -c root create-config /
snapper -c root set-config \
  TIMELINE_CREATE=yes TIMELINE_CLEANUP=yes \
  TIMELINE_LIMIT_HOURLY=24 TIMELINE_LIMIT_DAILY=8 \
  TIMELINE_LIMIT_WEEKLY=5 TIMELINE_LIMIT_MONTHLY=0 TIMELINE_LIMIT_YEARLY=0 \
  NUMBER_CLEANUP=yes NUMBER_LIMIT=20

# All other subvolumes — same retention
for CFG_SUBVOL in \
  "home:/home" "data:/data" \
  "var_lib_pvc:/var/lib/pvc" "var_lib_microshift:/var/lib/microshift" \
  "var:/var" "var_log:/var/log" "var_lib_containers:/var/lib/containers"; do
  CFG="${CFG_SUBVOL%%:*}"
  SUBVOL="${CFG_SUBVOL##*:}"
  snapper -c "$CFG" create-config "$SUBVOL"
  snapper -c "$CFG" set-config \
    TIMELINE_CREATE=yes TIMELINE_CLEANUP=yes \
    TIMELINE_LIMIT_HOURLY=24 TIMELINE_LIMIT_DAILY=8 \
    TIMELINE_LIMIT_WEEKLY=5 TIMELINE_LIMIT_MONTHLY=0 TIMELINE_LIMIT_YEARLY=0 \
    NUMBER_CLEANUP=yes NUMBER_LIMIT=20
done

systemctl enable --now snapper-timeline.timer snapper-cleanup.timer

snap-all — convenience script that creates a numbered snapshot across all eight configs at once (used before risky changes):

cat > /usr/local/bin/snap-all <<'EOF'
#!/bin/bash
set -euo pipefail
DESC="${1:?Usage: snap-all <description>}"
for cfg in root home data var_lib_pvc var_lib_microshift var var_log var_lib_containers; do
  printf "  %-22s ... " "$cfg"
  snapper -c "$cfg" create --cleanup-algorithm number --description "$DESC"
  echo "ok"
done
EOF
chmod 755 /usr/local/bin/snap-all

Usage: snap-all "before-risky-change" — creates one snapshot per config, all with --cleanup-algorithm number so they count against NUMBER_LIMIT and are eventually pruned automatically.

grub-btrfs — Rollback from GRUB

grub-btrfs is not in Fedora repos — build from source:

dnf install -y make gettext
git clone https://github.com/Antynea/grub-btrfs /root/git/grub-btrfs
cd /root/git/grub-btrfs && make install

# Fedora-specific paths
sed -i \
  -e 's|#GRUB_BTRFS_GRUB_DIRNAME=.*|GRUB_BTRFS_GRUB_DIRNAME="/boot/grub2"|' \
  -e 's|#GRUB_BTRFS_SCRIPT_CHECK=.*|GRUB_BTRFS_SCRIPT_CHECK=grub2-script-check|' \
  /etc/default/grub-btrfs/config

systemctl enable --now grub-btrfsd.service
grub2-mkconfig -o /boot/grub2/grub.cfg

The daemon watches /.snapshots via inotify and adds new snapshots to the GRUB menu automatically.

btrbk — Snapshots + Remote Backup

cat > /etc/btrbk/btrbk.conf <<'EOF'
timestamp_format        long
# Local: keep only the latest snapshot as parent reference for incremental send
# snapper handles local rollback, btrbk local snapshots are minimal
snapshot_preserve_min   latest
snapshot_preserve       2h 1d 0w
# Remote (Pi): full retention history
target_preserve_min     1h
target_preserve         24h 8d 5w

ssh_identity            /root/.ssh/id_ed25519

volume /mnt/btrfs-top
  snapshot_dir  btrbk_snapshots

  subvolume root
  subvolume var
  subvolume home
  subvolume var_log
  subvolume var_lib_pvc
  subvolume var_lib_microshift
  subvolume var_lib_containers
  subvolume data

  target send-receive   ssh://backupuser@10.0.0.12/backup/btrfs/server
EOF

systemctl enable --now btrbk.timer

The remote target (Raspberry Pi) is only reachable after WireGuard is configured (Step 3). Until then, btrbk creates local snapshots only.

⚠️ Do NOT Enable Btrfs Quotas

This is a hard-won lesson: Btrfs quotas and etcd are incompatible.

When quotas are enabled, every snapshot deletion (btrbk and snapper-cleanup run hourly/daily) triggers a qgroup rescan. With hundreds of snapshots across many subvolumes, this rescan can block Btrfs metadata operations — including fsync — for up to 15 minutes.

etcd fails with DeadlineExceeded if fsync stalls for more than 5 seconds → kube-apiserver hangs → MicroShift crashes. The MicroShift restart loop persists until the rescan completes, then crashes again at the next btrbk run.

Symptom: MicroShift crashes at the same time every hour, exactly 15 minutes after btrbk runs. journalctl -u microshift -n 100 shows etcd fsync errors.

Fix: btrfs quota disable /

btrfs-list (a useful subvolume overview tool) works without quotas — the REFER/EXCL columns show empty, but all other data is present.

Why Btrfs has this problem (and ZFS doesn’t): ZFS encodes block ownership in the block itself at write time (birth_txg). Space accounting is O(1) per write, always consistent, and snapshot deletion is processed asynchronously via a per-snapshot “deadlist” — no global rescan, never blocking. Btrfs introduced snapshots without built-in ownership metadata, so qgroups must walk backreferences retroactively to determine who owns what — expensive and blocking.

What about simple quotas (squota)? Since kernel 6.7, Btrfs offers an alternative mode (btrfs quota enable --simple /) that attributes extents permanently to their creating subvolume — no backref walking, no rescan, O(1) per operation. This is safe for etcd. The trade-off: snapshots show ~0 exclusive usage (all extents stay attributed to the original subvolume), so snapshot size measurement is not possible. For this setup, quotas remain disabled — squota doesn’t provide useful space accounting for snapshots either.

                             full qgroup               squota                    disabled  ZFS
How ownership is determined  backref walk (expensive)  creator subvolume (O(1))  -         birth_txg (O(1))
When accounting happens      retroactively (rescan)    inline                    -         inline, atomic
Snapshot delete              blocking rescan           no rescan                 -         async deadlist
Numbers always correct       no (inconsistent flag)    no (snapshots ~0)         -         yes
etcd-safe                    no                        yes                       yes       yes
Snapshot size measurable     yes                       no                        no        yes
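
Since simple quotas depend on kernel 6.7 or newer, a version check can guard any experiment with them. This is a minimal sketch, assuming GNU sort with -V is available; version_at_least and kernel_supports_squota are made-up names for illustration.

```shell
# Return 0 when version $2 is at least version $1 (relies on GNU sort -V).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Hypothetical check: simple quotas (squota) need kernel 6.7 or newer.
# Accepts a release string such as the output of `uname -r`.
kernel_supports_squota() {
  local release="${1:-$(uname -r)}"
  # Strip the distribution suffix, e.g. "-200.fc43.x86_64"
  version_at_least 6.7 "${release%%-*}"
}
```

Called without arguments, kernel_supports_squota inspects the running kernel. It only reports whether the feature can exist; it does not change the recommendation to keep quotas disabled on this setup.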

/boot Backup

cat > /usr/local/sbin/boot-backup.sh <<'EOF'
#!/bin/bash
set -euo pipefail
DEST=/var/lib/boot-backup
mkdir -p "$DEST/boot" "$DEST/efi"
rsync -aAX --delete /boot/     "$DEST/boot/"
rsync -aAX --delete /boot/efi/ "$DEST/efi/"
sfdisk --dump /dev/vda > "$DEST/partition-table.sfdisk"
sgdisk --backup="$DEST/partition-table.sgdisk" /dev/vda
EOF
chmod 755 /usr/local/sbin/boot-backup.sh

# systemd timer (daily + after every dnf5-automatic run)
# See full unit files in the repo
systemctl enable --now boot-backup.timer

Step 3 — Firewall & WireGuard

firewalld Zones

The setup uses three zones:

Zone          Interface/Source         Purpose
FedoraServer  ens3 (public NIC)        External traffic, explicit whitelist
internal      wg0 (WireGuard)          VPN clients: trusted, full access
trusted       10.42.0.0/16 (Pod CIDR)  Pod-to-pod + kubelet traffic

# FedoraServer zone: remove defaults, add only what's needed
firewall-cmd --permanent --zone=FedoraServer --remove-service=cockpit
firewall-cmd --permanent --zone=FedoraServer --remove-service=dhcpv6-client

# Custom SSH service on port 222
firewall-cmd --permanent --new-service=myssh
firewall-cmd --permanent --service=myssh --add-port=222/tcp
firewall-cmd --permanent --zone=FedoraServer --add-service=myssh
firewall-cmd --permanent --zone=FedoraServer --remove-service=ssh

# HTTP(S) for ingress
firewall-cmd --permanent --zone=FedoraServer --add-port=80/tcp
firewall-cmd --permanent --zone=FedoraServer --add-port=443/tcp

# Mail ports
for p in 25 587 993; do
  firewall-cmd --permanent --zone=FedoraServer --add-port=${p}/tcp
done

# WireGuard
firewall-cmd --permanent --zone=FedoraServer --add-port=51820/udp

# acme-dns (DNS-01 challenge server)
firewall-cmd --permanent --zone=FedoraServer --add-port=53/tcp
firewall-cmd --permanent --zone=FedoraServer --add-port=53/udp

# NAT for WireGuard clients
firewall-cmd --permanent --zone=FedoraServer --add-masquerade

# internal zone — wg0 interface
firewall-cmd --permanent --zone=internal --add-interface=wg0
for p in 25 53 80 222 443 587 993 6443 8000 10250; do
  firewall-cmd --permanent --zone=internal --add-port=${p}/tcp
done
firewall-cmd --permanent --zone=internal --add-port=53/udp

# trusted zone — Pod CIDR
firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16
firewall-cmd --permanent --zone=trusted --add-source=169.254.169.1/32
firewall-cmd --permanent --zone=trusted --add-port=30000-32767/tcp

Cross-Zone Forwarding Policies

firewalld requires explicit policies for cross-zone forwarding — plain rules don’t cover forwarded packets:

# WireGuard clients → Internet (NAT)
firewall-cmd --permanent --new-policy=wg-to-internet
firewall-cmd --permanent --policy=wg-to-internet --add-ingress-zone=internal
firewall-cmd --permanent --policy=wg-to-internet --add-egress-zone=FedoraServer
firewall-cmd --permanent --policy=wg-to-internet --set-target=ACCEPT

# WireGuard clients → Pods (via HAProxy DNAT)
firewall-cmd --permanent --new-policy=wg-to-cluster
firewall-cmd --permanent --policy=wg-to-cluster --add-ingress-zone=internal
firewall-cmd --permanent --policy=wg-to-cluster --add-egress-zone=trusted
firewall-cmd --permanent --policy=wg-to-cluster --set-target=ACCEPT

firewall-cmd --reload

Critical: The --add-interface=wg0 for the internal zone must be --permanent. Without it, after a reload the policy zone assignment breaks and VPN clients lose cluster access.

WireGuard Server

cd /etc/wireguard && umask 077
wg genkey | tee server.key | wg pubkey > server.pub

/etc/wireguard/wg0.conf:

[Interface]
Address    = 10.0.0.1/24
ListenPort = 51820
PrivateKey = <CONTENTS OF server.key>

PostUp   = iptables -A FORWARD -i %i -j ACCEPT; iptables -A FORWARD -o %i -j ACCEPT
PostDown = iptables -D FORWARD -i %i -j ACCEPT; iptables -D FORWARD -o %i -j ACCEPT

[Peer]
# Example client
PublicKey  = <client-pubkey>
AllowedIPs = 10.0.0.2/32

# Add one [Peer] block per device

Client configuration (use the WireGuard app):

  • Endpoint: yourserver.example.com:51820
  • AllowedIPs: 0.0.0.0/0 (route all traffic through VPN)
  • DNS: 10.0.0.1 (Pihole, deployed later; fall back to 1.1.1.1 initially)

chmod 600 /etc/wireguard/wg0.conf
systemctl enable --now wg-quick@wg0
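
For clients that import a file instead of typing values into the app, the settings listed above can be rendered into a config. This is only a sketch: make_client_conf is a hypothetical helper, the /32 client address mirrors the example peer entry, and both key arguments are placeholders you supply.

```shell
# Hypothetical generator for a WireGuard client config.
# $1 = client private key, $2 = server public key, $3 = client IP (default 10.0.0.2)
make_client_conf() {
  local client_key="$1" server_pub="$2" client_ip="${3:-10.0.0.2}"
  cat <<EOF
[Interface]
PrivateKey = $client_key
Address    = $client_ip/32
DNS        = 10.0.0.1

[Peer]
PublicKey  = $server_pub
Endpoint   = yourserver.example.com:51820
AllowedIPs = 0.0.0.0/0
EOF
}
```

Example: make_client_conf "$(wg genkey)" "$(cat /etc/wireguard/server.pub)" 10.0.0.3 > phone.conf. The matching [Peer] block with the client's public key still has to be added to wg0.conf on the server.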

fail2ban

Phase 1: Base config (SSH only, before mailstack)

cat > /etc/fail2ban/jail.local <<'EOF'
[DEFAULT]
bantime  = 86400
findtime = 259200
maxretry = 3
backend  = auto
action   = firewallcmd-rich-rules[actiontype=<multiport>]

[sshd]
enabled  = true
port     = 222
logpath  = %(sshd_log)s
EOF

systemctl enable --now fail2ban

Phase 2: After mailstack deployment

The mail logs live inside a PVC mounted at /var/lib/pvc/<pvc-uuid>_mailstack_mail-logs/. Get the UUID:

kubectl get pvc -n mailstack mail-logs -o jsonpath='{.spec.volumeName}'

Add to /etc/fail2ban/jail.local (replace <PVC_UUID> with the value above):

[postfix]
enabled  = true
port     = smtp,submission
backend  = polling
logpath  = /var/lib/pvc/<PVC_UUID>_mailstack_mail-logs/postfix.log

[postfix-sasl]
enabled  = true
port     = smtp,submission
backend  = polling
logpath  = /var/lib/pvc/<PVC_UUID>_mailstack_mail-logs/postfix.log

[dovecot]
enabled  = true
port     = imaps
backend  = polling
logpath  = /var/lib/pvc/<PVC_UUID>_mailstack_mail-logs/dovecot.log

[postfix-sasl-unknown]
enabled        = true
filter         = postfix-sasl-unknown
action         = firewallcmd-rich-rules[actiontype=<allports>]
ignorecommand  = /usr/local/bin/fail2ban-sasl-mysql-check.sh <ip>
backend        = polling
logpath        = /var/lib/pvc/<PVC_UUID>_mailstack_mail-logs/postfix.log
maxretry       = 1
bantime        = 604800
findtime       = 86400

[sshd-unknown]
enabled        = true
port           = 222
backend        = systemd
maxretry       = 1
bantime        = 604800

[dovecot-unknown]
enabled        = true
port           = imaps
filter         = dovecot
backend        = polling
logpath        = /var/lib/pvc/<PVC_UUID>_mailstack_mail-logs/dovecot.log
maxretry       = 1
bantime        = 604800

[postfix-rcpt-unknown]
enabled        = true
port           = smtp,submission
filter         = postfix
backend        = polling
logpath        = /var/lib/pvc/<PVC_UUID>_mailstack_mail-logs/postfix.log
maxretry       = 3
findtime       = 3600
bantime        = 604800
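
Pasting the PVC name into seven logpath lines by hand is error-prone. One option is to keep the jail definitions above in a template and fill the placeholder with a filter. A minimal sketch: fill_pvc_uuid is a hypothetical helper, and the template file name in the usage note is an assumption.

```shell
# Hypothetical filter: replaces every <PVC_UUID> placeholder on stdin
# with the volume name passed as $1 (e.g. the output of the kubectl query).
fill_pvc_uuid() {
  local volume="$1"
  sed "s|<PVC_UUID>|${volume}|g"
}
```

Possible usage, with mail-jails.template as an assumed file holding the jail blocks: PV=$(kubectl get pvc -n mailstack mail-logs -o jsonpath='{.spec.volumeName}'); fill_pvc_uuid "$PV" < mail-jails.template >> /etc/fail2ban/jail.local.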

postfix-sasl-unknown — smart SASL jail

This custom jail bans on the first failed SASL login, but only if the attempted username does not exist in the mailbox database. Legitimate users who mistype their password from a new IP are not banned; automated scanners using random addresses are blocked immediately for 7 days.

Custom filter /etc/fail2ban/filter.d/postfix-sasl-unknown.conf:

[INCLUDES]
before = common.conf

[Definition]
failregex = warning: \S+\[<HOST>\](?::\d+)?: SASL \S+ authentication failed: .*, sasl_username=<F-USER>\S+</F-USER>
ignoreregex =
datepattern = %%b %%d %%H:%%M:%%S
              {^LN-BEG}

MySQL check script /usr/local/bin/fail2ban-sasl-mysql-check.sh (return 0 = do NOT ban, return 1 = ban):

#!/bin/bash
# ignorecommand helper: exit 0 = known mailbox (do NOT ban), exit 1 = ban
IP="$1"
LOG="/var/lib/pvc/<PVC_UUID>_mailstack_mail-logs/postfix.log"
source /etc/fail2ban/.mysql-sasl-check.conf

# Last username this IP tried (SASL_USER avoids overwriting the shell's $USER)
SASL_USER=$(grep -F "[$IP]" "$LOG" | grep "SASL.*authentication failed" | \
    grep -oP 'sasl_username=\K\S+' | tail -1)

# No username, or not a plain e-mail address: let the ban proceed
[ -z "$SASL_USER" ] && exit 1
echo "$SASL_USER" | grep -qP '^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$' || exit 1

SAFE_USER="${SASL_USER//\'/\'\'}"
COUNT=$(mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -p"$MYSQL_PASS" "$MYSQL_DB" -sN \
    -e "SELECT COUNT(*) FROM mailbox WHERE username='$SAFE_USER' AND active=1" 2>/dev/null)

[ "$COUNT" = "1" ] && exit 0 || exit 1

chmod 755 /usr/local/bin/fail2ban-sasl-mysql-check.sh

MySQL credentials in /etc/fail2ban/.mysql-sasl-check.conf (chmod 600). The MYSQL_HOST is the cluster IP of the maildb service:

# MYSQL_HOST: kubectl get svc -n mailstack maildb -o jsonpath='{.spec.clusterIP}'
install -m 600 /dev/stdin /etc/fail2ban/.mysql-sasl-check.conf <<'EOF'
MYSQL_HOST=<maildb-cluster-ip>
MYSQL_USER=postfix
MYSQL_PASS=<postfix-db-password>
MYSQL_DB=postfix
EOF

systemctl restart fail2ban
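
The username extraction used by the check script can be tried in isolation before trusting it with bans. The sketch below repeats the same grep pipeline as a function reading log lines from stdin; last_sasl_user is a hypothetical name, the sample log line is made up, and GNU grep with -P support is assumed.

```shell
# Hypothetical helper mirroring the check script: prints the last
# sasl_username that the given IP tried, reading log lines from stdin.
last_sasl_user() {
  local ip="$1"
  grep -F "[$ip]" | grep "SASL.*authentication failed" \
    | grep -oP 'sasl_username=\K\S+' | tail -1
}

# Made-up sample line in the shape the filter expects
SAMPLE='May 15 10:00:01 mail postfix/smtpd[123]: warning: unknown[203.0.113.7]: SASL LOGIN authentication failed: UGFzc3dvcmQ6, sasl_username=alice@example.com'
echo "$SAMPLE" | last_sasl_user 203.0.113.7
```

On the server, the real log can be piped in the same way to see what the script would look up for a given address.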

Auto-unban WireGuard clients

Prevent VPN lockout from SSH brute-force detection:

cat > /usr/local/bin/unban-fail2ban-clients.sh <<'EOF'
#!/bin/bash
for JAIL in $(fail2ban-client status | awk -F: '/Jail list/ {print $2}' | tr -d ',\t'); do
  for IP in $(fail2ban-client status "$JAIL" | awk -F: '/Banned IP list/ {print $2}'); do
    case "$IP" in
      10.0.0.*) fail2ban-client set "$JAIL" unbanip "$IP" ;;
    esac
  done
done
EOF
chmod 755 /usr/local/bin/unban-fail2ban-clients.sh

# /etc/systemd/system/unban-fail2ban-clients.timer
[Unit]
Description=Hourly unban of WireGuard IPs

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

systemctl enable --now unban-fail2ban-clients.timer
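
A systemd timer activates a service unit of the same name, which is not shown here. The following is a sketch of what that unit could look like, wrapped in a function so the target directory can be changed for testing; write_unban_service and the unit's Description are assumptions, not taken from the original repository.

```shell
# Hypothetical helper: writes the oneshot service that the timer activates.
# $1 = target directory (defaults to /etc/systemd/system)
write_unban_service() {
  local dir="${1:-/etc/systemd/system}"
  cat > "$dir/unban-fail2ban-clients.service" <<'EOF'
[Unit]
Description=Unban WireGuard IPs from all fail2ban jails
After=fail2ban.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/unban-fail2ban-clients.sh
EOF
}
```

On the server this would be followed by systemctl daemon-reload so systemd picks up the new unit before the timer fires.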

Step 4 — MicroShift

Install

dnf copr enable -y @redhat-et/microshift
dnf install -y microshift openshift-clients   # installs kubectl + oc

Configuration: Replace CNI and Storage

MicroShift defaults to OVN-Kubernetes (CNI) and TopoLVM (storage). Both are replaced:

install -m 644 /dev/stdin /etc/microshift/config.yaml <<'EOF'
network:
  cniPlugin: "none"
storage:
  driver: none
  optionalCsiComponents:
    - none
apiServer:
  logLevel: Warning
auditLog:
  maxFileSize: 200
  maxFiles: 3
  maxFileAge: 7
EOF

MicroShift 4.21+ ships internal kindnet manifests under /usr/lib/microshift/manifests.d/000-microshift-kindnet/ with POD_SUBNET=10.244.0.0/16 (kind’s default). cniPlugin: none disables OVN-K but not this internal kindnet set. Override kustomizePaths to exclude it:

install -m 644 /dev/stdin /etc/microshift/config.d/02-kindnet-paths.yaml <<'EOF'
manifests:
  kustomizePaths:
  - /usr/lib/microshift/manifests
  - /usr/lib/microshift/manifests.d/000-microshift-kube-proxy
  - /etc/microshift/manifests
  - /etc/microshift/manifests.d/*
EOF

This keeps kube-proxy from the system set while letting your own kindnet manifest (with the correct CIDR) be the sole source of truth.

CNI: kube-kindnet

Place the kindnet DaemonSet manifest in /etc/microshift/manifests/kindnet.yaml. The critical configuration:

# In the DaemonSet env section:
- name: POD_SUBNET
  value: "10.42.0.0/16"

Why not the default 10.244.0.0/16? MicroShift internally uses 10.42.0.0/16 for the pod CIDR. kindnet’s KIND-MASQ-AGENT chain uses POD_SUBNET to decide which traffic to masquerade. If the value doesn’t match, traffic that should keep its original source address is masqueraded, and anything that filters on client IP (e.g., Pihole’s whitelist logic) sees the wrong source.
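
To see why the two ranges cannot be swapped, it helps to test a pod address against both. The sketch below implements a plain IPv4 CIDR membership check in bash; ip_to_int and in_cidr are illustrative names, and the logic only shows the address arithmetic, not kindnet's actual rule generation.

```shell
# Convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 when address $1 lies inside CIDR block $2 (e.g. 10.42.0.0/16).
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( (ip & mask) == (net & mask) ))
}
```

For example, in_cidr 10.42.1.5 10.42.0.0/16 succeeds while in_cidr 10.42.1.5 10.244.0.0/16 fails, so a masquerade rule keyed on the wrong subnet does not recognise real pod addresses as pod traffic.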

kindnet manifest

The file /etc/microshift/manifests/kindnet.yaml contains two documents separated by --- (Namespace + DaemonSet). The critical env var is POD_SUBNET, set to 10.42.0.0/16 as shown above.

Storage: local-path-provisioner

Use the standard rancher/local-path-provisioner manifest with two Fedora-specific changes:

  1. Full registry paths: Fedora blocks short image names. rancher/local-path-provisioner:v0.x → docker.io/rancher/local-path-provisioner:v0.x, busybox:latest → docker.io/library/busybox:latest.
  2. Remove chcon from the setup script: busybox doesn’t have chcon. The SELinux label for /var/lib/pvc was set by semanage in Step 2 — no runtime chcon needed.

Add SCC bindings for privileged and hostmount-anyuid in the same manifest.
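
The first change can be applied mechanically when the upstream manifest is refreshed. A minimal sketch, assuming the manifest references the images as image: rancher/... and image: busybox...; qualify_images is a hypothetical filter name, and the file names in the usage note are placeholders.

```shell
# Hypothetical filter: rewrites short image names on stdin to fully
# qualified docker.io references. Already qualified names are left alone.
qualify_images() {
  sed -E \
    -e 's#(image:[[:space:]]*)rancher/#\1docker.io/rancher/#' \
    -e 's#(image:[[:space:]]*)busybox#\1docker.io/library/busybox#'
}
```

Possible usage: qualify_images < upstream.yaml > local-path-provisioner.yaml. The chcon removal from the second change still has to be done by hand.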

Start MicroShift

systemctl enable --now microshift

export KUBECONFIG=/var/lib/microshift/resources/kubeadmin/kubeconfig
echo 'export KUBECONFIG=/var/lib/microshift/resources/kubeadmin/kubeconfig' >> /root/.bashrc
echo 'alias k=kubectl' >> /root/.bashrc

kubectl get pods -A -w

Expected system pods once stable:

  • openshift-ingress/router-default — HAProxy
  • openshift-dns/dns-default + node-resolver
  • kube-kindnet/kube-kindnet-ds-*
  • kube-system/csi-snapshot-controller
  • local-path-storage/local-path-provisioner

Step 5 — GitOps with Flux CD

SSH Key for GitHub

ssh-keygen -t ed25519 -C "fedora-server" -f /root/.ssh/id_ed25519 -N ""
cat /root/.ssh/id_ed25519.pub
# → Add as Deploy Key (with write access) to github.com/youruser/homelab

SCC Bindings for Flux

MicroShift’s SCCs reject Flux’s default fsGroup: 1337. Create bindings before bootstrapping, so Flux pods can start:

cat > /etc/microshift/manifests/flux-scc.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: flux-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: flux-source-controller-scc
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:privileged
subjects:
  - kind: ServiceAccount
    name: source-controller
    namespace: flux-system
---
# Repeat for: kustomize-controller, helm-controller, notification-controller
EOF

systemctl restart microshift
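
The "Repeat for" comment in the manifest can be expanded with a short loop instead of copying the block by hand. This is a sketch under one assumption: the bindings for the other controllers follow the same flux-<controller>-scc naming as the source-controller example. flux_scc_bindings is a made-up function name.

```shell
# Hypothetical generator: prints one ClusterRoleBinding per Flux controller,
# following the flux-<controller>-scc naming of the example above.
flux_scc_bindings() {
  local ctrl
  for ctrl in source-controller kustomize-controller helm-controller notification-controller; do
    cat <<EOF
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: flux-${ctrl}-scc
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:privileged
subjects:
  - kind: ServiceAccount
    name: ${ctrl}
    namespace: flux-system
EOF
  done
}
```

The output can be appended after the Namespace document in /etc/microshift/manifests/flux-scc.yaml, replacing the hand-written binding and the comment.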

Flux Bootstrap

curl -s https://fluxcd.io/install.sh | FLUX_VERSION=2.8.6 bash -s /usr/local/bin

flux bootstrap git \
  --url=ssh://git@github.com/youruser/homelab \
  --branch=main \
  --path=configuration/ \
  --private-key-file=/root/.ssh/id_ed25519 \
  --silent

Clone the working tree to /data (backed by Btrfs, included in backups):

cd /data && git clone git@github.com:youruser/homelab.git .

Telegram Notifications for Flux Events

kubectl create secret generic telegram-secret -n flux-system \
  --from-literal=token="$(awk -F= '/^TOKEN=/{print $2}' /etc/telegramrc)"

Then create a Provider (type: telegram) and Alert resource in the flux-system namespace. Key config: use exclusionList to filter transient retry messages (otherwise every health-check failure sends a notification):

# Alert.spec.exclusionList:
- ".*Reconciliation in progress.*"
- ".*retry budget.*"
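
The effect of the two patterns can be previewed against sample event text before committing the Alert. A rough sketch: Flux evaluates exclusionList entries as Go regular expressions, while this check uses grep -E, which behaves the same for patterns this simple. is_excluded is a hypothetical helper and the sample messages in the usage note are invented.

```shell
# Hypothetical helper: returns 0 when a message matches one of the
# exclusionList patterns shown above, 1 otherwise.
is_excluded() {
  local msg="$1" pattern
  for pattern in ".*Reconciliation in progress.*" ".*retry budget.*"; do
    grep -Eq -- "$pattern" <<< "$msg" && return 0
  done
  return 1
}
```

For example, is_excluded "Reconciliation in progress for apps" succeeds (the event would be dropped), while is_excluded "health check failed" fails (the event would still be sent).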

Pre-Update Btrfs Snapshot via GitHub Action

Before every push to main, a GitHub Action SSHes to the server and creates a Btrfs snapshot. This gives you a clean rollback point if a Flux reconciliation breaks something.

Restricted SSH key (command-restricted, no PTY):

ssh-keygen -t ed25519 -f /root/.ssh/github-actions -N "" -C "gh-actions-snapshot"

KEY=$(cat /root/.ssh/github-actions.pub)
echo "command=\"/usr/local/bin/pre-flux-snapshot.sh\",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty $KEY" \
  >> /root/.ssh/authorized_keys

cat > /data/scripts/pre-flux-snapshot.sh <<'EOF'
#!/bin/bash
export KUBECONFIG=/var/lib/microshift/resources/kubeadmin/kubeconfig
flux suspend ks --all
# snap-all snapshots all eight snapper configs (see Step 2)
/usr/local/bin/snap-all "pre-flux-update-$(date +%Y%m%d-%H%M%S)"
flux resume ks --all
EOF
chmod 755 /data/scripts/pre-flux-snapshot.sh
ln -s /data/scripts/pre-flux-snapshot.sh /usr/local/bin/pre-flux-snapshot.sh

Store the private key as FEDORA_SSH_KEY secret in the GitHub repo.

Renovate Bot — Automatic Image Updates

Renovate runs as a GitHub Action every 6 hours and opens PRs for new container image tags. Configuration:

{
  "extends": ["config:base"],
  "automergeType": "pr",
  "prHourlyLimit": 2,
  "packageRules": [
    {
      "matchUpdateTypes": ["patch", "minor"],
      "automerge": true,
      "minimumReleaseAge": "3 days"
    },
    {
      "matchUpdateTypes": ["major"],
      "automerge": false
    }
  ]
}

The Btrfs snapshot GitHub Action runs before Renovate’s auto-merges, so every automated image update has a rollback point.


Step 6 — Wildcard TLS with cert-manager + acme-dns

Architecture

architecture-beta
  group certns(cloud)[cert_manager namespace]

  service cname(internet)[acme challenge CNAME]
  service acmedns(server)[acme_dns port 53] in certns
  service issuer(server)[ClusterIssuer DNS 01] in certns
  service secret(disk)[wildcard tls Secret] in certns
  service sync(server)[sync wildcard tls]
  service appns(disk)[app namespace secrets]

  cname:R --> L:acmedns
  issuer:T --> B:acmedns
  issuer:B --> T:secret
  secret:B --> T:sync
  sync:B --> T:appns

Why acme-dns instead of direct DNS-01? Most domain registrars don’t offer a usable DNS API for cert-manager. acme-dns is a minimal DNS server that only handles TXT records for ACME challenges — you point a CNAME at it once, and cert-manager does the rest via a stable API.

Deploy acme-dns via Flux

Create configuration/acme-dns/ manifests (Deployment, Service, PVC, ConfigMap) and apply via Flux. Key points:

  • Port 53 exposed via a ClusterIP Service with externalIPs: [<YOUR_SERVER_IP>] — kube-proxy creates a single DNAT rule directly on the public IP; no hostPort needed, RollingUpdate strategy works because the Service endpoint is managed by kube-proxy independently of the pod lifecycle
  • PVC for the SQLite credential database

Register one account per apex domain:

POD=$(kubectl get pod -n acme-dns -l app=acme-dns -o name)
for DOMAIN in example.com example.net; do
  # /register takes no domain; the label only shows which response belongs to which domain
  echo "== ${DOMAIN}"
  kubectl exec -n acme-dns "$POD" -- curl -s -X POST http://localhost:8081/register \
    -H 'Content-Type: application/json' -d "{}"
  echo
done
# Save each JSON response: it contains username, password, and fulldomain for that domain

Store as a Kubernetes secret:

install -m 600 <json-file> /etc/acme-dns/acmedns.json
kubectl create secret generic acmedns-credentials -n cert-manager \
  --from-file=acmedns.json=/etc/acme-dns/acmedns.json
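
cert-manager's acme-dns solver expects the account file to be one JSON object keyed by domain, with each value being that domain's registration response. A minimal sketch of assembling it in plain bash, assuming each response was saved to its own file (the file names and the helper name build_acmedns_json are hypothetical):

```shell
#!/usr/bin/env bash
# build_acmedns_json DOMAIN=FILE [DOMAIN=FILE ...]
# Prints one JSON object that maps each domain to the contents of its
# registration response. Each FILE must already contain valid JSON.
build_acmedns_json() {
  local pair domain file sep=""
  printf '{\n'
  for pair in "$@"; do
    domain="${pair%%=*}"
    file="${pair#*=}"
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    printf '%s  "%s": %s' "$sep" "$domain" "$(cat "$file")"
    sep=$',\n'
  done
  printf '\n}\n'
}
```

Usage: build_acmedns_json example.com=register-example.com.json example.net=register-example.net.json > acmedns.json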

CNAME Records (one-time)

At your registrar, for each apex domain:

_acme-challenge.example.com   CNAME   <fulldomain-from-json>

cert-manager and Certificate

Deploy cert-manager via Helm (Flux HelmRelease). The Certificate resource covers all your domains:

spec:
  secretName: wildcard-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "*.example.com"
    - example.com
    - "*.example.net"
    - example.net

Sync to App Namespaces

Kubernetes requires TLS secrets to be in the same namespace as the Ingress. Sync the wildcard secret weekly:

cat > /usr/local/bin/sync-wildcard-tls.sh <<'EOF'
#!/bin/bash
set -euo pipefail
export KUBECONFIG=/var/lib/microshift/resources/kubeadmin/kubeconfig

NAMESPACES=(pihole vaultwarden nextcloud paperless collabora mailstack homepage)

TMP=$(mktemp); trap 'rm -f "$TMP"' EXIT
kubectl get secret wildcard-tls -n cert-manager -o json > "$TMP"

for NS in "${NAMESPACES[@]}"; do
  jq --arg ns "$NS" '
    .metadata.name = ($ns + "-tls")
    | .metadata.namespace = $ns
    | del(.metadata.resourceVersion, .metadata.uid,
          .metadata.creationTimestamp, .metadata.ownerReferences,
          .metadata.managedFields)
  ' "$TMP" | kubectl apply -f -
done
EOF
chmod 755 /usr/local/bin/sync-wildcard-tls.sh

# systemd timer: every Monday 00:00
systemctl enable --now sync-wildcard-tls.timer
/usr/local/bin/sync-wildcard-tls.sh   # run immediately
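
The timer unit itself is not shown above. The following is a sketch of what the service and timer pair might look like, not the author's actual files; the Description texts and Persistent=true are assumptions. The helper writes both units into a directory of your choice.

```shell
#!/usr/bin/env bash
# write_sync_units DIR
# Writes a oneshot service and a weekly timer for sync-wildcard-tls.sh into DIR.
write_sync_units() {
  local dir="$1"
  cat > "$dir/sync-wildcard-tls.service" <<'EOF' || return 1
[Unit]
Description=Copy wildcard TLS secret to app namespaces

[Service]
Type=oneshot
ExecStart=/usr/local/bin/sync-wildcard-tls.sh
EOF
  cat > "$dir/sync-wildcard-tls.timer" <<'EOF'
[Unit]
Description=Weekly wildcard TLS sync

[Timer]
# Every Monday at 00:00; Persistent=true catches up after downtime (assumption)
OnCalendar=Mon *-*-* 00:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

Usage: write_sync_units /etc/systemd/system && systemctl daemon-reload, then enable the timer as shown above.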

When adding a new service with a TLS ingress: add its namespace to NAMESPACES=(), commit, and run the script once manually.


Step 7 — Services

All services follow the same pattern:

  1. Manifests in configuration/<name>/ + Kustomization CR configuration/<name>-ks.yaml
  2. Secrets created manually with kubectl create secret (never in git)
  3. flux reconcile kustomization <name> --with-source
  4. Verify with kubectl get pods -n <namespace>

Here are the notable per-service details:

Pi-hole

Runs as the DNS server for all WireGuard clients. Uses hostPort: 53 on the WireGuard interface so clients can point directly to 10.0.0.1 as their DNS server.

# configuration/pihole/deployment.yaml (excerpt)
strategy:
  type: Recreate    # prevents Pending state when hostPort: 53 is already bound
containers:
- name: pihole
  image: docker.io/pihole/pihole:latest
  env:
  - name: FTLCONF_dns_upstreams
    value: "8.8.8.8;8.8.4.4"
  - name: FTLCONF_dns_listeningMode
    value: "ALL"
  - name: FTLCONF_webserver_api_password
    valueFrom:
      secretKeyRef:
        name: pihole-password
        key: FTLCONF_webserver_api_password
  ports:
  - containerPort: 53
    hostPort: 53
    hostIP: 10.0.0.1        # bind to WireGuard interface only
    protocol: UDP
    name: dns-udp
  - containerPort: 53
    hostPort: 53
    hostIP: 10.0.0.1
    protocol: TCP
    name: dns-tcp
  - containerPort: 80
    name: http
  securityContext:
    privileged: true        # required for FTL (NET_ADMIN)
  volumeMounts:
  - name: pihole-data
    mountPath: /etc/pihole
  - name: dnsmasq-custom
    mountPath: /etc/dnsmasq.d/99-custom.conf
    subPath: 99-custom.conf

# configuration/pihole/ingress.yaml
metadata:
  annotations:
    route.openshift.io/termination: "edge"
    haproxy.router.openshift.io/ip_whitelist: "10.0.0.0/24 <CORP_IP_RANGE>"
spec:
  tls:
  - hosts: [dnsconfig.example.com]
    secretName: pihole-tls
  rules:
  - host: dnsconfig.example.com

# configuration/pihole/scc.yaml — required for privileged workloads
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pihole-privileged-scc
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:privileged
subjects:
- kind: ServiceAccount
  name: default
  namespace: pihole

After deploying: add local DNS entries in Pi-hole for all services (so service.example.com resolves to 10.0.0.1 for VPN clients, bypassing public DNS).
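
One way to keep those entries in Git is the mounted 99-custom.conf. The sketch below generates dnsmasq address= lines for a list of hostnames. Whether the author uses address= lines or Pi-hole's own local DNS records is not stated, so treat this as one possible approach; the helper name is hypothetical. Note that address=/name/ip also matches subdomains of name, and that Pi-hole v6 reportedly only reads /etc/dnsmasq.d when its misc.etc_dnsmasq_d setting is enabled, so check that setting.

```shell
#!/usr/bin/env bash
# local_dns_entries IP HOST [HOST ...]
# Prints one dnsmasq "address=" line per hostname, all pointing at IP.
local_dns_entries() {
  local ip="$1" host
  shift
  for host in "$@"; do
    printf 'address=/%s/%s\n' "$host" "$ip"
  done
}
```

Usage: local_dns_entries 10.0.0.1 cloud.example.com dnsconfig.example.com > 99-custom.conf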

Vaultwarden

Small footprint: 2 Gi PVC, 100 Mi memory limit. Access restricted to VPN + any additional IP ranges.

# configuration/vaultwarden/deployment.yaml (excerpt)
strategy:
  type: Recreate   # Deployment-level field; important: only one writer at a time
containers:
- name: vaultwarden
  image: docker.io/vaultwarden/server:1.35.7
  env:
  - name: WEBSOCKET_ENABLED
    value: "true"
  - name: ROCKET_ADDRESS
    value: "0.0.0.0"
  resources:
    limits:
      memory: 100Mi
  volumeMounts:
  - name: data
    mountPath: /data

# configuration/vaultwarden/ingress.yaml (excerpt)
annotations:
  haproxy.router.openshift.io/ip_whitelist: "10.0.0.0/24 <CORP_IP_RANGE>"
  haproxy.router.openshift.io/timeout: "300s"

Admin token: openssl rand -base64 48 | tr -d '\n'

Paperless-NGX

  • Four PVCs: data (PostgreSQL), media, consume, export
  • Critical for restore: always restore paperless-data and paperless-media from the same point in time — they must stay in sync

Nextcloud + MariaDB

# configuration/nextcloud/deployment.yaml (excerpt)
containers:
- name: nextcloud
  image: docker.io/nextcloud:33-apache
  env:
  - name: MYSQL_HOST
    value: mariadb
  - name: MYSQL_DATABASE
    value: owncloud
  - name: MYSQL_USER
    value: owncloud
  - name: MYSQL_PASSWORD
    valueFrom:
      secretKeyRef:
        name: nextcloud-secret
        key: MYSQL_PASSWORD
  - name: NEXTCLOUD_TRUSTED_DOMAINS
    value: cloud.example.com
  - name: REDIS_HOST
    value: redis
  - name: PHP_UPLOAD_LIMIT
    value: 10G
  - name: PHP_MEMORY_LIMIT
    value: 1G
  - name: APACHE_DISABLE_REWRITE_IP
    value: "1"
  - name: TRUSTED_PROXIES
    value: 10.42.0.0/16    # pod CIDR — required for HAProxy reverse-proxy headers
  resources:
    limits:
      memory: 2Gi

# configuration/nextcloud/ingress.yaml
annotations:
  haproxy.router.openshift.io/timeout: "300s"
  haproxy.router.openshift.io/proxy-body-size: "10g"
  haproxy.router.openshift.io/hsts_header: max-age=15552000;includeSubDomains;preload

Two CronJobs in the same file — Nextcloud’s background job runner (every 5 min) and automatic app updates (daily at 03:00):

# configuration/nextcloud/cronjob.yaml
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nextcloud-cron
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          securityContext:
            runAsUser: 33    # www-data
            runAsGroup: 33
          containers:
          - name: nextcloud-cron
            image: docker.io/nextcloud:33-apache
            command: ["php", "-f", "/var/www/html/cron.php"]
            volumeMounts:
            - {name: html, mountPath: /var/www/html}
            - {name: data, mountPath: /var/www/html/data}
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nextcloud-app-update
spec:
  schedule: "0 3 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          securityContext:
            runAsUser: 33
            runAsGroup: 33
          containers:
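          - name: nextcloud-app-update
            image: docker.io/nextcloud:33-apache
            command: ["php", "-f", "/var/www/html/occ", "app:update", "--all"]
            # occ lives on the html volume, so this job needs the same mounts as nextcloud-cron
            volumeMounts:
            - {name: html, mountPath: /var/www/html}
            - {name: data, mountPath: /var/www/html/data}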

Mailstack

The most complex service — seven interdependent components:

architecture-beta
  group mailstack(cloud)[mailstack namespace]

  service db(database)[MariaDB] in mailstack
  service pa(server)[PostfixAdmin] in mailstack
  service postfix(server)[Postfix SMTP] in mailstack
  service dovecot(server)[Dovecot IMAP] in mailstack
  service rspamd(server)[Rspamd] in mailstack
  service redis(database)[Redis] in mailstack
  service clamav(server)[ClamAV] in mailstack

  db:T --> B:pa
  db:L --> R:postfix
  db:R --> L:dovecot
  postfix:T -- B:rspamd
  postfix:B --> T:dovecot
  rspamd:L --> R:redis
  rspamd:T --> B:clamav

Postfix: Uses hostNetwork: true so that connections arrive at Postfix with the real client IP — critical for fail2ban to see the actual source address. hostPort alone routes connections through the pod bridge (kindnet), which replaces the source IP with the bridge address (10.42.0.1). Config files (main.cf, master.cf, MySQL maps) are mounted from a ConfigMap. Mail logs are written to a shared PVC (mail-logs) that fail2ban reads from the host.

# configuration/mailstack/postfix-deployment.yaml (excerpt)
strategy:
  type: Recreate
template:
  spec:
    hostNetwork: true
    dnsPolicy: ClusterFirstWithHostNet
    containers:
    - name: postfix
      image: docker.io/boky/postfix:latest
      command: ["/usr/sbin/postfix", "-c", "/etc/postfix", "start-fg"]
      securityContext:
        privileged: true
      volumeMounts:
      - {name: config, mountPath: /etc/postfix/main.cf, subPath: main.cf}
      - {name: tls,    mountPath: /etc/ssl/mail,         readOnly: true}
      - {name: mail-logs, mountPath: /var/log/mail}
    - name: log-tailer   # sidecar: tails postfix.log to stdout for kubectl logs
      image: docker.io/library/alpine:3
      command: ["/bin/sh", "-c", "tail -F /var/log/mail/postfix.log"]
      volumeMounts:
      - {name: mail-logs, mountPath: /var/log/mail}
    - name: logrotate   # sidecar: daily rotation of postfix.log
      image: docker.io/library/alpine:3
      command: ["/bin/sh", "-c"]
      args: ["apk add logrotate -q && while true; do logrotate ... || { echo 'ERROR'; exit 1; }; sleep 86400; done"]
      volumeMounts:
      - {name: mail-logs, mountPath: /var/log/mail}

Dovecot 2.4 breaking changes (if migrating from 2.3):

  • MySQL connection config can no longer use %{env:VAR} in the mysql{} block — mount the credentials as a Kubernetes Secret file instead
  • encrypt = dovecot:SHA512-CRYPT renamed to encrypt = system
  • ssl_protocols removed, ssl_min_protocol is the new knob

# configuration/mailstack/dovecot-deployment.yaml (excerpt)
strategy:
  type: Recreate
template:
  spec:
    hostNetwork: true
    dnsPolicy: ClusterFirstWithHostNet
    containers:
    - name: dovecot
      image: docker.io/dovecot/dovecot:latest-root
      ports:
      - {containerPort: 993}   # metadata only — hostNetwork binds to host stack directly
      securityContext:
        privileged: true
      volumeMounts:
      - {name: config,    mountPath: /etc/dovecot/conf.d/10-auth.conf, subPath: 10-auth.conf}
      - {name: db-secret, mountPath: /etc/dovecot/conf.d/01-db-connection.conf,
         subPath: 01-db-connection.conf, readOnly: true}   # MySQL credentials as secret file
      - {name: vmail,     mountPath: /home/vmail}
      - {name: tls,       mountPath: /etc/dovecot/ssl, readOnly: true}
      - {name: mail-logs, mountPath: /var/log/mail}
    - name: log-tailer   # sidecar: tails dovecot.log to stdout for kubectl logs
      image: docker.io/library/alpine:3
      command: ["/bin/sh", "-c", "tail -F /var/log/mail/dovecot.log"]
      volumeMounts:
      - {name: mail-logs, mountPath: /var/log/mail}

Rspamd:

# configuration/mailstack/rspamd-deployment.yaml (excerpt)
metadata:
  annotations:
    prometheus.io/scrape: "true"    # Alloy picks this up for Grafana Cloud
    prometheus.io/port: "11334"
    prometheus.io/path: "/metrics"
containers:
- name: rspamd
  image: docker.io/rspamd/rspamd:4.0.1
  securityContext:
    privileged: true
  ports:
  - {containerPort: 11332, name: milter, hostPort: 11332}
  - {containerPort: 11334, name: controller}
  resources:
    requests:
      memory: 128Mi
    limits:
      memory: 768Mi    # raised from 512Mi — startup spike causes OOMKill on node reboot
  livenessProbe:
    httpGet: {path: /ping, port: 11334}   # use httpGet, NOT tcpSocket
  readinessProbe:
    httpGet: {path: /ping, port: 11334}
  volumeMounts:
  - {name: config,    mountPath: /etc/rspamd/local.d}
  - {name: dkim-keys, mountPath: /etc/rspamd/dkim, readOnly: true}

Key Rspamd config: in local.d/dkim_signing.conf, set try_fallback = false to prevent signing with the wrong key when a domain’s DKIM key is missing. In the controller worker config (local.d/worker-controller.inc), set secure_ip to the pod bridge IP (10.42.0.1) so Grafana Alloy can scrape /metrics without authentication.
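
As a sketch, the two overrides could look like this on disk. In this setup they would be entries in the ConfigMap mounted at /etc/rspamd/local.d. Placing secure_ip in worker-controller.inc follows Rspamd's usual layout for controller options and is an assumption about the author's files; the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# write_rspamd_overrides DIR
# Writes the two local.d override files discussed above into DIR.
write_rspamd_overrides() {
  local dir="$1"
  cat > "$dir/dkim_signing.conf" <<'EOF' || return 1
# Never sign with another key when the domain has no DKIM key of its own
try_fallback = false;
EOF
  cat > "$dir/worker-controller.inc" <<'EOF'
# Requests from the pod bridge address skip the controller password
secure_ip = "10.42.0.1";
EOF
}
```

Usage: write_rspamd_overrides ./local.d, then copy both files into the ConfigMap and restart Rspamd.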

ClamAV: Run freshclam as a shell loop sidecar, not as an init container (freshclam takes minutes — init containers block pod startup):

- name: freshclam
  image: docker.io/clamav/clamav:latest
  command: ["/bin/sh", "-c"]
  args: ["while true; do freshclam; sleep 3600; done"]

After deploying, activate fail2ban mail jails (see Step 3, Phase 2 above).

Monitoring (Grafana Cloud)

  • Grafana Alloy runs as a DaemonSet with hostNetwork: true — scrapes metrics from the node and Kubernetes API
  • Grafana Operator manages GrafanaDashboard CRDs and pushes them to Grafana Cloud
  • kube-state-metrics for Kubernetes object metrics — needs a custom SCC for the hostmount-anyuid service account
  • Important: The Grafana Operator reconciles dashboards every ~10 minutes and overwrites any UI changes. Always fix dashboards in the YAML, not the UI.
  • Disable leader election: Add leaderElect: false to the HelmRelease values. On a single-node cluster there is never a competing operator instance — leader election only adds unnecessary risk of crashes when the kube-apiserver has brief pauses (e.g. during etcd compaction).
  • Loki token scope: needs logs:write (created under Stack → Loki → Send Logs, not under Access Policies)

ClickHouse for Rspamd Analytics (optional)

ClickHouse receives per-message metadata from Rspamd and enables detailed spam/ham analysis in Grafana. Rspamd writes to ClickHouse via the HTTP interface; the Grafana datasource uses a suspended CRD approach with a one-time API call to inject credentials.


Step 8 — Operations & Automation

All maintenance scripts live in /data/scripts/ (Btrfs-backed, included in btrbk backups) with symlinks in /usr/local/bin/. They run as systemd oneshot services with timers:

Script | Timer | Purpose
sendtelegram.sh | none (library) | Send Telegram messages; called by other scripts
boot-backup.sh | daily + post-dnf5-automatic | Backup /boot, /boot/efi, partition table
check-backup-btrfs-subvolumes.sh | daily | Alert if any Btrfs subvolume lacks btrbk or snapper config
unban-fail2ban-clients.sh | hourly | Unban VPN client IPs from all fail2ban jails
pre-flux-snapshot.sh | SSH trigger (GitHub Action) | Btrfs snapshot before Flux reconcile
check-flux-update.sh | weekly | Telegram alert when a new Flux version is available
sync-wildcard-tls.sh | Mon 00:00 | Copy wildcard TLS secret to all app namespaces
system-check.sh | daily 08:00 | Telegram status: WG peers, pods, mail (24h), fail2ban, disk
podman-image-cleanup.sh | Mon 03:00 | Remove dangling + unused Podman images, Telegram report
snap-all | none (manual, before risky changes) | Snapshot all 8 snapper configs at once (root home data var_lib_pvc var_lib_microshift var var_log var_lib_containers)

system-check.sh

#!/bin/bash
export KUBECONFIG=/var/lib/microshift/resources/kubeadmin/kubeconfig
source /etc/telegramrc

send() {
  curl -s -X POST "https://api.telegram.org/bot${TOKEN}/sendMessage" \
    -d chat_id="${CHATID}" -d parse_mode="HTML" \
    --data-urlencode text="$1" > /dev/null
}

# WireGuard peer status
WG_PEERS=$(wg show wg0 | grep -c "latest handshake")
WG_RECENT=$(wg show wg0 | awk '/latest handshake/ {
  if ($0 ~ /second/ || $0 ~ /minute/) count++
  else if ($0 ~ /hour/) { match($0, /([0-9]+) hour/, a); if (a[1]+0 < 2) count++ }
} END{print count+0}')

# Kubernetes pod health
NOT_RUNNING=$(kubectl get pods -A --no-headers 2>/dev/null \
  | grep -v -E "Running|Completed" | wc -l)
NOT_RUNNING_LIST=$(kubectl get pods -A --no-headers 2>/dev/null \
  | grep -v -E "Running|Completed" \
  | awk '{print $1"/"$2" ("$4")"}' | head -10)
TOTAL_PODS=$(kubectl get pods -A --no-headers 2>/dev/null | grep -c "Running")

# Mail stats (last 24h from postfix logs)
# grep -c prints 0 itself when nothing matches but exits 1; "|| true" keeps the
# exit status clean without appending a second "0" the way "|| echo 0" would
MAIL_IN=$(kubectl logs -n mailstack deploy/mail-postfix --since=24h 2>/dev/null \
  | grep -c "postfix/lmtp.*status=sent" || true)
MAIL_OUT=$(kubectl logs -n mailstack deploy/mail-postfix --since=24h 2>/dev/null \
  | grep -c "postfix/smtp.*status=sent" || true)
MAIL_REJECT=$(kubectl logs -n mailstack deploy/mail-postfix --since=24h 2>/dev/null \
  | grep -c "NOQUEUE: reject" || true)
QUEUE_RAW=$(kubectl exec -n mailstack deploy/mail-postfix -- postqueue -p 2>/dev/null \
  | tail -1)
QUEUE=$(echo "$QUEUE_RAW" | grep -q "empty" && echo "empty" || echo "$QUEUE_RAW")

# fail2ban
BANNED_COUNT=0
for jail in $(fail2ban-client status 2>/dev/null \
  | grep "Jail list:" | sed 's/.*Jail list:\s*//' | tr ', ' '\n' | grep -v '^$'); do
  n=$(fail2ban-client status "$jail" 2>/dev/null \
    | grep "Currently banned:" | awk '{print $NF}')
  BANNED_COUNT=$((BANNED_COUNT + ${n:-0}))
done

DISK_FREE=$(df -h / | awk 'NR==2 {print $4 " of " $2 " free (" $5 " used)"}')

# Build status labels
[ "$WG_RECENT" -ge 1 ] \
  && WG_STATUS="Active: ${WG_RECENT}/${WG_PEERS} peers (handshake within 2h)" \
  && WG_ICON="OK" \
  || { WG_STATUS="No active peers!"; WG_ICON="WARN"; }

# $'\n' yields a real newline; "\n" inside double quotes would be sent literally
[ "$NOT_RUNNING" -eq 0 ] \
  && K8S_STATUS="${TOTAL_PODS} pods running" && K8S_ICON="OK" \
  || { K8S_STATUS="${NOT_RUNNING} pod(s) not running:"$'\n'"${NOT_RUNNING_LIST}"; K8S_ICON="WARN"; }

[ "$BANNED_COUNT" -gt 0 ] && F2B_ICON="WARN" || F2B_ICON="OK"

send "System check $(date '+%d.%m.%Y %H:%M')

WireGuard [${WG_ICON}]
${WG_STATUS}

Kubernetes [${K8S_ICON}]
${K8S_STATUS}

Mail, last 24h:
  Inbound delivered:    ${MAIL_IN}
  Outbound sent:        ${MAIL_OUT}
  Rejects:              ${MAIL_REJECT}
  Queue:                ${QUEUE}

fail2ban [${F2B_ICON}]
  Banned clients:       ${BANNED_COUNT}

Disk:
  ${DISK_FREE}"

Timer — once daily at 08:00:

# /etc/systemd/system/system-check.timer
[Unit]
Description=Daily system check at 08:00
[Timer]
OnCalendar=*-*-* 08:00:00
Persistent=false
[Install]
WantedBy=timers.target

sync-wildcard-tls.sh

#!/bin/bash
export KUBECONFIG=/var/lib/microshift/resources/kubeadmin/kubeconfig
NAMESPACES=(pihole vaultwarden nextcloud paperless collabora mailstack homepage)

for NS in "${NAMESPACES[@]}"; do
  export NS
  kubectl get secret wildcard-tls -n cert-manager -o json | python3 -c "
import json, sys, os
s = json.load(sys.stdin)
ns = os.environ['NS']
s['metadata']['namespace'] = ns
s['metadata']['name'] = ns + '-tls'
for k in ['resourceVersion','uid','creationTimestamp','managedFields',
          'annotations','ownerReferences','labels']:
    s['metadata'].pop(k, None)
print(json.dumps(s))
" | kubectl apply -f -
  echo "Synced wildcard-tls -> ${NS}/${NS}-tls"
done

pre-flux-snapshot.sh

#!/bin/bash
export KUBECONFIG=/var/lib/microshift/resources/kubeadmin/kubeconfig
DESCRIPTION="vor-flux-update-$(date +%Y%m%d-%H%M%S)"
flux suspend ks --all -n flux-system 2>/dev/null
snap-all "$DESCRIPTION"
flux resume ks --all -n flux-system 2>/dev/null
echo "Snapshots created: $DESCRIPTION"

check-backup-btrfs-subvolumes.sh

Verifies all Btrfs subvolumes are tracked in both btrbk.conf and snapper. Reports gaps via Telegram (triggered by systemd OnFailure):

#!/bin/bash
set -euo pipefail
BTRBK_CONF=/etc/btrbk/btrbk.conf
BTRFS_TOP=/mnt/btrfs-top

EXCLUDED=(btrbk_snapshots var_cache var_tmp var_lib_kubelet)
EXCLUDED_PATTERNS=("root-snap-pre-*")

MOUNTED_HERE=0
if ! mountpoint -q "$BTRFS_TOP" 2>/dev/null; then
  mount "$BTRFS_TOP"; MOUNTED_HERE=1
fi
trap '[[ $MOUNTED_HERE -eq 1 ]] && umount "$BTRFS_TOP" 2>/dev/null || true' EXIT

mapfile -t CURRENT_SUBVOLS < <(
  btrfs subvolume list "$BTRFS_TOP" \
    | awk '$7 == 5 && $NF !~ /\// { print $NF }' | sort)
mapfile -t CONFIGURED < <(
  grep -E '^\s+subvolume\s+' "$BTRBK_CONF" | awk '{print $2}' | sort)

is_excluded() {
  local sv="$1"
  for ex in "${EXCLUDED[@]}"; do [[ "$sv" == "$ex" ]] && return 0; done
  for pat in "${EXCLUDED_PATTERNS[@]}"; do [[ "$sv" == $pat ]] && return 0; done
  return 1
}

MISSING_BTRBK=()
MISSING_SNAPPER=()
for sv in "${CURRENT_SUBVOLS[@]}"; do
  is_excluded "$sv" && continue
  configured=0
  for conf in "${CONFIGURED[@]}"; do [[ "$sv" == "$conf" ]] && configured=1 && break; done
  [[ $configured -eq 0 ]] && MISSING_BTRBK+=("$sv")
  [[ -f "/etc/snapper/configs/$sv" ]] || MISSING_SNAPPER+=("$sv")
done

OVERALL_EXIT=0
[[ ${#MISSING_BTRBK[@]} -gt 0 ]] && { echo "WARN btrbk: ${MISSING_BTRBK[*]}"; OVERALL_EXIT=1; }
[[ ${#MISSING_SNAPPER[@]} -gt 0 ]] && { echo "WARN snapper: ${MISSING_SNAPPER[*]}"; OVERALL_EXIT=1; }
[[ $OVERALL_EXIT -eq 0 ]] && echo "OK: all subvolumes are configured in btrbk and snapper."
exit $OVERALL_EXIT

Output: daily system-check message

System check 19.04.2026 08:00

WireGuard [OK]
Active: 10/12 peers (handshake within 2h)

Kubernetes [OK]
45 pods running

Mail, last 24h:
  Inbound delivered:    31
  Outbound sent:        22
  Rejects:              3
  Queue:                empty

fail2ban [WARN]
  Banned clients:       2

Disk:
  290G of 610G free (52% used)

Step 9 — Backup & Disaster Recovery

Three independent layers protect the system. Each layer has a distinct job and they compose cleanly:

Layer | Tool | Scope | Where | Retention | Use case
Local timeline snapshots | snapper | All 8 subvolumes | On-disk .snapshots/ | 24h hourly · 8d daily · 5w weekly | Instant rollback without unmounting
Local btrbk snapshots | btrbk | All 8 subvolumes | btrbk_snapshots/ | Latest only (parent reference) | Basis for incremental send to Pi
Remote backup | btrbk → Pi | All 8 subvolumes | Raspberry Pi via WireGuard | 24h hourly · 8d daily · 5w weekly | Full recovery after VM loss

snapper handles local rollback; btrbk handles the remote copy. Local btrbk snapshots are kept minimal — just one per subvolume as the parent reference so the next send to the Pi remains incremental instead of a full transfer.

What is Backed Up

Subvolume | Mountpoint | snapper | btrbk → Pi | Contains
root | / | yes | yes | System, packages, /etc
home | /home | yes | yes | User home directories
data | /data | yes | yes | GitOps repo, scripts
var_lib_pvc | /var/lib/pvc | yes | yes | Kubernetes PVC data (app data)
var_lib_microshift | /var/lib/microshift | yes | yes | Cluster state, kubeconfig
var | /var | yes | yes | System state, boot backup
var_log | /var/log | yes | yes | Logs
var_lib_containers | /var/lib/containers | yes | yes | Container images

Not backed up intentionally: var_cache, var_tmp, var_lib_kubelet — ephemeral or rebuildable.

Pre-Update Snapshots

Before every GitOps update, a GitHub Action SSHs into the server and runs:

flux suspend ks --all -n flux-system
snap-all "vor-flux-update-$(date +%Y%m%d-%H%M%S)"
flux resume ks --all -n flux-system

snap-all creates one snapshot in each of the 8 snapper configs, all with the same description. If a Flux reconcile breaks something, you can return any or all subvolumes to the pre-update state within seconds, with no mount required: snapper -c root rollback <nr> for the root subvolume, and snapper -c <cfg> undochange <nr>..0 to revert the files of any other subvolume.
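
The snap-all helper itself is not listed in this walkthrough. A minimal sketch of how such a helper could work follows, assuming it simply loops over the 8 config names with snapper create. The SNAPPER override exists only so the loop can be dry-run with echo.

```shell
#!/usr/bin/env bash
# Sketch of a snap-all style helper: one snapshot per snapper config,
# all sharing the same description. Not the author's actual script.
SNAPPER="${SNAPPER:-snapper}"   # set SNAPPER=echo for a dry run
CONFIGS=(root home data var_lib_pvc var_lib_microshift var var_log var_lib_containers)

snap_all() {
  local desc="${1:-manual-$(date +%Y%m%d-%H%M%S)}" cfg failed=0
  for cfg in "${CONFIGS[@]}"; do
    # Keep going after a failure so one broken config does not block the rest
    if ! "$SNAPPER" -c "$cfg" create --description "$desc"; then
      echo "snapshot failed for config: $cfg" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Usage: snap_all "before-kernel-update". The function returns non-zero if any config failed, so a calling script can alert on it.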

Monitoring

check-backup-btrfs-subvolumes.sh runs daily via systemd timer. It verifies that every non-excluded Btrfs subvolume has both a btrbk entry and a snapper config. On failure, OnFailure=check-btrbk-notify.service sends a Telegram alert.

check-backup-btrfs-subvolumes.sh
# → OK: all subvolumes are configured in btrbk and snapper.
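
The OnFailure wiring can be sketched as two units. Only the name check-btrbk-notify.service comes from the text above. The name of the checking service, the Description lines, and the assumption that sendtelegram.sh takes the message as its first argument are all hypothetical.

```shell
#!/usr/bin/env bash
# write_check_units DIR
# Writes the check service and the notify service it triggers on failure.
write_check_units() {
  local dir="$1"
  cat > "$dir/check-backup-btrfs-subvolumes.service" <<'EOF' || return 1
[Unit]
Description=Verify btrbk and snapper coverage of Btrfs subvolumes
# Started by systemd when the script exits non-zero
OnFailure=check-btrbk-notify.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/check-backup-btrfs-subvolumes.sh
EOF
  cat > "$dir/check-btrbk-notify.service" <<'EOF'
[Unit]
Description=Telegram alert for a failed subvolume coverage check

[Service]
Type=oneshot
# Assumes sendtelegram.sh accepts the message text as its first argument
ExecStart=/usr/local/bin/sendtelegram.sh "Backup check failed: a subvolume lacks btrbk or snapper config"
EOF
}
```

Because the script exits with status 1 when it finds a gap, systemd marks the service as failed and starts the notify unit.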

Scenario A — Rollback (VM still running)

A1. /etc rollback via etckeeper

etckeeper commits every change to /etc as a git commit. Single-file recovery is a one-liner:

# See recent commits
etckeeper vcs log --oneline -10

# Restore a single file
etckeeper vcs checkout HEAD~1 -- /etc/microshift/config.yaml

A2. Root rollback via snapper

The fastest path for a broken system update is the GRUB menu. grub-btrfs adds every snapper snapshot automatically:

1. reboot
2. In the GRUB menu: "Fedora Linux snapshots"
3. Select the snapshot (e.g. "vor-flux-update-20260515-093000")
4. System boots read-only into the snapshot
5. If it works — make the rollback permanent:
snapper rollback   # sets snapshot as new default subvolume
reboot             # boots into the restored, writable system

Without rebooting into the snapshot first:

snapper -c root list
snapper -c root rollback 116   # rolls back root subvolume
reboot

A3. App data rollback from snapper

snapper snapshots are directly accessible as read-only directories — no mounting required:

# For any subvolume, the snapshots are here:
# /var/lib/pvc/.snapshots/<nr>/snapshot/
# /home/.snapshots/<nr>/snapshot/
# etc.

# Example: restore a PVC directory from snapshot 42
snapper -c var_lib_pvc list
kubectl scale deploy vaultwarden -n vaultwarden --replicas=0
cp -a /var/lib/pvc/.snapshots/42/snapshot/<pvc-dir>/. /var/lib/pvc/<pvc-dir>/
kubectl scale deploy vaultwarden -n vaultwarden --replicas=1

For apps with multiple PVCs (Nextcloud, Paperless), always restore all related PVCs from the same snapshot in one operation, so the database and the files it references stay consistent.
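
A sketch of restoring several PVC directories from one snapshot number in a single call. The helper name and the PVC_ROOT override are hypothetical. It uses the same cp -a approach as above, so files created after the snapshot are not removed. Scale the workload to zero first, as in the example above.

```shell
#!/usr/bin/env bash
PVC_ROOT="${PVC_ROOT:-/var/lib/pvc}"

# restore_pvc_dirs SNAPSHOT_NR DIR [DIR ...]
# Copies every listed directory back from the same snapper snapshot.
restore_pvc_dirs() {
  local nr="$1" dir src
  shift
  # First pass: refuse to start unless every directory exists in the snapshot
  for dir in "$@"; do
    src="$PVC_ROOT/.snapshots/$nr/snapshot/$dir"
    [ -d "$src" ] || { echo "missing in snapshot $nr: $dir" >&2; return 1; }
  done
  # Second pass: copy each directory back from that one snapshot
  for dir in "$@"; do
    src="$PVC_ROOT/.snapshots/$nr/snapshot/$dir"
    mkdir -p "$PVC_ROOT/$dir"
    cp -a "$src/." "$PVC_ROOT/$dir/" || return 1
  done
}
```

Usage: restore_pvc_dirs 42 <paperless-data-dir> <paperless-media-dir>. The up-front check means nothing is copied if one of the directories is missing from the snapshot.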

A4. Individual subvolume from btrbk snapshot

If the snapper snapshot is too recent or you need a specific point in the btrbk window:

mount /mnt/btrfs-top
btrbk list snapshots   # find the right snapshot name

systemctl stop microshift

# Save current state
btrfs subvolume snapshot /mnt/btrfs-top/var_lib_pvc \
  /mnt/btrfs-top/btrbk_snapshots/var_lib_pvc.recovery-backup

# Replace with snapshot
umount /var/lib/pvc
btrfs subvolume delete /mnt/btrfs-top/var_lib_pvc
btrfs subvolume snapshot \
  /mnt/btrfs-top/btrbk_snapshots/var_lib_pvc.20260515T0200 \
  /mnt/btrfs-top/var_lib_pvc
mount /var/lib/pvc

systemctl start microshift
umount /mnt/btrfs-top

Scenario B — Full VM Loss (restore from Pi)

B1. Provision new VM, boot Fedora LiveISO

dnf install -y btrfs-progs rsync gdisk

B2. Restore partition table

# Fetch partition table from Pi backup
scp mko@10.0.0.12:/backup/btrfs/server/var/<snapshot>/lib/boot-backup/partition-table.sfdisk /tmp/

# Restore (same disk size)
sfdisk /dev/vda < /tmp/partition-table.sfdisk

B3. Create filesystems with original UUIDs

mkfs.vfat -F32 /dev/vda1
mkfs.xfs -f /dev/vda2
xfs_admin -U bafddb4f-cda4-4aae-9d4b-b75da20e3680 /dev/vda2

mkfs.btrfs -f \
  -L "fedora_v220241250910305995" \
  -U 9017b7d7-c4e6-45a3-9c04-48f25cd640fd \
  /dev/vda3

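The UUIDs must match because they are hardcoded in /etc/fstab and the GRUB config. The commands above set them only for the XFS /boot partition and the Btrfs filesystem; if /etc/fstab also references the EFI partition by UUID, pass its volume ID to mkfs.vfat with -i.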
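
If the original UUIDs were not written down, they can be read from the backed-up /etc/fstab before creating the filesystems. A small sketch follows; the helper name is hypothetical, and where the fstab copy comes from (for example the root snapshot on the Pi) is an assumption.

```shell
#!/usr/bin/env bash
# fstab_uuids FILE
# Prints "UUID MOUNTPOINT FSTYPE" for every fstab entry that mounts by UUID.
fstab_uuids() {
  awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1, $2, $3 }' "$1"
}
```

Usage: fstab_uuids /tmp/fstab-from-backup. Comment lines and entries that mount by device path or label are skipped.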

B4. Mount top-level, receive subvolumes from Pi

mkdir -p /mnt/restore
mount -o subvolid=5 /dev/vda3 /mnt/restore

for sv in root var home var_log var_lib_microshift var_lib_pvc var_lib_containers data; do
  # Newest snapshot on the Pi; substitute an older name to restore an earlier state
  SNAP=$(ssh mko@10.0.0.12 "ls /backup/btrfs/server/${sv}/ | sort | tail -1")
  ssh mko@10.0.0.12 "btrfs send /backup/btrfs/server/${sv}/${SNAP}" \
    | btrfs receive /mnt/restore/
  # Convert read-only snapshot to writable subvolume
  btrfs subvolume snapshot "/mnt/restore/${SNAP}" "/mnt/restore/${sv}_new"
  btrfs subvolume delete "/mnt/restore/${SNAP}"
  mv "/mnt/restore/${sv}_new" "/mnt/restore/${sv}"
done

B5. Restore /boot and reinstall GRUB

mkdir -p /mnt/boot /mnt/efi /mnt/sysroot
mount /dev/vda2 /mnt/boot
mount /dev/vda1 /mnt/efi
rsync -aAX /mnt/restore/var/lib/boot-backup/boot/ /mnt/boot/
rsync -aAX /mnt/restore/var/lib/boot-backup/efi/  /mnt/efi/

mount -o subvol=root /dev/vda3 /mnt/sysroot
# ... mount remaining subvolumes ...
for d in dev dev/pts proc sys run; do mount --bind /$d /mnt/sysroot/$d; done
chroot /mnt/sysroot grub2-install /dev/vda
chroot /mnt/sysroot grub2-mkconfig -o /boot/grub2/grub.cfg

touch /mnt/sysroot/.autorelabel   # trigger SELinux relabel on first boot
umount -R /mnt/sysroot
reboot

After reboot, SELinux relabels all files (a few minutes), then a second reboot brings the system fully up.

Recovery point

The Pi holds 24 hourly + 8 daily + 5 weekly snapshots per subvolume. You can recover to any point within that five-week window by selecting an older snapshot in step B4.
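
Picking the right snapshot for a given recovery point can be scripted. The sketch below assumes snapshot names end in a .YYYYMMDDTHHMM timestamp, as in the var_lib_pvc.20260515T0200 example earlier. Other btrbk timestamp formats would need a different comparison. The helper name is hypothetical.

```shell
#!/usr/bin/env bash
# pick_snapshot CUTOFF
# Reads snapshot names on stdin and prints the newest one whose timestamp
# is at or before CUTOFF (same YYYYMMDDTHHMM form as the names).
pick_snapshot() {
  local cutoff="$1" name stamp best=""
  while IFS= read -r name; do
    stamp="${name##*.}"
    # Timestamps in this format sort correctly as plain strings
    if [[ ! "$stamp" > "$cutoff" ]]; then
      if [[ -z "$best" || "${best##*.}" < "$stamp" ]]; then
        best="$name"
      fi
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}
```

Usage: ssh mko@10.0.0.12 "ls /backup/btrfs/server/var_lib_pvc/" | pick_snapshot 20260514T1200. The function returns non-zero when no snapshot is old enough.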


Key Lessons Learned

Btrfs quotas kill etcd. Don’t enable them. The combination of hundreds of snapshots + hourly btrbk cleanup + qgroup rescans will crash MicroShift reliably every hour. See Step 2 for full explanation.

MicroShift SCCs are opt-in. Every service account that runs a privileged workload needs an explicit SCC binding. Don’t assume privileged is inherited. Check with kubectl auth can-i use scc/privileged --as=system:serviceaccount:<ns>:<sa>.

nginx in OpenShift must run as non-root. The standard nginx:alpine image listens on port 80 and requires root. Use nginxinc/nginx-unprivileged:alpine on port 8080 instead. MicroShift’s HAProxy router handles TLS termination and proxies to port 8080 with the original Host header.

kindnet’s POD_SUBNET must match MicroShift’s pod CIDR. The default kindnet value is 10.244.0.0/16. MicroShift uses 10.42.0.0/16. Mismatch causes subtle masquerade failures that only appear with IP-whitelisted applications. MicroShift 4.21+ ships internal kindnet manifests that deploy with the wrong default regardless of cniPlugin: none — fix via kustomizePaths in /etc/microshift/config.d/ to exclude /usr/lib/microshift/manifests.d/000-microshift-kindnet/.

Use hostNetwork: true for mail ports, not hostPort. hostPort alone routes new connections through the pod network bridge (kindnet). The source IP arriving at Postfix or Dovecot becomes the bridge address (10.42.0.1), not the real client IP — fail2ban is blind to the attacker. hostNetwork: true connects the pod directly to the host network stack and preserves source IPs end-to-end. Set dnsPolicy: ClusterFirstWithHostNet so cluster-internal DNS (maildb, rspamd) still resolves correctly.

firewalld inter-zone forwarding requires policies. You cannot use --add-forward-port or zone rules to forward traffic between zones. Create explicit Policy objects with ingress/egress zones.
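
A minimal sketch of such a policy using firewall-cmd's policy options. The policy and zone names in the usage line are placeholders, not the ones used in this setup, and the FWCMD override exists only so the command sequence can be previewed with echo.

```shell
#!/usr/bin/env bash
FWCMD="${FWCMD:-firewall-cmd}"   # set FWCMD=echo to preview the commands

# allow_zone_forward POLICY INGRESS_ZONE EGRESS_ZONE
# Creates a permanent policy that accepts traffic forwarded from the
# ingress zone to the egress zone, then reloads firewalld.
allow_zone_forward() {
  local policy="$1" ingress="$2" egress="$3"
  "$FWCMD" --permanent --new-policy="$policy" &&
  "$FWCMD" --permanent --policy="$policy" --add-ingress-zone="$ingress" &&
  "$FWCMD" --permanent --policy="$policy" --add-egress-zone="$egress" &&
  "$FWCMD" --permanent --policy="$policy" --set-target=ACCEPT &&
  "$FWCMD" --reload
}
```

Usage: allow_zone_forward wg-to-public wireguard public. A tighter setup would leave the policy's default target in place and add specific services or rich rules to it instead of ACCEPT.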

Dovecot 2.4 is a breaking upgrade from 2.3. Dozens of config options were renamed, removed, or changed defaults. Budget significant time for migration testing. Key gotcha: MySQL passwords in mysql{} config blocks cannot use %{env:VAR} syntax — pass them as mounted secret files.

GitOps changes don’t restart pods. Flux applies ConfigMap changes but doesn’t trigger pod restarts. If your application doesn’t watch for config file changes, restart manually with kubectl rollout restart deployment/<name> -n <ns>.


Repository Structure

configuration/
├── kustomization.yaml          ← Flux entry point
├── flux-system/                ← Flux’s own manifests
├── acme-dns/                   ← acme-dns service
├── cert-manager/               ← cert-manager + ClusterIssuer + Certificate
├── pihole/                     ← DNS server
├── vaultwarden/                ← Password manager
├── paperless/                  ← Document management
├── nextcloud/                  ← File sync + MariaDB
├── collabora/                  ← Online Office
├── mailstack/                  ← Postfix + Dovecot + Rspamd + ClamAV + Redis + MariaDB + PostfixAdmin
├── monitoring/                 ← Alloy + kube-state-metrics
├── monitoring-grafana/         ← Grafana Operator + dashboards
└── homepage/                   ← Static website