
ARM64 k3s Cluster Operations: Turing Pi, Node Management, and Recovery

An operations guide for running a 4-node ARM64 k3s cluster on Turing RK1 compute modules in a Turing Pi 2 board. This covers the non-obvious parts of keeping a bare-metal homelab cluster healthy: boot configuration quirks on Rockchip BSP kernels, the cgroup v1 to v2 migration, k3s’s dynamic TLS certificate system and how it breaks agents after server reboots, kube-proxy iptables recovery after power loss, disk pressure from uncontrolled logs, and recovering bricked nodes through the BMC serial console.

Everything here was learned the hard way through actual cluster failures. Each section includes the root cause analysis, the fix, and the gotchas encountered along the way.


| Node | IP | k3s Role | Turing Pi Slot | Notes |
|---|---|---|---|---|
| rock1 | 10.0.0.9 | control-plane (server) | 1 | API server, etcd, scheduler |
| rock2 | 10.0.0.10 | agent | 2 | |
| rock3 | 10.0.0.11 | agent | 3 | NFS server (~460 GB at /data) |
| rock4 | 10.0.0.12 | agent | 4 | |

All nodes: Turing RK1 (Rockchip RK3588, 8 GB RAM, 29 GB SD card), Ubuntu 22.04, kernel 5.10.160-rockchip (BSP), k3s v1.34.3+k3s3, containerd v2.1.5.

| Component | Version |
|---|---|
| k3s | v1.34.3+k3s3 |
| containerd | v2.1.5-k3s1 |
| Kernel | 5.10.160-rockchip (BSP) |
| Ubuntu | 22.04 LTS (Jammy) |
| U-Boot | Rockchip (vendor) |
| Turing Pi BMC | (accessible at BMC IP on the LAN) |
| tpi CLI | Latest from Turing Pi |
inventory.yml
k3s_cluster:
  children:
    server:
      hosts:
        10.0.0.9:
    agent:
      hosts:
        10.0.0.10:
        10.0.0.11:
        10.0.0.12:
  vars:
    ansible_port: 22
    ansible_user: your_user
    ansible_python_interpreter: /usr/bin/python3
    k3s_version: v1.34.3+k3s3
    extra_server_args: "--disable traefik --disable servicelb --node-taint node-role.kubernetes.io/control-plane=:PreferNoSchedule"

The server has --disable traefik --disable servicelb because both are replaced with custom deployments (see the Traefik and monitoring guides). The --node-taint flag applies a PreferNoSchedule taint to the control plane node, discouraging (but not preventing) workload pods from landing on the server node. See Part 12 for why this is critical.


The Turing RK1 modules run a Rockchip BSP U-Boot that loads its configuration from /boot/firmware/. The key files:

| File | Purpose |
|---|---|
| /boot/firmware/ubuntuEnv.txt | Kernel cmdline (bootargs=), DTB file, overlays |
| /boot/firmware/boot.cmd | U-Boot script source (human-readable) |
| /boot/firmware/boot.scr | Compiled U-Boot script (binary, loaded by U-Boot) |

To change kernel boot parameters, edit ubuntuEnv.txt. You do not need to recompile boot.scr — U-Boot reads ubuntuEnv.txt at boot and substitutes the variables into the boot script.

bootargs=root=UUID=<uuid> rootfstype=ext4 rootwait rw console=ttyS9,115200 console=ttyS2,1500000 console=tty1 systemd.unified_cgroup_hierarchy=1
fdtfile=rk3588-turing-rk1.dtb
overlay_prefix=rk3588
overlays=

The RK1 modules ship with cgroup v1 explicitly enabled on the kernel cmdline. Kubernetes and containerd have deprecated cgroup v1 support, and the 5.10 Rockchip BSP kernel supports cgroup v2 via the systemd unified hierarchy, so there is no reason to stay on v1.

The original bootargs contain these cgroup v1 flags:

cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=0

These need to be replaced with:

systemd.unified_cgroup_hierarchy=1

Swap also needs to be disabled — k3s on cgroup v2 with swap enabled produces tmpfs-noswap warnings and kubelet considers memory-backed volumes (secrets, emptyDirs) insecure because they could be swapped to disk.
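Before letting Ansible touch all four nodes, the substitution can be sanity-checked with plain sed on a scratch copy of the file (a sketch; the sample bootargs line here is illustrative, not copied from a real node):

```shell
# Build a scratch copy with a representative bootargs line, then apply
# the same substitution the playbook below performs. Point FILE at
# /boot/firmware/ubuntuEnv.txt on a real node once the pattern checks out.
FILE=/tmp/ubuntuEnv.txt
printf '%s\n' \
  'bootargs=root=UUID=abcd rootwait rw cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=0 console=tty1' \
  'fdtfile=rk3588-turing-rk1.dtb' > "$FILE"

# Note the literal ' *' (space then star) at the end of the pattern. In the
# Ansible regex, a trailing '\s*' would also match the newline and merge the
# fdtfile line into bootargs (see the boot recovery section).
sed -i 's/cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd\.unified_cgroup_hierarchy=0 */systemd.unified_cgroup_hierarchy=1 /' "$FILE"
cat "$FILE"
```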

The playbook processes agents first (one at a time, rolling), then the server last. Each node is drained, rebooted, verified, and uncordoned before moving to the next.

cgroup-v2-swap-off.yml
- name: Disable swap and enable cgroup v2 on agent nodes
  hosts: agent
  become: yes
  serial: 1
  tasks:
    - name: Disable swap immediately
      ansible.builtin.command: swapoff -a
      changed_when: true

    - name: Remove swap entry from fstab
      ansible.builtin.lineinfile:
        path: /etc/fstab
        regexp: '^\s*/swapfile\s'
        state: absent

    - name: Delete swapfile
      ansible.builtin.file:
        path: /swapfile
        state: absent

    - name: Update bootargs — remove cgroup v1 flags and enable cgroup v2
      ansible.builtin.replace:
        path: /boot/firmware/ubuntuEnv.txt
        # NOTE: Use a literal ' *' at the end, NOT '\s*' — '\s' also matches
        # the newline and merges the next line into bootargs
        # (see the Boot Configuration section).
        regexp: 'cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd\.unified_cgroup_hierarchy=0 *'
        replace: "systemd.unified_cgroup_hierarchy=1 "

    - name: Drain node before reboot
      delegate_to: localhost
      become: no
      ansible.builtin.command: >
        kubectl drain {{ ansible_hostname }}
        --ignore-daemonsets --delete-emptydir-data --timeout=120s --force

    - name: Reboot into cgroup v2
      ansible.builtin.reboot:
        reboot_timeout: 300
        msg: "Rebooting for cgroup v2 + swap off"
        pre_reboot_delay: 5
        post_reboot_delay: 30

    - name: Restart k3s-agent to refresh certificates
      ansible.builtin.systemd:
        name: k3s-agent
        state: restarted

    - name: Wait for k3s-agent to be active
      ansible.builtin.shell: systemctl is-active k3s-agent
      register: k3s_status
      until: k3s_status.rc == 0
      retries: 18
      delay: 10
      changed_when: false

    - name: Uncordon node
      delegate_to: localhost
      become: no
      ansible.builtin.command: kubectl uncordon {{ ansible_hostname }}

    - name: Wait for node to be Ready
      delegate_to: localhost
      become: no
      ansible.builtin.shell: >
        kubectl get node {{ ansible_hostname }}
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
      register: node_ready
      until: node_ready.stdout == "True"
      retries: 30
      delay: 10
      changed_when: false

    - name: Verify cgroup v2 is active
      ansible.builtin.shell: stat -f --format=%T /sys/fs/cgroup
      register: cgroupfs
      changed_when: false
      failed_when: cgroupfs.stdout != "cgroup2fs"

    - name: Verify swap is off
      ansible.builtin.command: swapon --show
      register: swap_check
      changed_when: false
      failed_when: swap_check.stdout | length > 0

- name: Disable swap and enable cgroup v2 on server node
  hosts: server
  become: yes
  tasks:
    # Same tasks as above, but:
    # - service name is 'k3s', not 'k3s-agent'
    # - post_reboot_delay is 60 (server takes longer — etcd bootstrap)
    # - the server is done last to avoid cascading agent failures

After the server reboots, its dynamic listener TLS cert may change. Agents that booted before or concurrently with the server will cache the old CA and fail with x509: certificate signed by unknown authority. The playbook includes a Restart k3s-agent to refresh certificates step to handle this, but it’s not always sufficient. See Part 3 for the full explanation and fix.

Terminal window
# On each node:
stat -f --format=%T /sys/fs/cgroup
# Expected: cgroup2fs
swapon --show
# Expected: (empty output)
free -h | grep Swap
# Expected: Swap: 0B 0B 0B

After migration, containerd v2.1 on k3s 1.34 produces an InvalidDiskCapacity warning on every kubelet start:

Warning InvalidDiskCapacity kubelet invalid capacity 0 on image filesystem

This is caused by disable_snapshot_annotations = true in the auto-generated containerd config. It’s cosmetic — disk capacity reporting works fine. It will resolve with a future k3s upgrade.


This is the single most disruptive issue on this cluster. After a server reboot (or power cycle), agents frequently get stuck with:

tls: failed to verify certificate: x509: certificate signed by unknown authority

k3s has two separate CA systems and a dynamic TLS certificate:

| Component | Path (server) | Purpose |
|---|---|---|
| Server CA | /var/lib/rancher/k3s/server/tls/server-ca.crt | Signs the API server’s serving certificate |
| Client CA | /var/lib/rancher/k3s/server/tls/client-ca.crt | Signs kubelet/kube-proxy client certs |
| Dynamic listener cert | /var/lib/rancher/k3s/server/tls/dynamic-cert.json | The actual TLS cert presented on port 6443, managed by rancher/dynamiclistener |

Agents cache the server CA and client CA locally:

| Component | Path (agent) |
|---|---|
| Cached server CA | /var/lib/rancher/k3s/agent/server-ca.crt |
| Cached client CA | /var/lib/rancher/k3s/agent/client-ca.crt |

The dynamic listener cert is stored in three places: a Kubernetes Secret (k3s-serving in kube-system), a local file cache, and in memory. On server startup, before etcd is available, the dynamic listener uses the file cache. If this cert doesn’t match what agents expect (because the cert was regenerated, or the agent has a stale CA from a previous boot), agents can’t verify the server and enter a TLS error loop.

The critical point: systemctl restart k3s-agent does not fix this. The agent reloads the same stale CA certs from disk. You must delete the cached CA files so the agent re-fetches them from the server’s /cacerts endpoint.

This procedure fixes the common case where the server CA has NOT changed but agents have stale cached copies (e.g. after a server reboot or power cycle).

Step 1: Verify the server is up and the API is working:

Terminal window
# From the server node itself:
kubectl get nodes --kubeconfig /etc/rancher/k3s/k3s.yaml
# Or via ansible:
ansible server -i inventory.yml -m shell \
-a "kubectl get nodes --kubeconfig /etc/rancher/k3s/k3s.yaml" \
-b -e "ansible_become_pass=<pass>"

Step 2: On each broken agent, stop the agent, delete cached certs, start:

Terminal window
# Via ansible (one node at a time):
ansible <agent_ip> -i inventory.yml -m shell \
-a "systemctl stop k3s-agent; \
sleep 2; \
rm -f /var/lib/rancher/k3s/agent/server-ca.crt \
/var/lib/rancher/k3s/agent/client-ca.crt; \
nohup systemctl start k3s-agent &" \
-b -e "ansible_become_pass=<pass>"

Step 3: Wait 60-90 seconds, then verify:

Terminal window
kubectl get nodes

If an agent is stuck and you’re not sure whether it’s a cert issue, compare the CA fingerprints:

Terminal window
# On the server — the authoritative CA:
openssl x509 -in /var/lib/rancher/k3s/server/tls/server-ca.crt \
-noout -fingerprint -sha256
# On the agent — what it has cached:
openssl x509 -in /var/lib/rancher/k3s/agent/server-ca.crt \
-noout -fingerprint -sha256
# What the server's /cacerts endpoint returns:
curl -sk https://<server_ip>:6443/cacerts | \
openssl x509 -noout -fingerprint -sha256

If the agent’s fingerprint doesn’t match the server’s, the cert is stale.
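The comparison can be wrapped in a small helper for ad-hoc checks or for extending the watchdog (a sketch; the default k3s paths from the tables above are assumed, and the curl invocation in the comment mirrors the /cacerts check shown earlier):

```shell
#!/bin/bash
# Print the SHA-256 fingerprint of a certificate file.
fp() {
  openssl x509 -in "$1" -noout -fingerprint -sha256 | cut -d= -f2
}

# Compare two CA files; prints "match" or "stale".
compare_ca() {
  if [ "$(fp "$1")" = "$(fp "$2")" ]; then
    echo "match"
  else
    echo "stale"
  fi
}

# On an agent, compare the cached CA against what the server actually serves:
#   curl -sk https://<server_ip>:6443/cacerts > /tmp/server-ca.crt
#   compare_ca /tmp/server-ca.crt /var/lib/rancher/k3s/agent/server-ca.crt
```

"stale" means the agent needs the cached-cert cleanup procedure above (or, if that fails, the full reinstall in Part 11).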

| Approach | Difficulty | Effect |
|---|---|---|
| Custom CA certs (pre-created before first server start) | High (requires reinstall) | CA never changes across reboots |
| --tls-san on server config | Low | Reduces dynamic cert regeneration |
| Health watchdog with TLS detection | Medium | Auto-recovers agents (see Part 4) |
| Boot order: server first, agents after | Low | Avoids race condition |
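The --tls-san mitigation can also live in the server's config file rather than on the command line. A sketch, where the extra DNS name is an assumption; list every name and IP agents actually use to reach the server so the dynamic listener does not need to regenerate its cert:

```yaml
# /etc/rancher/k3s/config.yaml on the server
tls-san:
  - "10.0.0.9"
  - "rock1"
  - "k3s.example.local"   # assumed extra DNS name; replace with yours
```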

A systemd timer that runs every 5 minutes on each node and detects two failure modes:

  1. Missing kube-proxy iptables rules (common after power loss) — attempts a non-disruptive canary chain resync before falling back to a service restart
  2. Any API failure (stale certs, unauthorized, connection refused) — always clears cached CA certs before restarting

Key safety features:

  • 10-minute cooldown: skips if the agent was restarted recently, preventing a death loop
  • Server reachability gate: won’t restart the agent if the server API is unreachable (would just get bad certs)
#!/bin/bash
# k3s agent health watchdog
SERVICE=k3s-agent
KUBECONFIG=/var/lib/rancher/k3s/agent/kubelet.kubeconfig
NODENAME=$(hostname)
SERVER_URL="https://10.0.0.1:6443"  # Replace with your server IP
COOLDOWN_SECONDS=600                # 10 minutes
CERT_DIR=/var/lib/rancher/k3s/agent

# ---- Cooldown check ----
# If the agent was (re)started recently, don't interfere — give it
# time to stabilize.
ACTIVE_ENTER=$(systemctl show "$SERVICE" \
  --property=ActiveEnterTimestamp --value 2>/dev/null)
if [ -n "$ACTIVE_ENTER" ]; then
  ENTER_EPOCH=$(date -d "$ACTIVE_ENTER" +%s 2>/dev/null || echo 0)
  NOW_EPOCH=$(date +%s)
  AGE=$(( NOW_EPOCH - ENTER_EPOCH ))
  if [ "$AGE" -lt "$COOLDOWN_SECONDS" ]; then
    logger -t k3s-watchdog \
      "Agent started ${AGE}s ago (cooldown ${COOLDOWN_SECONDS}s) — skipping"
    exit 0
  fi
fi

if ! systemctl is-active --quiet "$SERVICE"; then
  logger -t k3s-watchdog "Agent is not active — skipping"
  exit 0
fi

# ---- Check 1: kube-proxy iptables rules ----
if [ "$(iptables-save 2>/dev/null | grep -c KUBE-SVC)" -eq 0 ]; then
  logger -t k3s-watchdog "kube-proxy iptables rules missing — deleting canary chains"
  # kube-proxy watches its canary chains and does a full rule resync
  # when they disappear — no service restart needed.
  for table in mangle nat filter; do
    iptables -t "$table" -F KUBE-PROXY-CANARY 2>/dev/null
    iptables -t "$table" -X KUBE-PROXY-CANARY 2>/dev/null
  done
  sleep 40
  if [ "$(iptables-save 2>/dev/null | grep -c KUBE-SVC)" -eq 0 ]; then
    logger -t k3s-watchdog "Canary resync failed — will restart in API check"
  else
    logger -t k3s-watchdog "kube-proxy iptables rules restored via canary resync"
  fi
fi

# ---- Check 2: API reachability ----
# Use "get node $NODENAME", not "get nodes" — the kubelet kubeconfig
# (system:node:<name>) only has RBAC to read its own node object.
API_ERR=$(kubectl --kubeconfig="$KUBECONFIG" get node "$NODENAME" 2>&1)
API_RC=$?
if [ "$API_RC" -eq 0 ]; then
  exit 0
fi
logger -t k3s-watchdog "API check failed for $NODENAME (rc=$API_RC): $API_ERR"

# ---- Server reachability gate ----
if ! curl -sk --max-time 5 "$SERVER_URL/cacerts" >/dev/null 2>&1; then
  logger -t k3s-watchdog "Server API not reachable — skipping restart"
  exit 0
fi

# ---- Restart with cert cleanup ----
# Always clear cached CA certs. Stale certs manifest as various errors:
#   - x509: certificate signed by unknown authority
#   - connection refused (local proxy can't auth to server)
#   - 401 Unauthorized (server rejects stale client cert)
logger -t k3s-watchdog "Clearing cached CA certs and restarting $SERVICE"
systemctl stop "$SERVICE"
sleep 2
rm -f "$CERT_DIR/client-ca.crt" "$CERT_DIR/server-ca.crt"
systemctl start "$SERVICE"
logger -t k3s-watchdog "Agent restarted with fresh certs"

The server version only checks kube-proxy iptables. It intentionally does not check API auth or restart the k3s server, because restarting the server would invalidate all agent tokens and cause a cascading failure.

#!/bin/bash
# k3s server health watchdog
SERVICE=k3s

if ! systemctl is-active --quiet "$SERVICE"; then
  exit 0
fi

if [ "$(iptables-save 2>/dev/null | grep -c KUBE-SVC)" -eq 0 ]; then
  logger -t k3s-watchdog "kube-proxy iptables rules missing on server"
  for table in mangle nat filter; do
    iptables -t "$table" -F KUBE-PROXY-CANARY 2>/dev/null
    iptables -t "$table" -X KUBE-PROXY-CANARY 2>/dev/null
  done
  sleep 40
  if [ "$(iptables-save 2>/dev/null | grep -c KUBE-SVC)" -eq 0 ]; then
    logger -t k3s-watchdog "Canary resync failed — restarting $SERVICE as last resort"
    systemctl restart "$SERVICE"
  fi
fi

The watchdog is deployed via Ansible as an inline copy task — the script is embedded directly in the playbook rather than maintained as a separate file:

k3s-agent-health.yml
- name: Deploy k3s health watchdog on agent nodes
  hosts: agent
  become: yes
  tasks:
    - name: Install k3s health check script
      ansible.builtin.copy:
        dest: /usr/local/bin/k3s-health-check
        mode: "0755"
        content: |
          #!/bin/bash
          # (full script from above)

    - name: Install systemd timer for health check
      ansible.builtin.copy:
        dest: /etc/systemd/system/k3s-health.timer
        content: |
          [Unit]
          Description=k3s health watchdog timer

          [Timer]
          OnBootSec=3min
          OnUnitActiveSec=5min
          AccuracySec=30s

          [Install]
          WantedBy=timers.target

    - name: Install systemd service for health check
      ansible.builtin.copy:
        dest: /etc/systemd/system/k3s-health.service
        content: |
          [Unit]
          Description=k3s health watchdog
          After=k3s-agent.service

          [Service]
          Type=oneshot
          ExecStart=/usr/local/bin/k3s-health-check

    - name: Reload systemd and enable timer
      ansible.builtin.systemd:
        daemon_reload: yes

    - name: Enable and start the health check timer
      ansible.builtin.systemd:
        name: k3s-health.timer
        enabled: yes
        state: started
Terminal window
# View recent watchdog actions:
journalctl -t k3s-watchdog --no-pager --since '1 hour ago'
# Check timer status:
systemctl list-timers k3s-health.timer

With 29 GB SD cards, disk space is a constant concern. The main consumers are journal logs, containerd rotated logs, and monitoring data.

By default, journald can grow to 2.7-2.8 GB per node. Set a permanent limit:

/etc/systemd/journald.conf.d/size.conf
[Journal]
SystemMaxUse=256M

Deploy and apply:

Terminal window
ansible all -i inventory.yml -m copy \
-a "dest=/etc/systemd/journald.conf.d/size.conf content='[Journal]\nSystemMaxUse=256M\n'" \
-b -e "ansible_become_pass=<pass>"
ansible all -i inventory.yml -m shell \
-a "journalctl --vacuum-size=256M && systemctl restart systemd-journald" \
-b -e "ansible_become_pass=<pass>"

Containerd creates rotated logs (*.gz files) in /var/lib/rancher/k3s/agent/containerd/. These accumulate silently. Clean them periodically:

Terminal window
ansible all -i inventory.yml -m shell \
-a "find /var/lib/rancher/k3s/agent/containerd/ -name '*.gz' -delete" \
-b -e "ansible_become_pass=<pass>"

Ubuntu’s rsyslog can also accumulate large rotated logs. Configure aggressive rotation:

/etc/logrotate.d/rsyslog-aggressive
/var/log/syslog /var/log/kern.log /var/log/auth.log {
    daily
    rotate 3
    compress
    delaycompress
    missingok
    notifempty
    maxsize 50M
}

Loki’s TSDB shipper writes index and cache data. If configured with an emptyDir volume, this writes to the node’s disk and causes disk pressure. Move it to NFS:

# In the Loki configmap, change the tsdb_shipper paths:
schema_config:
  configs:
    - from: "2024-09-01"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index  # was /local/tsdb-index
    cache_location: /loki/tsdb-cache          # was /local/tsdb-cache
  filesystem:
    directory: /loki/chunks

Remove the emptyDir volume from the StatefulSet and make sure /loki/ is backed by the NFS PVC.
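For reference, the NFS backing can look like this (a sketch; the claim name, namespace, and storage class are assumptions, so match whatever the Loki StatefulSet actually mounts at /loki/):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-data              # assumed name — match the StatefulSet's volume
  namespace: monitoring
spec:
  accessModes: ["ReadWriteMany"]   # NFS supports RWX
  storageClassName: nfs            # assumes an NFS-backed StorageClass
  resources:
    requests:
      storage: 20Gi
```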


Swap is disabled as part of the cgroup v2 migration (see Part 2), but if you need to do it separately:

Terminal window
# Immediate:
swapoff -a
# Permanent — remove from fstab:
sed -i '/\/swapfile/d' /etc/fstab
# Reclaim disk space:
rm -f /swapfile # Frees ~2 GB per node

The RK1 modules have 8 GB RAM, which is enough for the workloads running on this cluster. With swap enabled on cgroup v2, kubelet produces tmpfs-noswap warnings because it can’t guarantee that memory-backed volumes (Kubernetes secrets, emptyDirs) stay in RAM.


When a node won’t boot (e.g. due to a corrupted ubuntuEnv.txt — see the regex gotcha in Part 1), you can recover it through the Turing Pi BMC’s serial console without physically accessing the board.

The tpi CLI tool must be installed and configured to talk to the BMC’s IP address on your LAN.

Terminal window
# Read serial output from slot 2 (rock2):
tpi uart -n 2 get
# Send a command to the serial console:
tpi uart -n 2 set -c 'ls /boot/firmware/'

Recovery scenario: corrupted ubuntuEnv.txt


This happened when an Ansible regex consumed a newline, merging fdtfile=rk3588-turing-rk1.dtb into the bootargs= line. U-Boot couldn’t find the DTB and the node dropped to a U-Boot shell.

Step 1: Power cycle the node and catch the U-Boot prompt:

Terminal window
tpi power -n 2 off
sleep 2
tpi power -n 2 on
# Watch serial output:
tpi uart -n 2 get

Step 2: If the node drops to a U-Boot shell, boot manually:

Terminal window
# Load the kernel, DTB, and initrd from the SD card:
tpi uart -n 2 set -c 'load mmc 1:1 ${kernel_addr_r} /boot/vmlinuz'
tpi uart -n 2 set -c 'load mmc 1:1 ${fdt_addr_r} /boot/dtbs/5.10.160-rockchip/rockchip/rk3588-turing-rk1.dtb'
tpi uart -n 2 set -c 'load mmc 1:1 ${ramdisk_addr_r} /boot/initrd.img'
tpi uart -n 2 set -c 'booti ${kernel_addr_r} ${ramdisk_addr_r} ${fdt_addr_r}'

Step 3: Once Linux boots, SSH in and fix ubuntuEnv.txt:

Terminal window
ssh your_user@10.0.0.10
sudo vi /boot/firmware/ubuntuEnv.txt
# Ensure bootargs and fdtfile are on separate lines
sudo reboot
Terminal window
# Power cycle a single slot:
tpi power -n <slot> off && sleep 2 && tpi power -n <slot> on
# Power cycle all slots:
tpi power -n 1 off && tpi power -n 2 off && tpi power -n 3 off && tpi power -n 4 off
sleep 2
tpi power -n 1 on && tpi power -n 2 on && tpi power -n 3 on && tpi power -n 4 on

On a cluster with ~265k active series, 59 scrape targets, and a 30s scrape interval, Prometheus uses approximately 1 GB of memory at steady state.

The default memory request from many Helm charts (512Mi) is too low. Set it to match actual usage to prevent OOM kills and ensure the scheduler places the pod on a node with sufficient capacity:

# In the Prometheus CR or manifest:
resources:
  requests:
    cpu: 200m
    memory: 1Gi
  limits:
    memory: 2Gi

To right-size Prometheus, query its own metrics:

# Current RSS:
process_resident_memory_bytes{job="prometheus"}
# Active series count:
prometheus_tsdb_head_series
# Scrape target count:
count(up)
# Ingestion rate:
rate(prometheus_tsdb_head_samples_appended_total[5m])
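Those same metrics can drive an alert so a creeping series count is caught before the next OOM kill. A sketch in PrometheusRule form, where the rule name and the 1.8 GB threshold (roughly 90% of the 2Gi limit above) are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-self-sizing   # assumed name
  namespace: monitoring
spec:
  groups:
    - name: prometheus-self-sizing
      rules:
        - alert: PrometheusMemoryNearLimit
          # RSS within ~10% of the 2Gi container limit
          expr: process_resident_memory_bytes{job="prometheus"} > 1.8e9
          for: 15m
          labels:
            severity: warning
```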

After node reboots, agent restarts, or operator reconciliation conflicts, pods can be left in Error or Completed state. These consume API server resources and clutter kubectl get pods output.

Terminal window
# Delete all Error pods cluster-wide:
kubectl get pods -A --field-selector=status.phase=Failed \
-o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' | \
while read ns name; do kubectl delete pod -n "$ns" "$name"; done
# Delete all Completed (Succeeded) pods:
kubectl get pods -A --field-selector=status.phase=Succeeded \
-o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' | \
while read ns name; do kubectl delete pod -n "$ns" "$name"; done

Or more concisely:

Terminal window
kubectl delete pods -A --field-selector=status.phase=Failed
kubectl delete pods -A --field-selector=status.phase=Succeeded
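To keep this from becoming a manual chore, the concise form can run nightly as a CronJob. A sketch: the image tag and the pod-cleanup ServiceAccount are assumptions, and the ServiceAccount needs RBAC to list and delete pods cluster-wide:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-cleanup
  namespace: kube-system
spec:
  schedule: "0 3 * * *"               # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-cleanup     # assumed SA with pod list/delete RBAC
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: rancher/kubectl:v1.34.0  # assumed tag — pin to your cluster version
              command:
                - sh
                - -c
                - |
                  kubectl delete pods -A --field-selector=status.phase=Failed
                  kubectl delete pods -A --field-selector=status.phase=Succeeded
```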

KEDA’s ScaledObjects can conflict with Kubernetes operators that manage the same workloads.

If a KEDA ScaledObject targets the prometheus-prometheus StatefulSet directly, KEDA will scale it (e.g. to 4 replicas), but the Prometheus Operator’s Prometheus CR has replicas: 1. The operator reconciles constantly, trying to scale back down, while KEDA keeps scaling up. This creates crash-looping pods with volume mount failures because the PVCs don’t exist for the extra replicas.

Fix: Delete the KEDA ScaledObject for Prometheus. Let the Prometheus Operator manage the replica count through its CR. If you need Prometheus autoscaling, do it through the Prometheus CR’s replicas field or use a VPA instead.

KEDA ScaledObjects that use Prometheus as a trigger source need the correct service address. The Prometheus Operator creates a service called prometheus-operated (headless) and you typically create a ClusterIP service with a shorter name. Make sure the serverAddress in ScaledObject triggers matches your actual service:

triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      # NOT: prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090

When k3s first starts, it generates a node-join token stored at /var/lib/rancher/k3s/server/token:

K10<ca_hash>::server:<password>

The K10 prefix tells the agent to verify the server’s CA fingerprint against <ca_hash> during bootstrap. The <password> is the actual authentication credential.

If you reinstall or reset the k3s server (e.g. k3s-uninstall.sh then fresh install), a new CA is generated. But:

  1. The node-token file may keep the old CA hash
  2. The agent service files (created by Ansible or the installer) still have the old K10<old_hash>::server:<password> token
  3. The agent’s private keys and client certs (under /var/lib/rancher/k3s/agent/) were signed by the old CA

The result is agents that appear to connect intermittently but fail authentication. The error messages vary depending on timing:

  • x509: certificate signed by unknown authority (TLS handshake)
  • 401 Unauthorized (server rejects old client cert)
  • connection refused on 127.0.0.1:6443 (local proxy can’t authenticate to server)

Simply deleting cached CA certs (server-ca.crt, client-ca.crt) does not fix this because the agent re-bootstraps using the old token and old private keys, getting certs signed by the old CA again.

Compare the CA timestamps:

Terminal window
# Server's current CA:
openssl x509 -noout -issuer \
< /var/lib/rancher/k3s/server/tls/server-ca.crt
# e.g. issuer=CN = k3s-server-ca@1755079458
# Agent's cached CA:
openssl x509 -noout -issuer \
< /var/lib/rancher/k3s/agent/server-ca.crt
# e.g. issuer=CN = k3s-server-ca@1709313234 <-- MISMATCH
# Agent's client cert (who signed it?):
openssl x509 -noout -issuer \
< /var/lib/rancher/k3s/agent/client-kubelet.crt
# Should match the server's client-ca.crt

If the @timestamp values differ, the CA was rotated and agents need a full reinstall.

Fix: strip the CA pin and reinstall agents


Step 1: Update the token in your Ansible inventory to use only the password (no K10 prefix):

inventory.yml
vars:
  # Do NOT use the K10<hash> form. It pins to a specific CA and
  # breaks when the server CA rotates.
  token: "<your_server_password>"

You can find the password portion from the server’s token file:

Terminal window
# On the server:
cat /var/lib/rancher/k3s/server/token
# K10<hash>::server:<password> <-- use just the <password> part
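Extracting the password portion is a one-liner with bash parameter expansion (a sketch; the token value below is a placeholder, so on a real server read it from the token file instead):

```shell
# Format is K10<hash>::server:<password> — everything after the last
# ':' is the password.
TOKEN='K10aabbccdd::server:supersecretvalue'   # placeholder token
PASSWORD="${TOKEN##*:}"                        # strip through the last ':'
echo "$PASSWORD"                               # prints: supersecretvalue
```

On the server itself: `TOKEN=$(sudo cat /var/lib/rancher/k3s/server/token)` and then the same `${TOKEN##*:}` expansion.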

Step 2: Uninstall and reinstall each agent using Ansible (one at a time):

Terminal window
# Uninstall:
ansible-playbook -i inventory.yml reset.yml \
--limit <agent_ip> -e "ansible_become_pass=<pass>"
# Reinstall:
ansible-playbook -i inventory.yml site.yml \
--limit <agent_ip> -e "ansible_become_pass=<pass>"

Step 3: Verify the agent has the correct CA:

Terminal window
ansible <agent_ip> -i inventory.yml -m shell \
-a "openssl x509 -noout -issuer \
< /var/lib/rancher/k3s/agent/server-ca.crt" \
-b -e "ansible_become_pass=<pass>"

Part 12: Control Plane Taint and Workload Redistribution


Without taints or affinity rules, the k3s scheduler treats all nodes equally. On this cluster, the control plane node (rock1) ended up running 48 of ~68 total pods — the API server, etcd, scheduler, plus nearly every workload. The three agent nodes were underutilized. When rock1 ran out of memory, it cascaded: kubelet OOM-killed pods, the API server became unresponsive, and agents lost connection, marking rock2 and rock3 as NotReady.

k3s’s default scheduler uses resource requests to make placement decisions. If workload manifests don’t declare resource requests (or declare very small ones), the scheduler sees every node as equally available and tends to pack pods onto whichever node is most responsive — usually the control plane, because it’s always up first.

Fix: PreferNoSchedule taint on the control plane


A PreferNoSchedule taint tells the scheduler to avoid placing pods on the node unless no other node can accommodate them. Unlike NoSchedule, it won’t evict existing pods or break DaemonSets.

Apply immediately (runtime):

Terminal window
kubectl taint nodes rock1 node-role.kubernetes.io/control-plane=:PreferNoSchedule

Persist across reinstalls by adding the taint to the Ansible inventory:

inventory.yml
extra_server_args: --disable traefik --disable servicelb --node-taint node-role.kubernetes.io/control-plane=:PreferNoSchedule

The k3s Ansible role templates this into the k3s server service file via extra_server_args, so the taint is applied automatically on every server start.

After applying the taint, existing pods are not evicted (PreferNoSchedule is soft). To redistribute workloads, either restart deployments or drain and uncordon the node:

Terminal window
# Check pod distribution:
kubectl get pods -A -o wide --sort-by=.spec.nodeName | awk '{print $8}' | sort | uniq -c | sort -rn
# Before taint: rock1 had ~48 pods
# After taint + rolling restarts: rock1 dropped to ~3 pods (control plane components only)
| Taint Effect | Behavior | Use Case |
|---|---|---|
| PreferNoSchedule | Scheduler avoids the node but doesn’t guarantee it | Control plane — keep workloads off but allow DaemonSets and overflow |
| NoSchedule | New pods won’t be scheduled unless they tolerate the taint | Dedicated nodes (e.g. GPU, monitoring) |
| NoExecute | Evicts existing pods that don’t tolerate the taint | Draining a node, cordoning for maintenance |
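PreferNoSchedule never blocks a pod outright, but it does penalize the node during scheduler scoring. For workloads that should land on rock1 anyway (node exporters, log shippers), an explicit toleration removes that penalty. A minimal pod-spec fragment:

```yaml
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: PreferNoSchedule
```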

Part 13: Bitnami to Official Image Migration


Bitnami periodically deletes old image tags from Docker Hub. In this cluster, Authentik’s PostgreSQL (bitnami/postgresql:15.4.0-debian-11-r45) and Redis (bitnami/redis:7.2.3-debian-11-r2) had been removed from the registry. The pods only worked because rock1 had the images cached locally from a previous pull. If rock1’s containerd cache was cleared, or if the pod was rescheduled to another node, it would fail with ImagePullBackOff.

Bitnami images use a custom directory layout and configuration system that differs from the official images. For PostgreSQL, Bitnami stores data at /bitnami/postgresql/data, generates postgresql.conf and pg_hba.conf externally, and runs as uid 1001. The official postgres:alpine image uses /var/lib/postgresql/data, expects PGDATA to be set explicitly, and runs as uid 70. Swapping to a newer Bitnami tag would work but keeps the dependency on Bitnami’s packaging. Since the cluster uses raw manifests (not Helm), there’s no benefit to the Bitnami wrapper.

Migration: PostgreSQL (Bitnami → official)


The existing PVCs have data at the Bitnami layout. The key challenge is pointing the official image at the existing data without a full dump/restore.

Bitnami layout (on the PVC):

/bitnami/postgresql/
└── data/
├── PG_VERSION
├── base/
├── global/
├── pg_hba.conf
├── postgresql.conf
└── ...

Official image expectations:

  • PGDATA must point to the actual data directory
  • The data directory must contain postgresql.conf and pg_hba.conf (Bitnami generates these externally; official expects them inside the data directory)

Working configuration:

containers:
  - name: postgres
    image: postgres:15-alpine
    env:
      - name: PGDATA
        value: /var/lib/postgresql/data/pgdata
    volumeMounts:
      - name: data
        mountPath: /var/lib/postgresql/data/pgdata
        subPath: data  # Points to the existing 'data/' subdirectory within the PVC

The subPath: data mount maps the PVC’s data/ directory (where Bitnami stored the actual PostgreSQL files) directly to the PGDATA path. PostgreSQL finds its existing data files and starts without needing initdb.

Redis is simpler because it’s typically used as a cache (no persistent data that needs migration):

containers:
  - name: redis
    image: redis:7.2-alpine
    command: ["redis-server"]
    args: ["--maxmemory", "128mb", "--maxmemory-policy", "allkeys-lru"]
    securityContext:
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      runAsUser: 999
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]

The official redis:alpine image runs as uid 999. Bitnami’s runs as uid 1001. If you have a Redis PVC with AOF/RDB files, you’ll need to chown the files to 999 before switching.
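That chown can be done in-cluster with an init container rather than by hand (a sketch; the volume name redis-data and the mount path are assumptions, and the init container can be dropped after the first successful start):

```yaml
initContainers:
  - name: fix-ownership
    image: busybox:1.36
    # Bitnami wrote the AOF/RDB files as uid 1001; official redis runs as 999.
    command: ["sh", "-c", "chown -R 999:999 /data"]
    securityContext:
      runAsUser: 0          # chown needs root
    volumeMounts:
      - name: redis-data    # assumed volume name — match your StatefulSet
        mountPath: /data
```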

After migration, pin all images to specific tags to prevent silent breakage:

| Service | Old Image | New Image |
|---|---|---|
| Authentik PostgreSQL | bitnami/postgresql:15.4.0-debian-11-r45 | postgres:15-alpine |
| Authentik Redis | bitnami/redis:7.2.3-debian-11-r2 | redis:7.2-alpine |
| Vaultwarden PostgreSQL | postgres:17 | postgres:17.4-alpine |
| Matrix PostgreSQL | postgres:16-alpine | postgres:16-alpine (already pinned) |

Prefer -alpine variants for smaller image size and reduced attack surface on ARM64.


| Symptom | Likely cause | Fix |
|---|---|---|
| Agent: x509: certificate signed by unknown authority | Stale cached CA after server reboot | Stop agent, rm CA certs, start agent (Part 3) |
| Agent: x509 persists after clearing certs | CA was rotated (server reinstall); token has old CA hash | Full agent reinstall with stripped token (Part 11) |
| Agent: Unauthorized on API calls | Stale client certs signed by old CA | Clear certs and restart; if it persists, full reinstall (Part 11) |
| Watchdog creating a restart death loop | No cooldown + narrow TLS detection | Deploy updated watchdog with cooldown and server gate (Part 4) |
| Missing kube-proxy iptables rules | Power loss wiped in-memory iptables | Delete KUBE-PROXY-CANARY chains, or restart agent |
| InvalidDiskCapacity warning | containerd v2.1 + k3s 1.34 cosmetic bug | Ignore (resolves with k3s upgrade) |
| tmpfs-noswap warning | Swap is enabled on cgroup v2 | swapoff -a, remove from fstab |
| CgroupV1 warning on kubelet start | Node still on cgroup v1 | Run the cgroup v2 migration (Part 2) |
| Node won’t boot after config change | Corrupted ubuntuEnv.txt | BMC serial console recovery (Part 7) |
| Disk pressure on agent nodes | Journal/containerd logs filling SD card | Vacuum journals, clean rotated logs (Part 5) |
| Most pods scheduled on control plane node | No taints or affinity rules | Apply PreferNoSchedule taint on server node (Part 12) |
| ImagePullBackOff for Bitnami images | Bitnami deletes old tags from Docker Hub | Migrate to official images (Part 13) |
| apt update fails with EXPKEYSIG 6E2DD2174FA1C3BA | Cloudflare WARP repository GPG key expired | Move /etc/apt/sources.list.d/cloudflare-client.list aside (mv cloudflare-client.list cloudflare-client.list.disabled); not critical — WARP isn’t required for cluster operation |
| CrowdSec banning your IP for http-probing | Dashboard UI 404s exceeding scenario threshold | Parser-level whitelist (see Traefik guide) |
| Need to check/manage CrowdSec without cscli | N/A | Use the Security Dashboard at security-k3s.example.com (see Traefik guide) |
| Prometheus pods crash-looping | KEDA vs Prometheus Operator replica conflict | Delete the KEDA ScaledObject for Prometheus |
ansible-playbooks/my-playbooks/
├── inventory.yml # Cluster inventory (server + 3 agents)
├── k3s-agent-health.yml # Health watchdog deployment
├── cgroup-v2-swap-off.yml # Cgroup v2 migration + swap removal
└── site.yml # Wrapper that imports playbooks
/boot/firmware/ # Per-node boot config
├── ubuntuEnv.txt # Kernel cmdline, DTB, overlays
├── boot.cmd # U-Boot script source
└── boot.scr # Compiled U-Boot script
/var/lib/rancher/k3s/
├── server/tls/ # Server-side TLS (only on rock1)
│ ├── server-ca.crt # Server CA — signs serving certs
│ ├── client-ca.crt # Client CA — signs kubelet certs
│ └── dynamic-cert.json # Dynamic listener cert (file cache)
└── agent/ # Agent-side (rock2-rock4)
├── server-ca.crt # Cached server CA (delete to refresh)
├── client-ca.crt # Cached client CA (delete to refresh)
└── kubelet.kubeconfig # Kubelet credentials
/usr/local/bin/k3s-health-check # Watchdog script (deployed by Ansible)
/etc/systemd/system/
├── k3s-health.timer # 5-minute watchdog timer
└── k3s-health.service # Oneshot service for watchdog