ARM64 k3s Cluster Operations: Turing Pi, Node Management, and Recovery
An operations guide for running a 4-node ARM64 k3s cluster on Turing RK1 compute modules in a Turing Pi 2 board. This covers the non-obvious parts of keeping a bare-metal homelab cluster healthy: boot configuration quirks on Rockchip BSP kernels, the cgroup v1 to v2 migration, k3s’s dynamic TLS certificate system and how it breaks agents after server reboots, kube-proxy iptables recovery after power loss, disk pressure from uncontrolled logs, and recovering bricked nodes through the BMC serial console.
Everything here was learned the hard way through actual cluster failures. Each section includes the root cause analysis, the fix, and the gotchas encountered along the way.
Cluster Layout
| Node | IP | k3s Role | Turing Pi Slot | Notes |
|---|---|---|---|---|
| node1 | 10.0.0.9 | control-plane (server) | 1 | API server, etcd, scheduler |
| node2 | 10.0.0.10 | agent | 2 | |
| node3 | 10.0.0.11 | agent | 3 | NFS server (~460 GB at /data) |
| node4 | 10.0.0.12 | agent | 4 | |
All nodes: Turing RK1 (Rockchip RK3588, 8 GB RAM, 29 GB SD card), Ubuntu 22.04, kernel 5.10.160-rockchip (BSP), k3s v1.34.3+k3s3, containerd v2.1.5.
Component Versions
| Component | Version |
|---|---|
| k3s | v1.34.3+k3s3 |
| Containerd | v2.1.5-k3s1 |
| Kernel | 5.10.160-rockchip (BSP) |
| Ubuntu | 22.04 LTS (Jammy) |
| U-Boot | Rockchip (vendor) |
| Turing Pi BMC | (accessible at BMC IP on the LAN) |
| tpi CLI | Latest from Turing Pi |
Ansible Inventory
```yaml
k3s_cluster:
  children:
    server:
      hosts:
        10.0.0.9:
    agent:
      hosts:
        10.0.0.10:
        10.0.0.11:
        10.0.0.12:
  vars:
    ansible_port: 22
    ansible_user: your_user
    ansible_python_interpreter: /usr/bin/python3
    k3s_version: v1.34.3+k3s3
    extra_server_args: --disable traefik --disable servicelb
```

The server has --disable traefik --disable servicelb because both are replaced with custom deployments (see the Traefik and monitoring guides).
Part 1: Boot Configuration
The Turing RK1 modules run a Rockchip BSP U-Boot that loads its configuration from /boot/firmware/. The key files:
| File | Purpose |
|---|---|
| /boot/firmware/ubuntuEnv.txt | Kernel cmdline (bootargs=), DTB file, overlays |
| /boot/firmware/boot.cmd | U-Boot script source (human-readable) |
| /boot/firmware/boot.scr | Compiled U-Boot script (binary, loaded by U-Boot) |
To change kernel boot parameters, edit ubuntuEnv.txt. You do not need to recompile boot.scr — U-Boot reads ubuntuEnv.txt at boot and substitutes the variables into the boot script.
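For reference, the import usually happens near the top of boot.cmd. The sketch below is illustrative rather than the verbatim vendor script; the device variables and load address may differ on your image:

```sh
# Sketch: how a distro-style boot.cmd typically pulls ubuntuEnv.txt into the
# U-Boot environment. Variable names (devtype, devnum, distro_bootpart,
# kernel_addr_r) follow common distro-boot conventions and may not match the
# vendor script exactly.
load ${devtype} ${devnum}:${distro_bootpart} ${kernel_addr_r} /boot/firmware/ubuntuEnv.txt
env import -t ${kernel_addr_r} ${filesize}
# After the import, bootargs, fdtfile, overlay_prefix and overlays are plain
# U-Boot environment variables that the rest of the script substitutes into
# the boot commands.
```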
ubuntuEnv.txt format
```
bootargs=root=UUID=<uuid> rootfstype=ext4 rootwait rw console=ttyS9,115200 console=ttyS2,1500000 console=tty1 systemd.unified_cgroup_hierarchy=1
fdtfile=rk3588-turing-rk1.dtb
overlay_prefix=rk3588
overlays=
```
Part 2: Cgroup v2 Migration
The RK1 modules ship with cgroup v1 explicitly enabled in the kernel cmdline. Kubernetes and containerd have deprecated cgroup v1 support, and the kernel 5.10 Rockchip BSP supports cgroup v2 with the systemd unified hierarchy.
What needs to change
The original bootargs contain these cgroup v1 flags:
```
cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=0
```

These need to be replaced with:

```
systemd.unified_cgroup_hierarchy=1
```

Swap also needs to be disabled — k3s on cgroup v2 with swap enabled produces tmpfs-noswap warnings and kubelet considers memory-backed volumes (secrets, emptyDirs) insecure because they could be swapped to disk.
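To see which flags a node is currently booted with, inspect the live kernel cmdline:

```sh
# Show the cgroup- and swap-related flags the running kernel was booted with:
cat /proc/cmdline | tr ' ' '\n' | grep -E 'cgroup|hierarchy|swapaccount'
```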
Migration playbook
The playbook processes agents first (one at a time, rolling), then the server last. Each node is drained, rebooted, verified, and uncordoned before moving to the next.
```yaml
- name: Disable swap and enable cgroup v2 on agent nodes
  hosts: agent
  become: yes
  serial: 1
  tasks:
    - name: Disable swap immediately
      ansible.builtin.command: swapoff -a
      changed_when: true

    - name: Remove swap entry from fstab
      ansible.builtin.lineinfile:
        path: /etc/fstab
        regexp: '^\s*/swapfile\s'
        state: absent

    - name: Delete swapfile
      ansible.builtin.file:
        path: /swapfile
        state: absent

    - name: Update bootargs — remove cgroup v1 flags and enable cgroup v2
      ansible.builtin.replace:
        path: /boot/firmware/ubuntuEnv.txt
        # NOTE: Use literal ' *' at end, NOT '\s*' — see Boot Configuration section
        regexp: 'cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd\.unified_cgroup_hierarchy=0 *'
        replace: "systemd.unified_cgroup_hierarchy=1 "

    - name: Drain node before reboot
      delegate_to: localhost
      become: no
      ansible.builtin.command: >
        kubectl drain {{ ansible_hostname }}
        --ignore-daemonsets --delete-emptydir-data
        --timeout=120s --force

    - name: Reboot into cgroup v2
      ansible.builtin.reboot:
        reboot_timeout: 300
        msg: "Rebooting for cgroup v2 + swap off"
        pre_reboot_delay: 5
        post_reboot_delay: 30

    - name: Restart k3s-agent to refresh certificates
      ansible.builtin.systemd:
        name: k3s-agent
        state: restarted

    - name: Wait for k3s-agent to be active
      ansible.builtin.shell: systemctl is-active k3s-agent
      register: k3s_status
      until: k3s_status.rc == 0
      retries: 18
      delay: 10
      changed_when: false

    - name: Uncordon node
      delegate_to: localhost
      become: no
      ansible.builtin.command: kubectl uncordon {{ ansible_hostname }}

    - name: Wait for node to be Ready
      delegate_to: localhost
      become: no
      ansible.builtin.shell: >
        kubectl get node {{ ansible_hostname }}
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
      register: node_ready
      until: node_ready.stdout == "True"
      retries: 30
      delay: 10
      changed_when: false

    - name: Verify cgroup v2 is active
      ansible.builtin.shell: stat -f --format=%T /sys/fs/cgroup
      register: cgroupfs
      changed_when: false
      failed_when: cgroupfs.stdout != "cgroup2fs"

    - name: Verify swap is off
      ansible.builtin.command: swapon --show
      register: swap_check
      changed_when: false
      failed_when: swap_check.stdout | length > 0

- name: Disable swap and enable cgroup v2 on server node
  hosts: server
  become: yes
  tasks:
    # Same tasks as above, but:
    # - service name is 'k3s' not 'k3s-agent'
    # - post_reboot_delay is 60 (server takes longer — etcd bootstrap)
    # - server is done last to avoid cascading agent failures
```
Post-migration: agent TLS cert refresh
After the server reboots, its dynamic listener TLS cert may change. Agents that booted before or concurrently with the server will cache the old CA and fail with x509: certificate signed by unknown authority. The playbook includes a Restart k3s-agent to refresh certificates step to handle this, but it's not always sufficient. See Part 3 for the full explanation and fix.
Verification
```sh
# On each node:
stat -f --format=%T /sys/fs/cgroup
# Expected: cgroup2fs

swapon --show
# Expected: (empty output)

free -h | grep Swap
# Expected: Swap:  0B  0B  0B
```
Known cosmetic issue
After migration, containerd v2.1 on k3s 1.34 produces an InvalidDiskCapacity warning on every kubelet start:
```
Warning  InvalidDiskCapacity  kubelet  invalid capacity 0 on image filesystem
```

This is caused by disable_snapshot_annotations = true in the auto-generated containerd config. It's cosmetic — disk capacity reporting works fine. It will resolve with a future k3s upgrade.
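To confirm the flag on an affected node, grep the generated containerd config. The exact file name under this directory can vary between k3s releases, so searching the whole directory is safer:

```sh
grep -rn disable_snapshot_annotations /var/lib/rancher/k3s/agent/etc/containerd/
```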
Part 3: k3s TLS Certificate Recovery
This is the single most disruptive issue on this cluster. After a server reboot (or power cycle), agents frequently get stuck with:
```
tls: failed to verify certificate: x509: certificate signed by unknown authority
```
Why it happens
k3s has two separate CA systems and a dynamic TLS certificate:
| Component | Path (server) | Purpose |
|---|---|---|
| Server CA | /var/lib/rancher/k3s/server/tls/server-ca.crt | Signs the API server’s serving certificate |
| Client CA | /var/lib/rancher/k3s/server/tls/client-ca.crt | Signs kubelet/kube-proxy client certs |
| Dynamic listener cert | /var/lib/rancher/k3s/server/tls/dynamic-cert.json | The actual TLS cert presented on port 6443, managed by rancher/dynamiclistener |
Agents cache the server CA and client CA locally:
| Component | Path (agent) |
|---|---|
| Cached server CA | /var/lib/rancher/k3s/agent/server-ca.crt |
| Cached client CA | /var/lib/rancher/k3s/agent/client-ca.crt |
The dynamic listener cert is stored in three places: a Kubernetes Secret (k3s-serving in kube-system), a local file cache, and in memory. On server startup, before etcd is available, the dynamic listener uses the file cache. If this cert doesn’t match what agents expect (because the cert was regenerated, or the agent has a stale CA from a previous boot), agents can’t verify the server and enter a TLS error loop.
The critical point: systemctl restart k3s-agent does not fix this. The agent reloads the same stale CA certs from disk. You must delete the cached CA files so the agent re-fetches them from the server’s /cacerts endpoint.
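If you suspect the dynamic listener cert itself rather than a stale agent cache, compare what the server actually presents on port 6443 against the copy stored in the k3s-serving secret. A quick sanity check run from the server node; note that tls.crt in the secret may contain a chain, in which case openssl reads the first certificate:

```sh
# Fingerprint of the cert currently served on 6443:
echo | openssl s_client -connect 10.0.0.9:6443 2>/dev/null | \
  openssl x509 -noout -fingerprint -sha256

# Fingerprint of the cert stored in the k3s-serving secret:
kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml -n kube-system \
  get secret k3s-serving -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -fingerprint -sha256
```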
Recovery procedure (stale cert cache)
This procedure fixes the common case where the server CA has NOT changed but agents have stale cached copies (e.g. after a server reboot or power cycle).
Step 1: Verify the server is up and the API is working:
```sh
# From the server node itself:
kubectl get nodes --kubeconfig /etc/rancher/k3s/k3s.yaml

# Or via ansible:
ansible server -i inventory.yml -m shell \
  -a "kubectl get nodes --kubeconfig /etc/rancher/k3s/k3s.yaml" \
  -b -e "ansible_become_pass=<pass>"
```

Step 2: On each broken agent, stop the agent, delete cached certs, start:

```sh
# Via ansible (one node at a time):
ansible <agent_ip> -i inventory.yml -m shell \
  -a "systemctl stop k3s-agent; \
      sleep 2; \
      rm -f /var/lib/rancher/k3s/agent/server-ca.crt \
            /var/lib/rancher/k3s/agent/client-ca.crt; \
      nohup systemctl start k3s-agent &" \
  -b -e "ansible_become_pass=<pass>"
```

Step 3: Wait 60-90 seconds, then verify:

```sh
kubectl get nodes
```
Verifying the CA match
If an agent is stuck and you're not sure whether it's a cert issue, compare the CA fingerprints:
```sh
# On the server — the authoritative CA:
openssl x509 -in /var/lib/rancher/k3s/server/tls/server-ca.crt \
  -noout -fingerprint -sha256

# On the agent — what it has cached:
openssl x509 -in /var/lib/rancher/k3s/agent/server-ca.crt \
  -noout -fingerprint -sha256

# What the server's /cacerts endpoint returns:
curl -sk https://<server_ip>:6443/cacerts | \
  openssl x509 -noout -fingerprint -sha256
```

If the agent's fingerprint doesn't match the server's, the cert is stale.
Long-term mitigation
| Approach | Difficulty | Effect |
|---|---|---|
| Custom CA certs (pre-created before first server start) | High (requires reinstall) | CA never changes across reboots |
| --tls-san on server config (see the sketch below) | Low | Reduces dynamic cert regeneration |
| Health watchdog with TLS detection | Medium | Auto-recovers agents (see Part 4) |
| Boot order: server first, agents after | Low | Avoids race condition |
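For the --tls-san mitigation, a minimal server config sketch. The SAN values are examples; use your own server IP and any DNS names you actually connect through:

```yaml
# /etc/rancher/k3s/config.yaml on the server
tls-san:
  - 10.0.0.9
  - k3s.home.lan   # example hostname, adjust to your environment
```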
Part 4: Health Watchdog
A systemd timer that runs every 5 minutes on each node and detects two failure modes:
- Missing kube-proxy iptables rules (common after power loss) — attempts a non-disruptive canary chain resync before falling back to a service restart
- Any API failure (stale certs, unauthorized, connection refused) — always clears cached CA certs before restarting
Key safety features:
- 10-minute cooldown: skips if the agent was restarted recently, preventing a death loop
- Server reachability gate: won’t restart the agent if the server API is unreachable (would just get bad certs)
Agent watchdog script
```bash
#!/bin/bash
# k3s agent health watchdog

SERVICE=k3s-agent
KUBECONFIG=/var/lib/rancher/k3s/agent/kubelet.kubeconfig
NODENAME=$(hostname)
SERVER_URL="https://10.0.0.1:6443"   # Replace with your server IP
COOLDOWN_SECONDS=600                 # 10 minutes
CERT_DIR=/var/lib/rancher/k3s/agent

# ---- Cooldown check ----
# If the agent was (re)started recently, don't interfere — give it
# time to stabilize.
ACTIVE_ENTER=$(systemctl show "$SERVICE" \
  --property=ActiveEnterTimestamp --value 2>/dev/null)
if [ -n "$ACTIVE_ENTER" ]; then
  ENTER_EPOCH=$(date -d "$ACTIVE_ENTER" +%s 2>/dev/null || echo 0)
  NOW_EPOCH=$(date +%s)
  AGE=$(( NOW_EPOCH - ENTER_EPOCH ))
  if [ "$AGE" -lt "$COOLDOWN_SECONDS" ]; then
    logger -t k3s-watchdog \
      "Agent started ${AGE}s ago (cooldown ${COOLDOWN_SECONDS}s) — skipping"
    exit 0
  fi
fi

if ! systemctl is-active --quiet "$SERVICE"; then
  logger -t k3s-watchdog "Agent is not active — skipping"
  exit 0
fi

# ---- Check 1: kube-proxy iptables rules ----
if [ "$(iptables-save 2>/dev/null | grep -c KUBE-SVC)" -eq 0 ]; then
  logger -t k3s-watchdog "kube-proxy iptables rules missing — deleting canary chains"
  iptables -t mangle -F KUBE-PROXY-CANARY 2>/dev/null
  iptables -t mangle -X KUBE-PROXY-CANARY 2>/dev/null
  iptables -t nat    -F KUBE-PROXY-CANARY 2>/dev/null
  iptables -t nat    -X KUBE-PROXY-CANARY 2>/dev/null
  iptables -t filter -F KUBE-PROXY-CANARY 2>/dev/null
  iptables -t filter -X KUBE-PROXY-CANARY 2>/dev/null
  sleep 40
  if [ "$(iptables-save 2>/dev/null | grep -c KUBE-SVC)" -eq 0 ]; then
    logger -t k3s-watchdog "Canary resync failed — will restart in API check"
  else
    logger -t k3s-watchdog "kube-proxy iptables rules restored via canary resync"
  fi
fi

# ---- Check 2: API reachability ----
# Use "get node $NODENAME" not "get nodes" — the kubelet kubeconfig
# (system:node:<name>) only has RBAC to read its own node object.
API_ERR=$(kubectl --kubeconfig="$KUBECONFIG" get node "$NODENAME" 2>&1)
API_RC=$?
if [ $API_RC -eq 0 ]; then
  exit 0
fi

logger -t k3s-watchdog "API check failed for $NODENAME (rc=$API_RC): $API_ERR"

# ---- Server reachability gate ----
if ! curl -sk --max-time 5 "$SERVER_URL/cacerts" >/dev/null 2>&1; then
  logger -t k3s-watchdog "Server API not reachable — skipping restart"
  exit 0
fi

# ---- Restart with cert cleanup ----
# Always clear cached CA certs. Stale certs manifest as various errors:
#   - x509: certificate signed by unknown authority
#   - connection refused (local proxy can't auth to server)
#   - 401 Unauthorized (server rejects stale client cert)
logger -t k3s-watchdog "Clearing cached CA certs and restarting $SERVICE"
systemctl stop "$SERVICE"
sleep 2
rm -f "$CERT_DIR/client-ca.crt" "$CERT_DIR/server-ca.crt"
systemctl start "$SERVICE"
logger -t k3s-watchdog "Agent restarted with fresh certs"
```
Server watchdog script
The server version only checks kube-proxy iptables. It intentionally does not check API auth or restart the k3s server, because restarting the server would invalidate all agent tokens and cause a cascading failure.
```bash
#!/bin/bash
# k3s server health watchdog

SERVICE=k3s

if ! systemctl is-active --quiet "$SERVICE"; then
  exit 0
fi

if [ "$(iptables-save 2>/dev/null | grep -c KUBE-SVC)" -eq 0 ]; then
  logger -t k3s-watchdog "kube-proxy iptables rules missing on server"
  iptables -t mangle -F KUBE-PROXY-CANARY 2>/dev/null
  iptables -t mangle -X KUBE-PROXY-CANARY 2>/dev/null
  iptables -t nat    -F KUBE-PROXY-CANARY 2>/dev/null
  iptables -t nat    -X KUBE-PROXY-CANARY 2>/dev/null
  iptables -t filter -F KUBE-PROXY-CANARY 2>/dev/null
  iptables -t filter -X KUBE-PROXY-CANARY 2>/dev/null
  sleep 40
  if [ "$(iptables-save 2>/dev/null | grep -c KUBE-SVC)" -eq 0 ]; then
    logger -t k3s-watchdog "Canary resync failed — restarting $SERVICE as last resort"
    systemctl restart "$SERVICE"
  fi
fi
```
Deployment playbook
The watchdog is deployed via Ansible as an inline copy task — the script is embedded directly in the playbook rather than maintained as a separate file:
```yaml
- name: Deploy k3s health watchdog on agent nodes
  hosts: agent
  become: yes
  tasks:
    - name: Install k3s health check script
      ansible.builtin.copy:
        dest: /usr/local/bin/k3s-health-check
        mode: "0755"
        content: |
          #!/bin/bash
          # (full script from above)

    - name: Install systemd timer for health check
      ansible.builtin.copy:
        dest: /etc/systemd/system/k3s-health.timer
        content: |
          [Unit]
          Description=k3s health watchdog timer

          [Timer]
          OnBootSec=3min
          OnUnitActiveSec=5min
          AccuracySec=30s

          [Install]
          WantedBy=timers.target

    - name: Install systemd service for health check
      ansible.builtin.copy:
        dest: /etc/systemd/system/k3s-health.service
        content: |
          [Unit]
          Description=k3s health watchdog
          After=k3s-agent.service

          [Service]
          Type=oneshot
          ExecStart=/usr/local/bin/k3s-health-check

    - name: Reload systemd and enable timer
      ansible.builtin.systemd:
        daemon_reload: yes

    - name: Enable and start the health check timer
      ansible.builtin.systemd:
        name: k3s-health.timer
        enabled: yes
        state: started
```
Checking watchdog logs
```sh
# View recent watchdog actions:
journalctl -t k3s-watchdog --no-pager --since '1 hour ago'

# Check timer status:
systemctl list-timers k3s-health.timer
```
Part 5: Disk Maintenance
With 29 GB SD cards, disk space is a constant concern. The main consumers are journal logs, containerd rotated logs, and monitoring data.
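Before cleaning anything, a quick look at where the space is actually going (paths assume the default k3s layout):

```sh
df -h /
journalctl --disk-usage
du -sh /var/log /var/lib/rancher/k3s/agent/containerd 2>/dev/null
```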
Journal size limits
By default, journald can grow to 2.7-2.8 GB per node. Set a permanent limit in /etc/systemd/journald.conf.d/size.conf:
```ini
[Journal]
SystemMaxUse=256M
```

Deploy and apply:
```sh
ansible all -i inventory.yml -m copy \
  -a "dest=/etc/systemd/journald.conf.d/size.conf content='[Journal]\nSystemMaxUse=256M\n'" \
  -b -e "ansible_become_pass=<pass>"

ansible all -i inventory.yml -m shell \
  -a "journalctl --vacuum-size=256M && systemctl restart systemd-journald" \
  -b -e "ansible_become_pass=<pass>"
```
Containerd log rotation
Containerd creates rotated logs (*.gz files) in /var/lib/rancher/k3s/agent/containerd/. These accumulate silently. Clean them periodically:
```sh
ansible all -i inventory.yml -m shell \
  -a "find /var/lib/rancher/k3s/agent/containerd/ -name '*.gz' -delete" \
  -b -e "ansible_become_pass=<pass>"
```
Syslog rotation
Ubuntu's rsyslog can also accumulate large rotated logs. Configure aggressive rotation:
```
/var/log/syslog /var/log/kern.log /var/log/auth.log {
    daily
    rotate 3
    compress
    delaycompress
    missingok
    notifempty
    maxsize 50M
}
```
Loki TSDB storage
Loki's TSDB shipper writes index and cache data. If configured with an emptyDir volume, this writes to the node's disk and causes disk pressure. Move it to NFS:
```yaml
# In the Loki configmap, change tsdb_shipper paths:
schema_config:
  configs:
    - from: "2024-09-01"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index   # was /local/tsdb-index
    cache_location: /loki/tsdb-cache           # was /local/tsdb-cache
  filesystem:
    directory: /loki/chunks
```

Remove the emptyDir volume from the StatefulSet and make sure /loki/ is backed by the NFS PVC.
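A sketch of the matching StatefulSet change. The volume and claim names are illustrative; map them to whatever your Loki deployment actually uses:

```yaml
# In the StatefulSet pod spec: replace the emptyDir with the NFS-backed PVC
# so /loki stops consuming node disk.
volumes:
  - name: loki-storage              # was: emptyDir: {}
    persistentVolumeClaim:
      claimName: loki-nfs           # hypothetical claim name, use your NFS-backed PVC
containers:
  - name: loki
    volumeMounts:
      - name: loki-storage
        mountPath: /loki
```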
Part 6: Swap Removal
Swap is disabled as part of the cgroup v2 migration (see Part 2), but if you need to do it separately:
```sh
# Immediate:
swapoff -a

# Permanent — remove from fstab:
sed -i '/\/swapfile/d' /etc/fstab

# Reclaim disk space:
rm -f /swapfile   # Frees ~2 GB per node
```

The RK1 modules have 8 GB RAM, which is enough for the workloads running on this cluster. With swap enabled on cgroup v2, kubelet produces tmpfs-noswap warnings because it can't guarantee that memory-backed volumes (Kubernetes secrets, emptyDirs) stay in RAM.
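To check whether a node is still emitting the warning (k3s runs the kubelet in-process, so kubelet messages land in the k3s-agent journal; the exact wording may vary by version):

```sh
journalctl -u k3s-agent --since '1 hour ago' | grep -i noswap
```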
Part 7: BMC Serial Console Recovery
When a node won't boot (e.g. due to a corrupted ubuntuEnv.txt — see the recovery scenario below), you can recover it through the Turing Pi BMC's serial console without physically accessing the board.
Prerequisites
The tpi CLI tool must be installed and configured to talk to the BMC's IP address on your LAN.
Accessing the serial console
```sh
# Read serial output from slot 2 (node2):
tpi uart -n 2 get

# Send a command to the serial console:
tpi uart -n 2 set -c 'ls /boot/firmware/'
```
Recovery scenario: corrupted ubuntuEnv.txt
This happened when an Ansible regex consumed a newline, merging fdtfile=rk3588-turing-rk1.dtb into the bootargs= line. U-Boot couldn't find the DTB and the node dropped to a U-Boot shell.
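Abbreviated, the corruption looked like this (UUID and console arguments elided):

```
# Broken: the regex consumed the newline, so U-Boot never saw an fdtfile line
bootargs=root=UUID=<uuid> ... systemd.unified_cgroup_hierarchy=1 fdtfile=rk3588-turing-rk1.dtb

# Correct: bootargs and fdtfile on separate lines
bootargs=root=UUID=<uuid> ... systemd.unified_cgroup_hierarchy=1
fdtfile=rk3588-turing-rk1.dtb
```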
Step 1: Power cycle the node and catch the U-Boot prompt:
```sh
tpi power -n 2 off
sleep 2
tpi power -n 2 on

# Watch serial output:
tpi uart -n 2 get
```

Step 2: If the node drops to a U-Boot shell, boot manually:

```sh
# Load the kernel, DTB, and initrd from the SD card:
tpi uart -n 2 set -c 'load mmc 1:1 ${kernel_addr_r} /boot/vmlinuz'
tpi uart -n 2 set -c 'load mmc 1:1 ${fdt_addr_r} /boot/dtbs/5.10.160-rockchip/rockchip/rk3588-turing-rk1.dtb'
tpi uart -n 2 set -c 'load mmc 1:1 ${ramdisk_addr_r} /boot/initrd.img'
tpi uart -n 2 set -c 'booti ${kernel_addr_r} ${ramdisk_addr_r} ${fdt_addr_r}'
```

Step 3: Once Linux boots, SSH in and fix ubuntuEnv.txt:

```sh
sudo vi /boot/firmware/ubuntuEnv.txt
# Ensure bootargs and fdtfile are on separate lines
sudo reboot
```
Power management
```sh
# Power cycle a single slot:
tpi power -n <slot> off && sleep 2 && tpi power -n <slot> on

# Power cycle all slots:
tpi power -n 1 off && tpi power -n 2 off && tpi power -n 3 off && tpi power -n 4 off
sleep 2
tpi power -n 1 on && tpi power -n 2 on && tpi power -n 3 on && tpi power -n 4 on
```
Part 8: Prometheus Resource Tuning
On a cluster with ~265k active series, 59 scrape targets, and a 30s scrape interval, Prometheus uses approximately 1 GB of memory at steady state.
Memory request
The default memory request from many Helm charts (512Mi) is too low. Set it to match actual usage to prevent OOM kills and ensure the scheduler places the pod on a node with sufficient capacity:
```yaml
# In the Prometheus CR or manifest:
resources:
  requests:
    cpu: 200m
    memory: 1Gi
  limits:
    memory: 2Gi
```
Monitoring resource usage
To right-size Prometheus, query its own metrics:
```
# Current RSS:
process_resident_memory_bytes{job="prometheus"}

# Active series count:
prometheus_tsdb_head_series

# Scrape target count:
count(up)

# Ingestion rate:
rate(prometheus_tsdb_head_samples_appended_total[5m])
```
Part 9: Stale Pod Cleanup
After node reboots, agent restarts, or operator reconciliation conflicts, pods can be left in Error or Completed state. These consume API server resources and clutter kubectl get pods output.
Cleaning up
```sh
# Delete all Error pods cluster-wide:
kubectl get pods -A --field-selector=status.phase=Failed \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' | \
  while read ns name; do kubectl delete pod -n "$ns" "$name"; done

# Delete all Completed (Succeeded) pods:
kubectl get pods -A --field-selector=status.phase=Succeeded \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' | \
  while read ns name; do kubectl delete pod -n "$ns" "$name"; done
```

Or more concisely:
```sh
kubectl delete pods -A --field-selector=status.phase=Failed
kubectl delete pods -A --field-selector=status.phase=Succeeded
```
Part 10: KEDA and Prometheus Conflicts
KEDA's ScaledObjects can conflict with Kubernetes operators that manage the same workloads.
Prometheus Operator conflict
If a KEDA ScaledObject targets the prometheus-prometheus StatefulSet directly, KEDA will scale it (e.g. to 4 replicas), but the Prometheus Operator's Prometheus CR has replicas: 1. The operator reconciles constantly, trying to scale back down, while KEDA keeps scaling up. This creates crash-looping pods with volume mount failures because the PVCs don't exist for the extra replicas.
Fix: Delete the KEDA ScaledObject for Prometheus. Let the Prometheus Operator manage the replica count through its CR. If you need Prometheus autoscaling, do it through the Prometheus CR’s replicas field or use a VPA instead.
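A sketch of the cleanup. The namespace and ScaledObject name here are illustrative; list first to find the one that targets the Prometheus StatefulSet:

```sh
# Find the offending ScaledObject:
kubectl get scaledobjects -A

# Delete it (name and namespace are examples):
kubectl delete scaledobject prometheus-scaler -n monitoring
```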
Prometheus service address
KEDA ScaledObjects that use Prometheus as a trigger source need the correct service address. The Prometheus Operator creates a service called prometheus-operated (headless) and you typically create a ClusterIP service with a shorter name. Make sure the serverAddress in ScaledObject triggers matches your actual service:
```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      # NOT: prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
```
Part 11: Token and CA Rotation
Section titled “Part 11: Token and CA Rotation”The K10 token format
When k3s first starts, it generates a node-join token stored at /var/lib/rancher/k3s/server/token:
```
K10<ca_hash>::server:<password>
```

The K10 prefix tells the agent to verify the server's CA fingerprint against <ca_hash> during bootstrap. The <password> is the actual authentication credential.
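The hash portion should correspond to the SHA-256 of the CA bundle the server hands to joining agents, so you can sanity-check a token against a running server. A quick check, assuming default paths and that this assumption about the hash input holds for your k3s version:

```sh
# Hash embedded in the token:
cat /var/lib/rancher/k3s/server/token

# SHA-256 of the CA bundle served to joining agents; the hex digest should
# match the <ca_hash> portion of the token above
curl -sk https://10.0.0.9:6443/cacerts | sha256sum
```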
What breaks after a server reinstall
If you reinstall or reset the k3s server (e.g. k3s-uninstall.sh then fresh install), a new CA is generated. But:
- The node-token file may keep the old CA hash
- The agent service files (created by Ansible or the installer) still have the old K10<old_hash>::server:<password> token
- The agent's private keys and client certs (under /var/lib/rancher/k3s/agent/) were signed by the old CA
The result is agents that appear to connect intermittently but fail authentication. The error messages vary depending on timing:
- x509: certificate signed by unknown authority (TLS handshake)
- 401 Unauthorized (server rejects old client cert)
- connection refused on 127.0.0.1:6443 (local proxy can't authenticate to server)
Simply deleting cached CA certs (server-ca.crt, client-ca.crt) does not fix this because the agent re-bootstraps using the old token and old private keys, getting certs signed by the old CA again.
Diagnosis
Compare the CA timestamps:
```sh
# Server's current CA:
openssl x509 -noout -issuer \
  < /var/lib/rancher/k3s/server/tls/server-ca.crt
# e.g. issuer=CN = k3s-server-ca@1755079458

# Agent's cached CA:
openssl x509 -noout -issuer \
  < /var/lib/rancher/k3s/agent/server-ca.crt
# e.g. issuer=CN = k3s-server-ca@1709313234   <-- MISMATCH

# Agent's client cert (who signed it?):
openssl x509 -noout -issuer \
  < /var/lib/rancher/k3s/agent/client-kubelet.crt
# Should match the server's client-ca.crt
```

If the @timestamp values differ, the CA was rotated and agents need a full reinstall.
Fix: strip the CA pin and reinstall agents
Step 1: Update the token in your Ansible inventory to use only the password (no K10 prefix):
```yaml
vars:
  # Do NOT use the K10<hash> form. It pins to a specific CA and
  # breaks when the server CA rotates.
  token: "<your_server_password>"
```

You can find the password portion from the server's token file:
```sh
# On the server:
cat /var/lib/rancher/k3s/server/token
# K10<hash>::server:<password>   <-- use just the <password> part
```

Step 2: Uninstall and reinstall each agent using Ansible (one at a time):
```sh
# Uninstall:
ansible-playbook -i inventory.yml reset.yml \
  --limit <agent_ip> -e "ansible_become_pass=<pass>"

# Reinstall:
ansible-playbook -i inventory.yml site.yml \
  --limit <agent_ip> -e "ansible_become_pass=<pass>"
```

Step 3: Verify the agent has the correct CA:
```sh
ansible <agent_ip> -i inventory.yml -m shell \
  -a "openssl x509 -noout -issuer \
      < /var/lib/rancher/k3s/agent/server-ca.crt" \
  -b -e "ansible_become_pass=<pass>"
```
Troubleshooting Quick Reference
| Symptom | Likely cause | Fix |
|---|---|---|
| Agent: x509: certificate signed by unknown authority | Stale cached CA after server reboot | Stop agent, rm CA certs, start agent (Part 3) |
| Agent: x509 persists after clearing certs | CA was rotated (server reinstall); token has old CA hash | Full agent reinstall with stripped token (Part 11) |
| Agent: Unauthorized on API calls | Stale client certs signed by old CA | Clear certs and restart; if it persists, full reinstall (Part 11) |
| Watchdog creating a restart death loop | No cooldown + narrow TLS detection | Deploy updated watchdog with cooldown and server gate (Part 4) |
| Missing kube-proxy iptables rules | Power loss wiped in-memory iptables | Delete KUBE-PROXY-CANARY chains, or restart agent |
| InvalidDiskCapacity warning | Containerd v2.1 + k3s 1.34 cosmetic bug | Ignore (resolves with k3s upgrade) |
| tmpfs-noswap warning | Swap is enabled on cgroup v2 | swapoff -a, remove from fstab |
| CgroupV1 warning on kubelet start | Node still on cgroup v1 | Run the cgroup v2 migration (Part 2) |
| Node won’t boot after config change | Corrupted ubuntuEnv.txt | BMC serial console recovery (Part 7) |
| Disk pressure on agent nodes | Journal/containerd logs filling SD card | Vacuum journals, clean rotated logs (Part 5) |
| CrowdSec banning your IP for http-probing | Dashboard UI 404s exceeding scenario threshold | Parser-level whitelist (see Traefik guide) |
| Need to check/manage CrowdSec without cscli | N/A | Use the Security Dashboard at security-k3s.example.com (see Traefik guide) |
| Prometheus pods crash-looping | KEDA vs Prometheus Operator replica conflict | Delete the KEDA ScaledObject for Prometheus |
File Reference
```
ansible-playbooks/my-playbooks/
├── inventory.yml             # Cluster inventory (server + 3 agents)
├── k3s-agent-health.yml      # Health watchdog deployment
├── cgroup-v2-swap-off.yml    # Cgroup v2 migration + swap removal
└── site.yml                  # Wrapper that imports playbooks

/boot/firmware/               # Per-node boot config
├── ubuntuEnv.txt             # Kernel cmdline, DTB, overlays
├── boot.cmd                  # U-Boot script source
└── boot.scr                  # Compiled U-Boot script

/var/lib/rancher/k3s/
├── server/tls/               # Server-side TLS (only on node1)
│   ├── server-ca.crt         # Server CA — signs serving certs
│   ├── client-ca.crt         # Client CA — signs kubelet certs
│   └── dynamic-cert.json     # Dynamic listener cert (file cache)
└── agent/                    # Agent-side (node2-node4)
    ├── server-ca.crt         # Cached server CA (delete to refresh)
    ├── client-ca.crt         # Cached client CA (delete to refresh)
    └── kubelet.kubeconfig    # Kubelet credentials

/usr/local/bin/k3s-health-check   # Watchdog script (deployed by Ansible)
/etc/systemd/system/
├── k3s-health.timer              # 5-minute watchdog timer
└── k3s-health.service            # Oneshot service for watchdog
```