Traefik on k3s: Custom Deployment, Plugins, Middlewares, and Cloudflare Tunnel

A complete guide to replacing k3s’s built-in Traefik with a fully custom deployment on a 4-node ARM64 homelab cluster. The built-in Traefik is fine for simple setups, but it doesn’t support local plugins, has limited middleware configuration, and doesn’t expose the level of control needed for things like bot detection, request body decompression, or per-route rate limiting.

This guide covers the full setup: disabling the built-in Traefik, deploying a custom one as a raw Deployment manifest, writing and packaging Traefik Go plugins as ConfigMaps, configuring the middleware chain, managing TLS certificates via Cloudflare DNS challenge, routing traffic through Cloudflare Tunnel, autoscaling with KEDA, and piping access logs + traces into the monitoring stack.


All HTTP traffic enters through Cloudflare’s edge network, passes through a Cloudflare Tunnel (cloudflared running in the cluster), and hits the custom Traefik deployment in the traefik namespace. Traefik terminates TLS (ACME certs via Cloudflare DNS challenge), runs the global middleware chain (sentinel → security-headers), then routes to per-route middlewares and backend services.

*(architecture diagram, rendered with d2)*
| Component | Version | Image |
|---|---|---|
| Traefik | v3.6.8 | traefik:v3.6.8 |
| cloudflared | 2026.2.0 | cloudflare/cloudflared:2026.2.0 |
| KEDA | (cluster-wide) | (already deployed) |

k3s ships with Traefik as a bundled Helm chart. It auto-deploys on the server node and manages its own CRDs. To run a custom Traefik, the built-in one must be fully disabled — otherwise you get two Traefik instances fighting over the same IngressRoutes.

ansible-playbooks/my-playbooks/disable-builtin-traefik.yml
```yaml
---
- name: Disable k3s built-in Traefik and ServiceLB on server
  hosts: server
  become: yes
  tasks:
    - name: Add disable directives to k3s config.yaml
      ansible.builtin.blockinfile:
        path: /etc/rancher/k3s/config.yaml
        marker: "# {mark} ANSIBLE MANAGED - disable built-in addons"
        block: |
          disable:
            - traefik
            - servicelb
        create: no
      register: config_changed

    - name: Remove k3s bundled traefik manifest files
      ansible.builtin.file:
        path: "{{ item }}"
        state: absent
      loop:
        - /var/lib/rancher/k3s/server/manifests/traefik.yaml
        - /var/lib/rancher/k3s/server/static/charts/traefik-crd-38.0.201+up38.0.2.tgz
        - /var/lib/rancher/k3s/server/static/charts/traefik-38.0.201+up38.0.2.tgz
      register: manifests_removed

    - name: Restart k3s to pick up config change
      ansible.builtin.systemd:
        name: k3s
        state: restarted
        daemon_reload: yes
      when: config_changed.changed

    - name: Wait for k3s API to be ready after restart
      ansible.builtin.wait_for:
        port: 6443
        host: "{{ ansible_host }}"
        delay: 10
        timeout: 120
      when: config_changed.changed
```

Run it:

```sh
ansible-playbook -i inventory.yml \
  ansible-playbooks/my-playbooks/disable-builtin-traefik.yml \
  --become --ask-become-pass
```

Safe to re-run (idempotent). The playbook also removes stale chart tarballs from k3s’s static manifests directory — without this, k3s may re-deploy the built-in Traefik on restart even with disable set.


Traefik’s Kubernetes CRD provider needs its own CRD definitions (IngressRoute, Middleware, TLSOption, etc.) and RBAC permissions. These are separate from the Traefik Deployment itself and must be applied first.

```sh
# Apply Traefik CRDs (one-time, or on Traefik version upgrades)
kubectl apply -f crds/kubernetes-crd-definition-v1.yml --server-side
kubectl apply -f crds/kubernetes-crd-rbac.yml
```

The CRD file is large (~3.5 MB) and requires --server-side due to the annotation size limit. RBAC grants the Traefik ServiceAccount read access to IngressRoutes, Middlewares, TLSOptions, Services, Secrets, EndpointSlices, and related resources across both traefik.io and the legacy traefik.containo.us API groups.


The entire Traefik deployment lives in a single manifest: services/traefik.yaml. It contains a ServiceAccount, ClusterRole, ClusterRoleBinding, LoadBalancer Service, Deployment, IngressClass, and PodDisruptionBudget.

Five entrypoints handle different traffic types:

| Entrypoint | Address | Protocol | Purpose |
|---|---|---|---|
| web | :8000/tcp | HTTP | Redirect to HTTPS (unused behind tunnel) |
| websecure | :8443 | HTTPS + HTTP/3 + QUIC | All production traffic |
| metrics | :8082/tcp | HTTP | Prometheus metrics scrape endpoint |
| traefik | :9000/tcp | HTTP | Dashboard API + health checks (/ping) |
| jvb-udp | :10000/udp | UDP | Jitsi Videobridge media |

The websecure entrypoint is the workhorse. Key settings:

```yaml
args:
  - "--entrypoints.websecure.address=:8443"
  - "--entrypoints.websecure.http.tls=true"
  - "--entrypoints.websecure.http.tls.certResolver=cloudflare"
  - "--entrypoints.websecure.http3=true"
  - "--entrypoints.websecure.http3.advertisedport=443"
  - "--entrypoints.websecure.http2.maxConcurrentStreams=512"
  # Global middlewares applied to ALL websecure requests
  - "--entrypoints.websecure.http.middlewares=traefik-sentinel@kubernetescrd,traefik-security-headers@kubernetescrd"
```

HTTP/3 is enabled with advertisedport=443 because the container listens on 8443 but the LoadBalancer Service maps port 443 → 8443. Without the advertised port, clients would try QUIC on port 8443 and fail.

```yaml
args:
  - "--entrypoints.websecure.transport.respondingTimeouts.readTimeout=60s"
  - "--entrypoints.websecure.transport.respondingTimeouts.writeTimeout=0s"
  - "--entrypoints.websecure.transport.respondingTimeouts.idleTimeout=180s"
  - "--entrypoints.websecure.transport.lifeCycle.graceTimeOut=30s"
  - "--entrypoints.websecure.transport.lifeCycle.requestAcceptGraceTimeout=5s"
```

writeTimeout=0s (disabled) is intentional. Matrix (Synapse), Jitsi, and LiveKit all use long-lived WebSocket connections. A non-zero write timeout would kill WebSocket connections that don’t send data within the timeout window. The tradeoff is that slowloris-style attacks against WebSocket endpoints aren’t mitigated at the Traefik layer — but Sentinel’s tarpit action and Cloudflare’s DDoS protection handle that upstream.

```yaml
args:
  - "--entrypoints.websecure.forwardedHeaders.trustedIPs=173.245.48.0/20,103.21.244.0/22,..."
```

All Cloudflare IPv4 and IPv6 ranges are listed as trusted IPs. This tells Traefik to trust X-Forwarded-For headers from these IPs, which is necessary because Cloudflare Tunnel connects from Cloudflare edge IPs. Without this, X-Forwarded-For would be stripped and the sentinel plugin would see the cloudflared pod IP instead of the real client IP.

```yaml
env:
  - name: GOMAXPROCS
    value: "2"
  - name: GOMEMLIMIT
    value: "900MiB"
```

On ARM64 homelab nodes with 4 cores, limiting GOMAXPROCS to 2 prevents Traefik from consuming all CPU cores. GOMEMLIMIT at 900MiB (with a 1024Mi limit) gives the Go GC a soft target to aim for, reducing OOM kills from GC pressure spikes.

```yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
```

The root filesystem is read-only. Writable paths are provided via volume mounts: /ssl-certs-2 (PVC for ACME certs), /tmp (emptyDir), /plugins-local/ (ConfigMap mounts for plugins), /plugins-storage (emptyDir for remote plugin cache), /blocklists (ConfigMap for IPsum blocklist).

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: traefik
          topologyKey: kubernetes.io/hostname
```

With 2 replicas, the anti-affinity preference spreads them across different nodes. It’s preferred not required because on a 4-node cluster with other workloads, there might not always be two nodes available.

The PDB ensures at least 1 replica is always available during voluntary disruptions (node drains, rolling updates):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: traefik-pdb
  namespace: traefik
spec:
  minAvailable: 1
```

Separately, the Traefik container in the Deployment gets a pre-stop hook:

```yaml
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
```

The 10-second pre-stop sleep gives the Service endpoints time to de-register from kube-proxy before the pod starts shutting down. Without this, in-flight requests can hit a pod that’s already draining.


Traefik supports two types of plugins: remote (fetched from GitHub on startup) and local (mounted from the filesystem). Local plugins use Traefik’s Yaegi Go interpreter — you write standard Go code, and Traefik interprets it at runtime. No compilation step needed.

  1. Plugin source goes into /plugins-local/src/<moduleName>/ inside the Traefik container
  2. The module must have go.mod, .traefik.yml, and the Go source file
  3. Traefik is told about the plugin via --experimental.localPlugins.<name>.moduleName=<moduleName>
  4. A Middleware CRD references the plugin by name under spec.plugin.<name>

Since Traefik runs with readOnlyRootFilesystem: true, the plugin files are packaged as ConfigMaps and mounted as volumes.

Plugin 1: Sentinel (bot detection + IP resolution + IPsum blocklist + rule engine)

Sentinel is a ~1843-line Yaegi local plugin that provides the entire inline security layer. It replaces the standalone realclientip plugin and the previously-used CrowdSec Bouncer remote plugin, combining IP resolution, heuristic bot detection, IPsum threat intelligence blocklist enforcement, and a Cloudflare WAF-inspired expression-based firewall rule engine into a single middleware.

8-step request flow:

  1. IP Resolution: Resolve real client IP from trusted headers (Cf-Connecting-Ip > XFF right-to-left > RemoteAddr), set X-Real-Client-Ip header
  2. GeoIP Country Resolution: Check Cf-Ipcountry header first, fall back to GeoIP MMDB lookup (DB-IP free country database), set X-Geo-Country header
  3. Allowlist Check: If IP in allowedIPs config → pass immediately (skip all checks)
  4. IPsum Blocklist Check: 19,621+ IPs loaded from /blocklists/ipsum.txt (CronJob refreshes daily). If IP matched → 403 Forbidden with X-Blocked-By: sentinel-blocklist
  5. Heuristic Bot Scoring: 9 signals accumulate a score per request
  6. Rule Engine: Expression-based firewall rules evaluated top-to-bottom by priority. First terminating action (allow/block/tarpit) wins; non-terminating actions (score/log/tag) accumulate
  7. Threshold Check: If cumulative score >= blockThreshold (100) → 403 Forbidden
  8. Response Intercept: Wraps upstream responses to style error pages with block info

Scoring signals:

| Signal | Score | Rationale |
|---|---|---|
| Scanner UA substring match | +100 | sqlmap, nikto, nuclei, zgrab, etc. — one match is enough to block |
| Honeypot path match | +100 | /.env, /.git/HEAD, /wp-login.php, etc. — no legitimate client requests these |
| Empty User-Agent | +40 | Most real browsers always send a UA |
| Missing Accept header | +30 | Browsers always send Accept |
| HTTP/1.0 protocol | +25 | Almost no modern client uses HTTP/1.0 |
| Missing Accept-Language | +20 | Browsers send this; most bots don't |
| Missing Accept-Encoding | +15 | Browsers send this |
| Connection: close with HTTP/1.1 | +10 | Unusual for real clients |
| Per-IP rate exceeded (>30 req/s) | +30 | Sliding window rate tracker per IP |

A request with a known scanner UA (+100) gets blocked immediately. A request missing all browser headers also gets blocked: no UA (+40), no Accept (+30), and no Accept-Language (+20) is only 90, but missing Accept-Encoding (+15) brings it to 105, over the threshold of 100. The per-IP rate tracker uses a sliding window with background cleanup to prevent memory leaks.

Rule engine (expression-based firewall):

The rule engine uses a concise expression syntax with short field names, a recursive descent parser, and a tokenizer — all in pure Go stdlib:

```
path contains "/admin" and country eq "CN"
(ip in {1.2.3.4 5.6.7.8/24}) or (ua matches "^curl/")
not ip in {10.0.0.0/8} and score ge 80
host eq "logpush-k3s.erfi.io" and not header["X-Logpush-Secret"] eq "..."
```

Available fields (short names preferred, long CF-style names still work for backward compat):

| Field | Long alias | Source |
|---|---|---|
| ip | ip.src | Resolved client IP |
| country | ip.src.country | Cf-Ipcountry header, falls back to GeoIP MMDB (DB-IP free country database) |
| host | http.host | Host header |
| method | http.request.method | Request method |
| path | http.request.uri.path | URI path |
| query | http.request.uri.query | Query string |
| uri | http.request.uri | Full URI (path + query) |
| ua | http.user_agent | User-Agent header |
| header["X"] | http.request.headers["X"] | Any header by name |
| ssl | | Boolean (TLS) |
| score | sentinel.score | Computed bot score |
| proto | | HTTP protocol version |

Operators: eq, ne, contains, matches (regex), in {set} (IP/CIDR/string), gt, ge, lt, le. Logical: and, or, not, parentheses.

Actions: allow (bypass all, terminates), block (403, terminates), tarpit (slow-drip chunked response, 2s intervals, 5min max, terminates), score:N (add N to bot score, continues), log (log only, continues), tag:name (add header tag, continues).

Rules are stored as a JSON array string in the middleware CRD rules field. Example deployed rules:

| ID | Priority | Expression | Action |
|---|---|---|---|
| r1 | 1 | `ip eq "195.240.81.42"` | allow (owner IP bypass) |
| r6 | 2 | `host eq "logpush-k3s.erfi.io" and not header["X-Logpush-Secret"] eq "..."` | block (deny without secret) |
| r2 | 10 | `path contains "/.git" and not ip eq "195.240.81.42"` | block |
| r3 | 20 | `country in {CN RU}` | score:30 |
| r4 | 30 | `ua matches "^curl/" and header["Accept"] eq ""` | block |
| r5 | 100 | `score ge 150` | tarpit |

The Security Dashboard provides a guided expression builder UI for creating rules:

  • field dropdown (12 fields including Protocol)
  • operator dropdown (dynamic per field type) and value input
  • AND/OR combinator and a NOT toggle per condition
  • nested condition groups for mixed AND/OR logic (e.g., (a and b) or (c and d))
  • condition chips with remove buttons

The builder auto-generates the expression string and supports bidirectional sync: editing existing rules reverse-parses expressions back into the builder, including groups and negated conditions.

Implementation constraints (Yaegi runtime):

  • Pure Go stdlib only — no external dependencies, no cgo, no unsafe
  • Cannot use html/template — uses manual string building for HTML error pages
  • Cannot use Go interfaces for method dispatch — Yaegi panics with reflect: call of reflect.Value.SetBool on interface Value. The AST uses a single ExprNode struct with exprKind type tag and standalone evalExpr() function instead of an Expr interface with concrete types
  • Returns (string, int, bool, error) tuple from eval instead of interface{} to avoid Yaegi reflection issues
  • Manual JSON parser for rules (no encoding/json dependency for Yaegi safety)

IPsum blocklist:

IPsum is an open threat intelligence feed aggregating 10+ blocklist sources. A CronJob runs daily, downloads the latest list, and stores it in a ConfigMap mounted into Traefik at /blocklists/ipsum.txt. The plugin loads the blocklist into an in-memory map on startup and reloads periodically (configurable via blocklistReloadSeconds, default 300s). Currently 19,621+ IPs loaded.

The CronJob resources live in services/sentinel/ipsum-cronjob.yaml: ServiceAccount, Role, RoleBinding (ConfigMap write access in traefik namespace), Python script ConfigMap, and the CronJob itself.

GeoIP country lookup:

Sentinel includes a pure Go MMDB reader (~300 lines, stdlib only) for resolving client IPs to ISO country codes without Cloudflare. Resolution order: Cf-Ipcountry header first, then GeoIP MMDB fallback. The resolved country is set as the X-Geo-Country request header on every request (visible in access logs and Loki structured metadata).

The database is DB-IP free country (dbip-country-lite-YYYY-MM.mmdb.gz, ~7MB), downloaded by an init container on pod start and stored in an emptyDir volume at /geoip/country.mmdb. Configure via geoipFile in the middleware CRD. The MMDB reader supports 24/28/32-bit record sizes, IPv4-in-IPv6 subtree caching, and is compatible with both MaxMind GeoLite2 and DB-IP formats.

Packaging as ConfigMap:

The plugin source, go.mod, and .traefik.yml are inlined in a ConfigMap:

```yaml
# The ConfigMap is generated from middleware/sentinel.go
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-plugin-sentinel
  namespace: traefik
data:
  sentinel.go: |
    package sentinel
    // ... (full Go source, ~1843 lines)
  go.mod: |
    module github.com/erfianugrah/sentinel

    go 1.22
  .traefik.yml: |
    displayName: Sentinel
    type: middleware
    import: github.com/erfianugrah/sentinel
    summary: Real client IP resolution + heuristic bot detection + IPsum blocklist + expression-based firewall rules.
    testData:
      trustedHeaders:
        - Cf-Connecting-Ip
        - X-Forwarded-For
      # ...
```

Mounted in the Deployment:

```yaml
volumeMounts:
  - name: plugin-sentinel
    mountPath: /plugins-local/src/github.com/erfianugrah/sentinel
    readOnly: true
volumes:
  - name: plugin-sentinel
    configMap:
      name: traefik-plugin-sentinel
```

Enabled via args:

```yaml
args:
  - "--experimental.localPlugins.sentinel.moduleName=github.com/erfianugrah/sentinel"
```

Middleware CRD (applied as a global middleware on the websecure entrypoint):

middleware/sentinel-middleware.yaml
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: sentinel
  namespace: traefik
spec:
  plugin:
    sentinel:
      trustedHeaders:
        - Cf-Connecting-Ip
        - X-Forwarded-For
      trustedProxies:
        - "10.42.0.0/16"    # k3s pod CIDR
        - "10.43.0.0/16"    # k3s service CIDR
        - "173.245.48.0/20" # Cloudflare IPv4
        # ... all CF ranges
      enabled: true
      blockThreshold: 100
      tagThreshold: 60
      rateLimitPerSecond: 30
      rateLimitWindowSeconds: 10
      blocklistFile: "/blocklists/ipsum.txt"
      blocklistReloadSeconds: 300
      allowedIPs: "195.240.81.42"
      scannerUAs: "sqlmap,nikto,dirbuster,masscan,zgrab,nuclei,httpx,gobuster,ffuf,nmap,whatweb,wpscan,joomla,drupal"
      honeypotPaths: "/.env,/.git/HEAD,/.git/config,/wp-login.php,/wp-config.php,/wp-admin,/.aws/credentials,/actuator/env,/actuator/health,/xmlrpc.php,/.DS_Store,/config.json,/package.json,/.htaccess,/server-status,/debug/pprof"
      rules: |
        [
          {"id":"r1","description":"Allow owner IP","expression":"ip.src eq \"195.240.81.42\"","action":"allow","enabled":true,"priority":1},
          ...
        ]
```

The rules field is a JSON array string. Legacy fields (allowedIPs, honeypotPaths, scannerUAs) still work as backward-compatible shortcuts alongside the rule engine.

The decompress plugin exists for one reason: Cloudflare Logpush always gzip-compresses HTTP payloads, and Alloy’s /loki/api/v1/raw endpoint doesn’t handle Content-Encoding: gzip. Traefik’s built-in compress middleware only handles response compression, not request body decompression.

The plugin is simple — 71 lines of Go:

```go
func (d *Decompress) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
	encoding := strings.ToLower(req.Header.Get("Content-Encoding"))
	if encoding != "gzip" {
		d.next.ServeHTTP(rw, req)
		return
	}
	gzReader, err := gzip.NewReader(req.Body)
	if err != nil {
		http.Error(rw, fmt.Sprintf("failed to create gzip reader: %v", err), http.StatusBadRequest)
		return
	}
	defer gzReader.Close()
	decompressed, err := io.ReadAll(gzReader)
	if err != nil {
		http.Error(rw, fmt.Sprintf("failed to decompress body: %v", err), http.StatusBadRequest)
		return
	}
	req.Body = io.NopCloser(bytes.NewReader(decompressed))
	req.ContentLength = int64(len(decompressed))
	req.Header.Set("Content-Length", strconv.Itoa(len(decompressed)))
	req.Header.Del("Content-Encoding")
	d.next.ServeHTTP(rw, req)
}
```

Same ConfigMap packaging pattern as sentinel. The decompress middleware CRD lives in the monitoring namespace (same as the Alloy Logpush IngressRoute that uses it):

middleware/decompress-middleware.yaml
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: decompress
  namespace: monitoring
spec:
  plugin:
    decompress: {}
```

Published at github.com/erfianugrah/decompress.


Two middlewares are applied globally to every request on the websecure entrypoint via the --entrypoints.websecure.http.middlewares flag:

```yaml
- "--entrypoints.websecure.http.middlewares=traefik-sentinel@kubernetescrd,traefik-security-headers@kubernetescrd"
```

The format is <namespace>-<name>@kubernetescrd. Order matters — sentinel runs first (resolves IP, checks blocklist, scores request, evaluates rules), then security-headers adds HSTS and other response headers.

middleware/security-headers.yaml
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: security-headers
  namespace: traefik
spec:
  headers:
    stsSeconds: 63072000 # HSTS 2 years
    stsIncludeSubdomains: true
    stsPreload: true
    contentTypeNosniff: true
    referrerPolicy: "strict-origin-when-cross-origin"
    permissionsPolicy: "camera=(), microphone=(), geolocation=(), payment=()"
    customResponseHeaders:
      Server: "" # Strip server identity
      X-Powered-By: ""
```

frameDeny, browserXssFilter, and CSP are intentionally omitted from the global middleware. These are app-specific — Authentik needs its own CSP, Grafana needs iframe support for embedding, etc. Apply those per-route where needed.


middleware/tls-options.yaml
```yaml
apiVersion: traefik.io/v1alpha1
kind: TLSOption
metadata:
  name: default
  namespace: default
spec:
  minVersion: VersionTLS12
  maxVersion: VersionTLS13
  cipherSuites:
    # TLS 1.2 only -- TLS 1.3 ciphers are not configurable in Go (all safe by default)
    - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
    - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
  curvePreferences:
    - X25519
    - CurveP256
  sniStrict: true
  alpnProtocols:
    - h2
    - http/1.1
```

The TLSOption must be named default in the default namespace for Traefik to pick it up as the default TLS configuration. All cipher suites are AEAD-only (GCM or ChaCha20-Poly1305) — no CBC mode. sniStrict: true rejects connections that don’t present a valid SNI hostname matching a known route.

```yaml
args:
  - "--certificatesresolvers.cloudflare.acme.dnschallenge.provider=cloudflare"
  - "--certificatesresolvers.cloudflare.acme.email=erfi.anugrah@gmail.com"
  - "--certificatesresolvers.cloudflare.acme.dnschallenge.resolvers=1.1.1.1"
  - "--certificatesresolvers.cloudflare.acme.storage=/ssl-certs-2/acme-cloudflare.json"
```

The CF_DNS_API_TOKEN env var is pulled from a Kubernetes Secret (cloudflare-credentials). The ACME cert storage lives on an NFS PVC (traefik-ssl-2, 2Gi, RWX) so certs survive pod restarts and don’t trigger Let’s Encrypt rate limits on every rollout.

pvc-claims/traefik-ssl-pvc.yaml
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: traefik-ssl-2
  namespace: traefik
spec:
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 2Gi
  storageClassName: nfs-client
```

Each service gets its own rate limit middleware to prevent cross-service token bucket interference. The problem this solves: when multiple services share a single rate-limit-api middleware, Traefik maintains one token bucket per source IP per middleware instance. All routes sharing that middleware share the same bucket. Authentik OAuth flows generate 35+ requests in bursts (redirects, consent, callback, static assets), which would exceed a shared 10 req/s bucket and return 429s.

All per-route rate limit middlewares live in a single file:

```yaml
# middleware/rate-limits.yaml (pattern -- 22 middlewares total)
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: rl-authentik
  namespace: traefik
spec:
  rateLimit:
    average: 100
    period: 1s
    burst: 500
    sourceCriterion:
      requestHeaderName: X-Real-Client-Ip
```
sourceCriterion.requestHeaderName: X-Real-Client-Ip uses the header set by the sentinel plugin for per-IP bucketing. Without this, Traefik would use the connection source IP, which behind Cloudflare Tunnel is always the cloudflared pod IP — meaning all users would share one bucket.

Rate limits for monitoring/query services (Grafana, Prometheus, Alertmanager, Jaeger, Logpush, Traefik Dashboard, Traefik Prometheus) are currently commented out in their IngressRoutes. These services generate heavy internal query traffic (Grafana fires dozens of parallel Loki queries when loading dashboards), and rate limiting them causes query timeouts.

Managing rate limits via the Security Dashboard:

The Security Dashboard’s Rate Limits page provides a web UI for managing all 22 rl-* middleware CRDs without kubectl:

  • Inline editing: click any value (average, burst, period) in the table to edit it in-place. Saves are instant via kubectl patch (strategic merge patch) on the middleware CRD
  • Create: modal form to create a new rl-{name} middleware CRD with configurable average, burst, period, and source criterion
  • Delete: removes the middleware CRD entirely (with confirmation dialog)

The dashboard’s ClusterRole has get, list, watch, patch, update, create, delete permissions for middlewares in the traefik.io API group.

```sh
# Equivalent kubectl commands for reference
kubectl get middlewares.traefik.io -n traefik -l app!=sentinel | grep "^rl-"
kubectl patch middleware rl-grafana -n traefik --type merge \
  -p '{"spec":{"rateLimit":{"average":200,"burst":1000}}}'
```

middleware/inflight-req.yaml
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: inflight-req
  namespace: traefik
spec:
  inFlightReq:
    amount: 100
    sourceCriterion:
      requestHeaderName: X-Real-Client-Ip
```

Limits concurrent connections per source IP to 100. Unlike rate limiting (which controls request rate), this controls concurrency. A single IP can’t monopolize all backend connections. Shared across all routes — this is fine because the limit is per-IP, not per-route.

middleware/retry.yaml
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: retry
  namespace: traefik
spec:
  retry:
    attempts: 3
    initialInterval: 100ms
```
3 attempts total (1 initial + 2 retries) with exponential backoff starting at 100ms. Only retries on connection errors, NOT on non-2xx status codes. Also shared across all routes.

middleware/authentik-forward-auth.yaml
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: authentik-forward-auth
  namespace: authentik
spec:
  forwardAuth:
    address: http://authentik-server.authentik.svc.cluster.local/outpost.goauthentik.io/auth/traefik
    trustForwardHeader: true
    authResponseHeaders:
      - X-authentik-username
      - X-authentik-groups
      - X-authentik-entitlements
      - X-authentik-email
      - X-authentik-name
      - X-authentik-uid
      - X-authentik-jwt
      - X-authentik-meta-jwks
      - X-authentik-meta-outpost
      - X-authentik-meta-provider
      - X-authentik-meta-app
      - X-authentik-meta-version
```

Applied per-route to services that need SSO protection (Jaeger UI, etc.). Traefik forwards a sub-request to Authentik’s embedded outpost; if Authentik returns 200, the original request proceeds with the X-authentik-* headers injected. If 401/403, the user is redirected to the Authentik login flow.


20+ IngressRoutes route traffic from hostnames to backend services. Each IngressRoute specifies its middleware chain. The middleware execution order is: global middlewares first (sentinel → security-headers), then per-route middlewares in the order listed.

| Route | Host | Middlewares | Namespace |
|---|---|---|---|
| Grafana | grafana-k3s.example.com | ~~rl-grafana~~, inflight-req, retry | monitoring |
| Prometheus | prom-k3s.example.com | ~~rl-prometheus~~, inflight-req, retry | monitoring |
| Alertmanager | alertmanager-k3s.example.com | ~~rl-alertmanager~~, inflight-req, retry | monitoring |
| Jaeger | jaeger-k3s.example.com | ~~rl-jaeger~~, inflight-req, authentik-forward-auth, retry | monitoring |
| Logpush | logpush-k3s.example.com | ~~rl-logpush~~, inflight-req, decompress, retry | monitoring |
| Traefik Dashboard | traefik-dashboard.example.com | ~~rl-traefik-dashboard~~, inflight-req, retry | traefik |
| Traefik Prometheus | traefik-prometheus.example.com | ~~rl-traefik-prometheus~~, inflight-req, retry | traefik |
| Authentik | authentik.example.com | rl-authentik, inflight-req, authentik-csp, retry | authentik |
| Revista | mydomain.com | rl-revista, inflight-req, retry | revista |
| ArgoCD (HTTP) | argocd.example.com | rl-argocd, inflight-req, retry | argocd |
| ArgoCD (gRPC) | argocd.example.com + gRPC header | rl-argocd, inflight-req, retry | argocd |
| Dendrite | dendrite.example.com | rl-dendrite, inflight-req, retry | dendrite |
| httpbun | httpbun-k3s.example.com | rl-httpbun, inflight-req, retry | httpbun |
| Jitsi (from Element) | jitsi.example.com + Referer match | rl-jitsi, inflight-req, retry | jitsi |
| Jitsi (direct) | jitsi.example.com | rl-jitsi, inflight-req, retry | jitsi |
| LiveKit JWT | matrix-rtc.example.com/livekit/jwt | rl-livekit, inflight-req, strip-livekit-jwt, retry | livekit |
| LiveKit SFU | matrix-rtc.example.com/livekit/sfu | rl-livekit, inflight-req, strip-livekit-sfu, retry | livekit |
| Element (chat) | chat.example.com | rl-matrix-element, inflight-req, retry | matrix |
| Synapse Admin | admin.matrix.example.com | rl-matrix-admin, inflight-req, retry | matrix |
| Synapse | matrix.example.com | rl-matrix-synapse, inflight-req, retry | matrix |
| Maubot | maubot.example.com | rl-maubot, inflight-req, retry | maubot |
| Headlamp | headlamp-k3s.example.com | rl-headlamp, inflight-req, retry | headlamp |
| Longhorn | longhorn.example.com | rl-longhorn, inflight-req, retry | longhorn-system |
| Portainer | portainer-k3s.example.com | rl-portainer, inflight-req, retry | portainer |
| Portainer Agent | port-agent-k3s.example.com | rl-portainer-agent, inflight-req, retry | portainer |
| Security Dashboard | security-k3s.example.com | authentik-forward-auth | security-dashboard |

Strikethrough (~~) indicates rate limits that are currently commented out.


All external traffic enters the cluster through a Cloudflare Tunnel. The tunnel connects from a cloudflared Deployment inside the cluster to Cloudflare’s edge network via outbound QUIC connections — no inbound ports or public IPs needed.

  1. cloudflared runs in the cloudflared namespace, maintains 4 HA connections to Cloudflare edge
  2. DNS CNAME records point each hostname to the tunnel’s .cfargotunnel.com address
  3. Cloudflare edge receives the request, looks up the tunnel config, and forwards to cloudflared
  4. cloudflared routes to the Traefik Service based on hostname matching in the tunnel ingress rules
  5. Traefik handles TLS termination, middleware, and routing to the backend

Each hostname maps to the Traefik Service’s cluster-internal HTTPS endpoint:

cloudflare-tunnel-tf/tunnel_config.tf
```hcl
ingress_rule {
  hostname = "grafana-k3s.${var.secondary_domain_name}"
  service  = "https://traefik.traefik.svc.cluster.local"
  origin_request {
    origin_server_name = "grafana-k3s.${var.secondary_domain_name}"
    http2_origin       = true
  }
}
```

origin_server_name is set to the actual hostname so cloudflared presents the correct SNI to Traefik. http2_origin = true enables HTTP/2 between cloudflared and Traefik, which is needed for gRPC (ArgoCD) and improves multiplexing.

Each service that needs external access gets its own ingress rule. For example, the security dashboard:

```hcl
ingress_rule {
  hostname = "security-k3s.${var.secondary_domain_name}"
  service  = "http://security-dashboard.security-dashboard.svc.cluster.local"
  origin_request {
    origin_server_name = "security-k3s.${var.secondary_domain_name}"
    http2_origin       = true
  }
}
```

The catch-all rule at the bottom returns 404 for unrecognized hostnames:

```hcl
ingress_rule {
  service = "http_status:404"
}
```

Each service gets a CNAME record pointing to the tunnel:

cloudflare-tunnel-tf/records.tf
```hcl
resource "cloudflare_record" "grafana-k3s" {
  zone_id = var.cloudflare_secondary_zone_id
  name    = "grafana-k3s"
  type    = "CNAME"
  content = cloudflare_zero_trust_tunnel_cloudflared.k3s.cname
  proxied = true
  tags    = ["k3s", "monitoring"]
}
```

proxied = true routes traffic through Cloudflare’s edge (DDoS protection, WAF, caching). The CNAME target is the tunnel’s unique .cfargotunnel.com address, auto-generated by the cloudflare_zero_trust_tunnel_cloudflared resource.


Traefik uses a KEDA ScaledObject with 5 triggers for intelligent autoscaling between 1 and 8 replicas:

hpa/traefik-keda-autoscaling.yaml
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: traefik-keda
  namespace: traefik
spec:
  scaleTargetRef:
    name: traefik
  pollingInterval: 5
  cooldownPeriod: 10
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: cpu
      metadata:
        type: Utilization
        value: "50"
    - type: memory
      metadata:
        type: Utilization
        value: "75"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
        metricName: traefik_open_connections
        threshold: "1000"
        query: sum(traefik_open_connections{entrypoint="websecure"})
    - type: prometheus
      metadata:
        metricName: traefik_request_duration
        threshold: "0.5"
        query: histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket{entrypoint="websecure"}[1m])) by (le))
    - type: prometheus
      metadata:
        metricName: traefik_requests_total
        threshold: "1000"
        query: sum(rate(traefik_entrypoint_requests_total{entrypoint="websecure"}[1m]))
```

The Prometheus triggers query Traefik’s own metrics: open connection count, p95 request duration, and request rate. Any single trigger exceeding its threshold causes a scale-up. The 5-second polling interval and 10-second cooldown make it responsive without flapping.
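KEDA hands these triggers to the HPA, which computes a desired replica count per trigger and scales to the maximum. A sketch of that arithmetic under this ScaledObject's bounds (helper name and sample values are hypothetical):

```python
import math

def desired_replicas(current: int, triggers: dict[str, tuple[float, float]]) -> int:
    """HPA-style scaling: desired = ceil(current * observed / threshold)
    per trigger; the highest demand wins, clamped to the ScaledObject's
    minReplicaCount (1) and maxReplicaCount (8)."""
    demands = [math.ceil(current * observed / threshold)
               for observed, threshold in triggers.values()]
    return max(1, min(8, max(demands)))

# With 2 replicas, p95 latency at 0.8s against the 0.5s threshold
# dominates: ceil(2 * 0.8 / 0.5) = 4 replicas.
print(desired_replicas(2, {
    "open_connections": (600, 1000),
    "p95_latency":      (0.8, 0.5),
    "request_rate":     (400, 1000),
}))
```

This is why any single trigger exceeding its threshold forces a scale-up, while triggers below threshold never scale the deployment down on their own.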


Access logs → Loki (with structured metadata)


Traefik writes JSON-formatted access logs to stdout:

args:
- "--accesslog=true"
- "--accesslog.format=json"
- "--accesslog.bufferingsize=100"
- "--accesslog.fields.defaultmode=keep"
- "--accesslog.fields.headers.defaultmode=keep"

The bufferingsize=100 buffers up to 100 log lines before flushing, reducing I/O pressure. fields.defaultmode=keep and fields.headers.defaultmode=keep include all fields and request/response headers in the JSON output — this is what enables the sentinel bot score, block reason, and other custom headers to appear in the access logs.

The Alloy DaemonSet picks up these logs from the Traefik container’s stdout (via /var/log/pods/), parses the JSON, and sends them to Loki with 19 structured metadata fields:

| Field | Source | Purpose |
| --- | --- | --- |
| status | DownstreamStatus | HTTP response status code |
| downstream_status | DownstreamStatus | Same (for compatibility) |
| router | RouterName | Traefik router that handled the request |
| service | ServiceName | Backend service |
| client_ip | ClientHost | Direct connection source (usually cloudflared pod) |
| real_client_ip | request_X-Real-Client-Ip | Actual client IP (set by sentinel) |
| bot_score | request_X-Bot-Score | Sentinel bot score |
| blocked_by | request_X-Blocked-By | Block source (sentinel-rule, sentinel-blocklist, etc.) |
| country | request_X-Geo-Country | Client country code (Cf-Ipcountry → GeoIP MMDB fallback) |
| cf_connecting_ip | request_Cf-Connecting-Ip | Cloudflare's client IP header |
| request_host | RequestHost | Host header value |
| request_path | RequestPath | URI path |
| request_protocol | RequestProtocol | HTTP/1.1, HTTP/2.0, etc. |
| duration | Duration | Total request duration |
| origin_duration | OriginDuration | Backend response time |
| overhead | Overhead | Traefik processing overhead |
| downstream_size | DownstreamContentSize | Response body size |
| tls_version | TLSVersion | TLS 1.2 or 1.3 |
| user_agent | request_User-Agent | Client User-Agent |

Labels (low cardinality): entrypoint, method, job="traefik-access-log"

Dashboards query these structured metadata fields directly instead of using | json full-line parsing, which is 5-10x faster (14ms vs 30s+ response times).
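The difference in query shape, as a hedged illustration (label and field names taken from the tables above):

```logql
# fast: filter on a structured metadata field, no per-line parsing
{job="traefik-access-log"} | blocked_by = "sentinel-rule"

# slow: parse the full JSON line for every entry in the time range
{job="traefik-access-log"} | json | blocked_by = "sentinel-rule"
```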

Additionally, Alloy generates 7 Prometheus counters via stage.metrics, categorized by Sentinel block type:

| Counter | Match condition |
| --- | --- |
| loki_process_custom_traefik_access_requests_total | All requests (match_all = true) |
| loki_process_custom_traefik_access_sentinel_blocks_total | blocked_by = "sentinel" (bot scoring threshold) |
| loki_process_custom_traefik_access_blocklist_blocks_total | blocked_by = "sentinel-blocklist" (IPsum blocklist) |
| loki_process_custom_traefik_access_ratelimit_blocks_total | blocked_by = "rate-limit" (per-IP rate limit) |
| loki_process_custom_traefik_access_sentinel_rule_blocks_total | blocked_by = "sentinel-rule" (firewall rule engine) |
| loki_process_custom_traefik_access_tarpit_blocks_total | blocked_by = "sentinel-tarpit" (tarpit action) |
| loki_process_custom_traefik_access_403_total | downstream_status = "403" (all 403s regardless of source) |

The source field in stage.metrics reads from the extracted data map populated by stage.json, matching the blocked_by and downstream_status JSON keys. These counters power the Security Dashboard’s instant-loading aggregate statistics and the Grafana Traefik Access Logs dashboard’s security section without querying Loki.
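The shape of one such counter, as a hedged sketch in Alloy's stage.metrics syntax (not the deployed config, which lives in monitoring/alloy/configmap.yaml):

```river
// Alloy prefixes counter names with loki_process_custom_ automatically.
stage.metrics {
  metric.counter {
    name        = "traefik_access_sentinel_rule_blocks_total"
    description = "requests blocked by the sentinel rule engine"
    source      = "blocked_by"    // key in the extracted map populated by stage.json
    value       = "sentinel-rule" // increment only when the source equals this value
    action      = "inc"
  }
}
```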

See the monitoring stack guide for the full Alloy config and the label/metadata split.

args:
- "--tracing.otlp=true"
- "--tracing.otlp.grpc=true"
- "--tracing.otlp.grpc.endpoint=alloy.monitoring.svc.cluster.local:4317"
- "--tracing.otlp.grpc.insecure=true"
- "--tracing.serviceName=traefik"
- "--tracing.sampleRate=1.0"

Traefik sends OTLP traces to the Alloy DaemonSet on each node, which batches and forwards them to Jaeger. 100% sample rate is fine for a homelab — in production you’d want to sample down.
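Sampling down is a one-flag change; as an illustrative example (the 0.1 rate is an assumption, not a deployed value):

```yaml
- "--tracing.sampleRate=0.1" # keep roughly 10% of traces
```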

args:
- "--metrics.prometheus=true"
- "--metrics.prometheus.entrypoint=metrics"
- "--metrics.prometheus.addrouterslabels=true"

addrouterslabels=true adds a router label to all metrics, enabling per-IngressRoute dashboards and alerting. The metrics endpoint is scraped by a ServiceMonitor in the monitoring stack.
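A hedged sketch of that ServiceMonitor — selector labels, port name, and scrape interval here are assumptions, not the deployed manifest:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: traefik
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames: ["traefik"]
  selector:
    matchLabels:
      app.kubernetes.io/name: traefik # assumed Service label
  endpoints:
    - port: metrics # assumed port name on the Traefik Service
      interval: 15s
```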


# 1. Disable built-in Traefik (one-time)
ansible-playbook -i inventory.yml \
ansible-playbooks/my-playbooks/disable-builtin-traefik.yml \
--become --ask-become-pass
# 2. Apply CRDs
kubectl apply -f crds/kubernetes-crd-definition-v1.yml --server-side
kubectl apply -f crds/kubernetes-crd-rbac.yml
# 3. Apply TLS options
kubectl apply -f middleware/tls-options.yaml
# 4. Deploy plugin ConfigMaps
kubectl apply -f middleware/decompress-configmap.yaml
kubectl apply -f middleware/sentinel-configmap.yaml
# 5. Deploy middleware CRDs
kubectl apply -f middleware/sentinel-middleware.yaml
kubectl apply -f middleware/security-headers.yaml
kubectl apply -f middleware/decompress-middleware.yaml
kubectl apply -f middleware/authentik-forward-auth.yaml
kubectl apply -f middleware/inflight-req.yaml
kubectl apply -f middleware/retry.yaml
kubectl apply -f middleware/rate-limits.yaml
# 6. Deploy IPsum blocklist CronJob
kubectl apply -f services/sentinel/ipsum-cronjob.yaml
# 7. Deploy Traefik (includes SA, ClusterRole, Service, Deployment, IngressClass, PDB)
kubectl apply -f services/traefik.yaml
# 8. Deploy IngressRoutes
kubectl apply -f ingressroutes/
# 9. Deploy KEDA autoscaling
kubectl apply -f hpa/traefik-keda-autoscaling.yaml
# 10. Apply DNS + tunnel config (OpenTofu)
cd cloudflare-tunnel-tf/ && tofu apply
# Traefik pods running on different nodes
kubectl get pods -n traefik -o wide
# All middlewares loaded
kubectl get middlewares.traefik.io -n traefik
# TLS option active
kubectl get tlsoptions.traefik.io -A
# IngressRoutes across all namespaces
kubectl get ingressroutes.traefik.io -A
# KEDA ScaledObject active
kubectl get scaledobject -n traefik
# Test bot detection (should return 403)
curl -s -o /dev/null -w "%{http_code}" -H "User-Agent: sqlmap/1.0" https://httpbun-k3s.example.com/
# Test honeypot path (should return 403)
curl -s -o /dev/null -w "%{http_code}" https://httpbun-k3s.example.com/.env
# Test rule engine - .git block (should return 403 with X-Rule-Match: r2)
curl -s -D - https://httpbun-k3s.example.com/.git/config 2>&1 | grep -i "x-blocked-by\|x-rule-match\|http/"
# Check IPsum blocklist loaded
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=50 | grep -i "blocklist\|ipsum"

Sentinel is the sole inline security layer. All blocking, scoring, and rule evaluation happens here.

A dedicated Go+htmx web application at security-k3s.example.com provides a browser-based interface for managing Sentinel configuration and viewing security analytics. Protected by Authentik forward-auth.

| Feature | Dashboard | Direct config |
| --- | --- | --- |
| View aggregate stats (requests, blocks, errors) | Yes (Prometheus, instant) | N/A |
| View recent blocks with details | Yes (background Loki worker) | kubectl logs |
| View bot score distribution | Yes (chart) | N/A |
| Manage firewall rules (CRUD, reorder) | Yes (modal editor, drag-to-reorder) | Edit middleware YAML |
| Manage detection rules (honeypots, scanner UAs) | Yes | Edit middleware YAML |
| Manage allowlist (add/remove IPs) | Yes | Edit middleware YAML |
| Check IP against blocklist | Yes | N/A |
| Manage rate limits (CRUD) | Yes (inline edit, create/delete modals) | kubectl patch/create/delete |
| Trigger blocklist reload | Yes | Restart Traefik |
| IP lookup (all access logs for an IP) | Yes (Loki query) | LogQL |

Source: services/security-dashboard/. Deploy: docker build --platform linux/arm64 -t erfianugrah/security-dashboard:latest . → docker push → kubectl rollout restart deployment/security-dashboard -n security-dashboard. See security-stack.md for full architecture details.

Rules can be managed via the Security Dashboard’s Policy Engine page or by editing the middleware CRD directly:

# View current rules
kubectl get middleware sentinel -n traefik -o jsonpath='{.spec.plugin.sentinel.rules}' | python3 -m json.tool
# Edit rules directly (careful -- JSON in YAML)
kubectl edit middleware sentinel -n traefik

The Dashboard’s Policy Engine page is preferred — it provides a modal editor with field reference, expression validation, drag-to-reorder priority, and a Deploy button that applies changes atomically.

If services behind Traefik return 403 Forbidden unexpectedly, check in this order:

  1. Sentinel rule engine — check response headers X-Blocked-By and X-Rule-Match to identify which rule blocked the request
  2. Sentinel bot scoring — check X-Bot-Score header. Score >= 100 triggers a block. Review heuristic signals
  3. IPsum blocklist — is the client IP in the blocklist? Check via the Security Dashboard’s Blocklist page
  4. Cloudflare WAF — check the Cloudflare dashboard for firewall events (these happen before traffic reaches Traefik)

The global middleware chain on the websecure entrypoint is: sentinel -> security-headers. The X-Blocked-By header distinguishes block sources:

| X-Blocked-By value | Source | Fix |
| --- | --- | --- |
| sentinel-rule | Rule engine matched (check X-Rule-Match for rule ID) | Edit/disable the rule |
| sentinel-blocklist | IP in IPsum blocklist | Add IP to allowedIPs or add an allow rule |
| sentinel-heuristic | Bot score exceeded threshold | Add IP to allowedIPs or add an allow rule |
| sentinel-rate | Per-IP rate limit exceeded | Increase rateLimitPerSecond or add an allow rule |

Common false positive: Cloudflare Logpush. Logpush sends requests without browser headers, triggering bot heuristics. Fix: add a rule matching the X-Logpush-Secret header with allow action (already deployed as rule r6).
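The rule schema is defined by the sentinel plugin itself, so treat this as a sketch only — field names and expression syntax below are assumptions, not the deployed r6:

```json
{
  "id": "r6",
  "expr": "header('X-Logpush-Secret') != ''",
  "action": "allow"
}
```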

The blocklist is refreshed daily by a CronJob. To force a reload:

# Trigger manual CronJob run
kubectl create job --from=cronjob/ipsum-update ipsum-manual -n sentinel
# Or restart Traefik (blocklist reloads on startup)
kubectl rollout restart deployment/traefik -n traefik

The blocklist ConfigMap is in the traefik namespace. Sentinel reloads it in-memory every blocklistReloadSeconds (default 300s) without requiring a Traefik restart.

Country data is resolved on every request and set as X-Geo-Country. When Cloudflare is in the path, Cf-Ipcountry is used directly. For non-CF traffic (e.g., direct tunnel access or if CF is removed), the GeoIP MMDB lookup provides country data.
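The fallback order can be sketched as follows (the real logic lives in the plugin's Go source; header names come from this guide, the helper name is hypothetical):

```python
def resolve_country(headers: dict[str, str], mmdb_lookup) -> str:
    """Prefer Cloudflare's Cf-Ipcountry header; otherwise fall back to a
    GeoIP MMDB lookup on the real client IP; otherwise return empty."""
    cf = headers.get("Cf-Ipcountry", "")
    if cf:
        return cf
    ip = headers.get("X-Real-Client-Ip", "")
    return mmdb_lookup(ip) if ip else ""

# Behind Cloudflare the header wins and the MMDB is never consulted:
print(resolve_country({"Cf-Ipcountry": "NL"}, lambda ip: "US"))  # NL
```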

To verify GeoIP is working:

# Check Traefik logs for GeoIP database load
kubectl logs -n traefik -l app.kubernetes.io/name=traefik | grep "GeoIP"
# Expected: [sentinel] GeoIP database loaded: 1189588 nodes, IPv6
# Verify X-Geo-Country header in access logs
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=5 | grep -o '"request_X-Geo-Country":"[^"]*"'

The DB-IP database is refreshed monthly. To force a re-download, restart the Traefik deployment (the init container runs on each pod start).


services/
  traefik.yaml                        # SA, ClusterRole, Service, Deployment, IngressClass, PDB
crds/
  kubernetes-crd-definition-v1.yml    # Traefik CRDs (~3.5 MB)
  kubernetes-crd-rbac.yml             # ClusterRole for CRD provider
middleware/
  # Local plugins (source + ConfigMap + Middleware CRD)
  sentinel.go                         # ~1843-line Go source (IP + bot + blocklist + rule engine)
  sentinel-middleware.yaml            # Middleware CRD with config (thresholds, rules, blocklist, allowlist)
  decompress-plugin/
    decompress.go                     # 71-line Go source
    go.mod / .traefik.yml
  decompress-configmap.yaml           # ConfigMap packaging for k8s
  decompress-middleware.yaml          # Middleware CRD (in monitoring ns)
  # Global middlewares
  security-headers.yaml               # HSTS, nosniff, permissions policy
  tls-options.yaml                    # TLSOption (min TLS 1.2, AEAD ciphers, sniStrict)
  # Shared per-route middlewares
  rate-limits.yaml                    # 22 per-route rate limit middlewares (rl-*)
  inflight-req.yaml                   # 100 concurrent req/IP
  retry.yaml                          # 3 attempts, 100ms backoff
  # Auth
  authentik-forward-auth.yaml         # Forward auth to Authentik (in authentik ns)
services/sentinel/
  ipsum-cronjob.yaml                  # SA, Role, RoleBinding, Python script ConfigMap, CronJob
ingressroutes/
  alertmanager-ingress.yaml           # monitoring
  alloy-logpush-ingress.yaml          # monitoring (+ decompress middleware)
  argocd-ingress.yaml                 # argocd (2 routes: HTTP + gRPC)
  authentik-ingress.yaml              # authentik
  dendrite-ingress.yaml               # dendrite
  grafana-ingress.yaml                # monitoring
  httpbun-ingress.yaml                # httpbun
  jaeger-ingress.yaml                 # monitoring (+ authentik-forward-auth)
  longhorn-ingress.yaml               # longhorn-system
  portainer-agent-ingress.yaml        # portainer
  portainer-ingress.yaml              # portainer
  prometheus-ingress.yaml             # monitoring
  revista-ingress.yaml                # revista
  traefik-dashboard-ingress.yaml      # traefik (api@internal)
  traefik-prometheus-ingress.yaml     # traefik (prometheus@internal)
services/*/ingress.yaml               # Service-embedded IngressRoutes
  security-dashboard/manifests.yaml   # SA, RBAC, Secret, Deployment, Service, IngressRoute
  headlamp/ingressroute.yaml
  jitsi/ingress.yaml                  # 2 routes: Referer-gated + direct
  livekit/ingress.yaml                # 2 routes + stripPrefix middlewares
  matrix/ingress.yaml                 # 3 routes: Element, Synapse Admin, Synapse
  maubot/ingress.yaml
tests/
  sentinel-e2e.sh                     # 15-test E2E suite (scanner UA, honeypots, rules, headers)
services/security-dashboard/
  main.go                             # Go+htmx dashboard (~2700+ lines, zero deps)
  manifests.yaml                      # SOPS-encrypted (ns, SA, RBAC, secret, deploy, svc, ingress)
  Dockerfile                          # Multi-stage ARM64 build
  ui/                                 # Templates + static assets (go:embed)
monitoring/
  alloy/configmap.yaml                # 19 structured metadata fields, 7 Prometheus counters
  loki/configmap.yaml                 # gRPC 16MB max, split_queries 1h
  grafana/dashboards/
    traefik-access-logs.json          # 37 panels, Prometheus + Loki, Sentinel security section
hpa/
  traefik-keda-autoscaling.yaml       # 5-trigger ScaledObject (1-8 replicas)
pvc-claims/
  traefik-ssl-pvc.yaml                # 2Gi NFS PVC for ACME cert storage
cloudflare-tunnel-tf/
  tunnel_config.tf                    # Tunnel ingress rules (hostname → Traefik)
  records.tf                          # DNS CNAME records → tunnel
ansible-playbooks/my-playbooks/
  disable-builtin-traefik.yml         # Disables k3s built-in Traefik + ServiceLB