Full Observability Stack on k3s: Prometheus, Loki, Jaeger, Grafana, and Cloudflare Logpush
A complete guide to building a full observability stack on a 4-node ARM64 k3s homelab cluster. No Helm — everything is raw Kustomize manifests. The stack covers metrics (Prometheus + Alertmanager), logging (Loki + Alloy), tracing (Jaeger with spanmetrics), and visualization (Grafana with 16 dashboards). On top of the standard LGTM stack, Cloudflare Logpush feeds HTTP request logs, firewall events, and Workers traces through a custom Traefik decompression plugin into Loki for security analytics and performance monitoring. Traefik access logs are enriched with structured metadata (bot scores, client IPs, TLS versions) via Alloy for a dedicated access log dashboard.
The guide is structured as a linear build-up: Prometheus Operator and core metrics first, then Loki and log collection, then Jaeger tracing, then Grafana with dashboards and SSO, then the Cloudflare Logpush pipeline with its custom Traefik plugin. Each section includes the actual manifests used in production.
Architecture Overview
The cluster runs on 4x ARM64 Rock boards (rock1-rock4) on a private LAN behind a VyOS router with a PPPoE WAN link. All HTTP traffic enters via Cloudflare Tunnel through Traefik. The monitoring stack runs entirely in the monitoring namespace.
Component versions
| Component | Version | Image |
|---|---|---|
| Prometheus Operator | v0.89.0 | quay.io/prometheus-operator/prometheus-operator:v0.89.0 |
| Prometheus | v3.9.1 | quay.io/prometheus/prometheus:v3.9.1 |
| Alertmanager | v0.31.1 | quay.io/prometheus/alertmanager:v0.31.1 |
| kube-state-metrics | v2.18.0 | registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.18.0 |
| Node Exporter | v1.10.2 | quay.io/prometheus/node-exporter:v1.10.2 |
| Blackbox Exporter | v0.28.0 | quay.io/prometheus/blackbox-exporter:v0.28.0 |
| Grafana | 12.3.3 | docker.io/grafana/grafana:12.3.3 |
| k8s-sidecar | 2.5.0 | quay.io/kiwigrid/k8s-sidecar:2.5.0 |
| Loki | 3.6.5 | docker.io/grafana/loki:3.6.5 |
| Grafana Alloy | v1.13.1 | docker.io/grafana/alloy:v1.13.1 |
| Jaeger | 2.15.1 | docker.io/jaegertracing/jaeger:2.15.1 |
External access
All monitoring UIs are exposed via Cloudflare Tunnel through Traefik IngressRoutes:
| Service | URL | IngressRoute |
|---|---|---|
| Grafana | https://grafana-k3s.example.io | ingressroutes/grafana-ingress.yaml |
| Prometheus | https://prom-k3s.example.io | ingressroutes/prometheus-ingress.yaml |
| Alertmanager | https://alertmanager-k3s.example.io | ingressroutes/alertmanager-ingress.yaml |
| Jaeger | https://jaeger-k3s.example.io | ingressroutes/jaeger-ingress.yaml |
DNS CNAME records and Cloudflare tunnel ingress rules are managed by OpenTofu in cloudflare-tunnel-tf/.
Part 1: Prometheus Operator
Why raw manifests instead of Helm
The entire stack is deployed as raw Kustomize manifests. No Helm. This gives full visibility into every resource, avoids Helm’s template abstraction layer, and makes it straightforward to patch individual fields. The trade-off is manual version bumps, which is acceptable for a homelab.
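As a sketch of what that patching workflow looks like (the paths and the patched field are illustrative, not the actual repo layout), bumping a single spec field is a plain Kustomize JSON6902 patch rather than a Helm values override:

```yaml
# kustomization.yaml — hypothetical sketch of patching one field in place
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
resources:
  - prometheus-operator/
  - prometheus/
patches:
  - target:
      kind: Prometheus
      name: prometheus
    patch: |
      - op: replace
        path: /spec/retention
        value: 14d
```

Every change is a visible diff against a concrete manifest, which is the main payoff of skipping Helm.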
Operator CRDs
The Prometheus Operator provides 10 CRDs totaling ~3.7 MB:
```yaml
resources:
  # CRDs (must be applied before operator)
  - crd-alertmanagerconfigs.yaml
  - crd-alertmanagers.yaml
  - crd-podmonitors.yaml
  - crd-probes.yaml
  - crd-prometheusagents.yaml
  - crd-prometheuses.yaml
  - crd-prometheusrules.yaml
  - crd-scrapeconfigs.yaml
  - crd-servicemonitors.yaml
  - crd-thanosrulers.yaml
  # Operator RBAC and workload
  - serviceaccount.yaml
  - clusterrole.yaml
  - clusterrolebinding.yaml
  - deployment.yaml
  - service.yaml
  - servicemonitor.yaml
  - webhook.yaml
```

The webhook cert-gen Jobs must complete before the operator Deployment starts. Kustomize handles ordering if everything is in the same kustomization.
Prometheus CR
The operator manages Prometheus via a Prometheus custom resource. It creates a StatefulSet (prometheus-prometheus), a config-reloader sidecar, and handles all ServiceMonitor/PrometheusRule reconciliation:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  version: v3.9.1
  image: quay.io/prometheus/prometheus:v3.9.1
  replicas: 1
  serviceAccountName: prometheus

  retention: 7d
  retentionSize: 8GB

  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: nfs-client
        accessModes: [ReadWriteMany]
        resources:
          requests:
            storage: 10Gi

  # Config reloader sidecar resources -- uses strategic merge patch
  containers:
    - name: config-reloader
      resources:
        requests:
          cpu: 10m
          memory: 25Mi
        limits:
          cpu: 50m
          memory: 50Mi

  walCompression: true
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 2Gi

  # Selectors -- all match `release: prometheus` label
  serviceMonitorSelector:
    matchLabels:
      release: prometheus
  serviceMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      release: prometheus
  podMonitorNamespaceSelector: {}
  probeSelector:
    matchLabels:
      release: prometheus
  probeNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      release: prometheus
  ruleNamespaceSelector: {}
  scrapeConfigSelector:
    matchLabels:
      release: prometheus
  scrapeConfigNamespaceSelector: {}

  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager
        port: http-web
        apiVersion: v2

  securityContext:
    fsGroup: 65534
    runAsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534
    seccompProfile:
      type: RuntimeDefault

  externalUrl: https://prom-k3s.example.io
```

ServiceMonitors
18 ServiceMonitors scrape targets across the cluster. The release: prometheus label is the common selector:
| ServiceMonitor | Namespace | Target |
|---|---|---|
| prometheus-operator | monitoring | Operator metrics |
| prometheus | monitoring | Prometheus self-metrics |
| alertmanager | monitoring | Alertmanager metrics |
| grafana | monitoring | Grafana metrics |
| kube-state-metrics | monitoring | kube-state-metrics |
| node-exporter | monitoring | Node Exporter (all nodes) |
| blackbox-exporter | monitoring | Blackbox Exporter |
| loki | monitoring | Loki metrics |
| alloy | monitoring | Grafana Alloy (DaemonSet) |
| alloy-logpush | monitoring | Alloy Logpush receiver |
| jaeger | monitoring | Jaeger metrics |
| traefik | traefik | Traefik ingress controller |
| cloudflared | cloudflared | Cloudflare tunnel daemon |
| authentik-metrics | authentik | Authentik server |
| revista | revista | Revista app |
| kubelet | kube-system | Kubelet + cAdvisor |
| coredns | kube-system | CoreDNS |
| apiserver | default | Kubernetes API server |
Cross-namespace ServiceMonitors (traefik, cloudflared, authentik, revista, kubelet, coredns, apiserver) live in monitoring/servicemonitors/ and use namespaceSelector.matchNames to reach across namespaces.
The kubelet ServiceMonitor scrapes three endpoints from the same port: /metrics (kubelet), /metrics/cadvisor (container metrics), and /metrics/probes (probe metrics). All use bearer token auth against the k8s API server CA.
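A sketch of what that kubelet ServiceMonitor could look like (the Service label and port name are assumptions, not copied from the repo; the three paths and the in-cluster token/CA mounts follow the description above):

```yaml
# Hypothetical sketch of the three-endpoint kubelet ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: monitoring
  labels:
    release: prometheus        # matched by the Prometheus CR's selectors
spec:
  namespaceSelector:
    matchNames: [kube-system]
  selector:
    matchLabels:
      k8s-app: kubelet         # assumed label on the kubelet Service
  endpoints:
    - port: https-metrics      # assumed port name; same port for all three
      scheme: https
      path: /metrics
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    - port: https-metrics
      scheme: https
      path: /metrics/cadvisor
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    - port: https-metrics
      scheme: https
      path: /metrics/probes
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
```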
Alert rules
Six PrometheusRule CRs provide alerting and recording rules:
| Rule file | Coverage |
|---|---|
| general-rules.yaml | Watchdog, InfoInhibitor, TargetDown |
| kubernetes-apps.yaml | Pod CrashLoopBackOff, container restarts, Deployment/StatefulSet failures |
| kubernetes-resources.yaml | CPU/memory quota overcommit, namespace resource limits |
| node-rules.yaml | Node filesystem, memory, CPU, network, clock skew |
| k8s-recording-rules.yaml | Pre-computed recording rules for dashboards |
| traefik-rules.yaml | Traefik-specific alerting rules |
KEDA autoscaling
Prometheus and Grafana use KEDA ScaledObjects for autoscaling:
```yaml
# Prometheus: targets the operator-created StatefulSet
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-keda
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: prometheus-prometheus
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: cpu
      metadata:
        type: Utilization
        value: "50"
    - type: memory
      metadata:
        type: Utilization
        value: "50"
```

Part 2: Alertmanager
Managed by the Prometheus Operator via the Alertmanager CR:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  version: v0.31.1
  image: quay.io/prometheus/alertmanager:v0.31.1
  replicas: 1
  serviceAccountName: prometheus

  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: nfs-client
        accessModes: [ReadWriteMany]
        resources:
          requests:
            storage: 1Gi

  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 256Mi

  securityContext:
    fsGroup: 65534
    runAsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534
    seccompProfile:
      type: RuntimeDefault

  externalUrl: https://alertmanager-k3s.example.io
```

The Alertmanager config (routing rules, SMTP credentials) lives in alertmanager/secret.yaml and must be SOPS-encrypted before committing:

```shell
sops --encrypt --age <YOUR_AGE_PUBLIC_KEY> \
  --encrypted-regex '^(data|stringData)$' \
  --in-place monitoring/alertmanager/secret.yaml
```

Part 3: Loki
Loki runs in monolithic mode (-target=all) as a single-replica StatefulSet with filesystem storage on NFS.
Configuration
```yaml
data:
  loki.yaml: |
    target: all
    auth_enabled: false
    server:
      http_listen_port: 3100
      grpc_listen_port: 9095
      log_level: info
    common:
      path_prefix: /loki
      ring:
        instance_addr: 0.0.0.0
        kvstore:
          store: inmemory
      replication_factor: 1
    schema_config:
      configs:
        - from: "2024-01-01"
          store: tsdb
          object_store: filesystem
          schema: v13
          index:
            prefix: index_
            period: 24h
    storage_config:
      filesystem:
        directory: /loki/chunks
      tsdb_shipper:
        active_index_directory: /local/tsdb-index  # emptyDir, NOT NFS
        cache_location: /local/tsdb-cache          # emptyDir, NOT NFS
    compactor:
      working_directory: /loki/compactor
      compaction_interval: 5m
      retention_enabled: true
      delete_request_store: filesystem
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
    frontend:
      encoding: protobuf            # required for approx_topk
      compress_responses: true
      log_queries_longer_than: 5s   # log slow queries for investigation
    query_range:
      align_queries_with_step: true
      parallelise_shardable_queries: true
      cache_results: true
      results_cache:
        cache:
          embedded_cache:
            enabled: true
            max_size_mb: 100        # ~100MB RAM for query result cache
            ttl: 24h
      shard_aggregations: approx_topk  # string, NOT a YAML list
    chunk_store_config:
      chunk_cache_config:
        embedded_cache:
          enabled: true
          max_size_mb: 256          # ~256MB RAM for chunk cache
          ttl: 24h
    querier:
      max_concurrent: 16            # 16 parallel workers per instance
    query_scheduler:
      max_outstanding_requests_per_tenant: 32768  # TSDB dispatches many small requests
    limits_config:
      retention_period: 2160h             # 90 days
      reject_old_samples: true
      reject_old_samples_max_age: 2160h   # 90 days
      ingestion_rate_mb: 10
      ingestion_burst_size_mb: 20
      split_queries_by_interval: 15m   # split 24h query into 96 sub-queries
      max_query_parallelism: 32        # up from 2, allows 32 sub-queries in flight
      tsdb_max_query_parallelism: 64   # TSDB-specific, allows more shards
      query_timeout: 5m                # up from 1m default
      max_cache_freshness_per_query: 10m
      max_query_series: 5000
      allow_structured_metadata: true
      volume_enabled: true
```

Key settings:
| Setting | Value | Why |
|---|---|---|
| schema: v13 | TSDB | Latest Loki schema, required for structured metadata |
| retention_period: 2160h | 90 days | Long retention for trend analysis and incident postmortems |
| reject_old_samples_max_age: 2160h | 90 days | Matches retention period; rejects samples older than this |
| max_query_series: 5000 | High | Required for topk queries on high-cardinality Logpush data (see Part 8) |
| ingestion_rate_mb: 10 | 10 MB/s | Logpush batches can be large; default was too low |
| allow_structured_metadata: true | Required | Enables structured metadata for Alloy’s Traefik access log enrichment |
| delete_request_store: filesystem | Required | Must be set when retention_enabled: true, otherwise Loki fails to start |
| frontend.encoding: protobuf | Required | Needed for approx_topk to function correctly |
| query_range.shard_aggregations: approx_topk | String | Enables approx_topk aggregation sharding; must be a plain string, not a YAML list |
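With those two settings in place, queries can use Loki's sharded top-k approximation. An illustrative query (not taken from the dashboards; the field name comes from the structured metadata described in Part 4):

```logql
# Approximate top 10 hostnames by request volume over 24h.
# Exact topk() over high-cardinality structured metadata would need
# max_query_series raised even further; approx_topk shards the work.
approx_topk(10,
  sum by (request_host) (
    count_over_time({job="traefik-access-log"} | request_host != "" [24h])
  )
)
```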
Performance tuning (see Performance Tuning for the full rationale):
| Setting | Value | Why |
|---|---|---|
| split_queries_by_interval: 15m | 96 sub-queries per 24h range | Default 1h only creates 24; 15m gives 6.4x more parallelism |
| max_query_parallelism: 32 | Up from 2 | Allows 32 sub-queries in the work queue simultaneously |
| tsdb_max_query_parallelism: 64 | TSDB-specific | TSDB dynamic sharding generates many individually smaller requests |
| querier.max_concurrent: 16 | 16 workers | Grafana recommends ~16 for TSDB; default is 4 |
| query_timeout: 5m | Up from 1m default | Large range scans on access logs need more time |
| results_cache | Embedded, 100MB, 24h | Repeat queries return instantly from in-memory cache |
| chunk_cache_config | Embedded, 256MB, 24h | Avoids re-fetching chunks from NFS for recent data |
| tsdb_shipper.active_index_directory | /local/tsdb-index (emptyDir) | TSDB index reads on NFS add 1-10ms per operation; local disk is 0.01-0.1ms |
| query_scheduler.max_outstanding_requests_per_tenant: 32768 | High queue | TSDB dispatches many more, individually smaller requests than BoltDB |
StatefulSet
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: monitoring
spec:
  replicas: 1
  serviceName: loki-headless
  template:
    spec:
      securityContext:
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        runAsNonRoot: true
      containers:
        - name: loki
          image: docker.io/grafana/loki:3.6.5
          args:
            - -config.file=/etc/loki/loki.yaml
          env:
            - name: GOMEMLIMIT
              value: "1600MiB"  # 80% of 2Gi limit, prevents OOM
            - name: GOGC
              value: "75"       # more aggressive GC for ARM64
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2000m
              memory: 2Gi
          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: data
              mountPath: /loki
            - name: tsdb-local
              mountPath: /local
      volumes:
        - name: tsdb-local
          emptyDir:
            sizeLimit: 2Gi
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        storageClassName: nfs-client
        accessModes: [ReadWriteMany]
        resources:
          requests:
            storage: 20Gi
```

GOMEMLIMIT (Go 1.19+) tells the runtime to start aggressive garbage collection when approaching the limit, preventing OOM kills. Set it to ~80% of the container memory limit. GOGC=75 (default 100) triggers GC slightly more frequently, which helps on memory-constrained ARM64 nodes.
The tsdb-local emptyDir volume stores TSDB index and cache on the node’s local disk instead of NFS. Index lookups are small random reads that compound dramatically over NFS latency. On pod restart, the index is automatically rebuilt from chunks (takes a few minutes), so ephemeral storage is safe here.
WAL auto-recovery init container
Loki’s Write-Ahead Log (WAL) on NFS can become corrupted after power outages or unclean shutdowns. Symptoms: Loki enters a crash loop with "segments are not sequential" or stale .tmp checkpoint errors. The init container detects and clears corrupt WAL state before Loki starts:
```yaml
initContainers:
  - name: wal-cleanup
    image: busybox:1.37.0
    command:
      - sh
      - -c
      - |
        WAL_DIR=/loki/wal
        COMPACTOR_DIR=/loki/compactor

        if [ ! -d "$WAL_DIR" ] || [ -z "$(ls -A $WAL_DIR 2>/dev/null)" ]; then
          echo "wal-cleanup: no WAL directory or empty — clean start"
          exit 0
        fi

        # Check for non-sequential WAL segments (gaps cause "segments are not sequential")
        CORRUPT=false
        SEGMENTS=$(find "$WAL_DIR" -maxdepth 1 -type f -name '[0-9]*' | sort)
        if [ -n "$SEGMENTS" ]; then
          PREV=-1
          for SEG in $SEGMENTS; do
            NUM=$(basename "$SEG" | sed 's/^0*//' | sed 's/^$/0/')
            if [ "$PREV" -ge 0 ] && [ "$NUM" -ne $((PREV + 1)) ]; then
              echo "wal-cleanup: gap detected between segment $PREV and $NUM"
              CORRUPT=true
              break
            fi
            PREV=$NUM
          done
        fi

        # Check for stale .tmp checkpoint directories
        if ls -d "$WAL_DIR"/checkpoint.*.tmp 2>/dev/null | grep -q .; then
          echo "wal-cleanup: stale .tmp checkpoint directories found"
          CORRUPT=true
        fi

        if [ "$CORRUPT" = "true" ]; then
          echo "wal-cleanup: corrupt WAL detected — cleaning up"
          rm -rf "$WAL_DIR"/* "$COMPACTOR_DIR"/*
          echo "wal-cleanup: cleanup complete — Loki will start with empty WAL"
        else
          echo "wal-cleanup: WAL segments are sequential — no cleanup needed"
        fi
    volumeMounts:
      - name: data
        mountPath: /loki
    securityContext:
      runAsUser: 10001
      runAsGroup: 10001
```

Two corruption patterns are detected:
- Non-sequential WAL segments: After a power outage, NFS writes may be partially flushed, leaving gaps in the numbered segment files (e.g., segments 0, 1, 3 — gap at 2). Loki’s WAL reader requires strictly sequential segments.
- Stale `.tmp` checkpoint directories: A checkpoint operation interrupted mid-write leaves a `checkpoint.*.tmp` directory. Loki treats this as a fatal error on startup.
When either is detected, the init container wipes both /loki/wal/ and /loki/compactor/. In-flight log lines in the WAL are lost (typically seconds of data), but Loki starts cleanly. Already-flushed chunks on disk are unaffected.
Loki performance tuning
The default Loki monolithic configuration is heavily throttled. Out of the box, max_query_parallelism: 2 means a 24h range query is split into 24 one-hour chunks, but only 2 can execute at a time. Combined with NFS-backed TSDB index reads and no caching, dashboard panels with complex LogQL queries (particularly those using | json full-line parsing) were taking 17-24 seconds each.
The fix has four layers:
1. Query parallelism and splitting. split_queries_by_interval: 15m breaks a 24h query into 96 sub-queries instead of 24. max_query_parallelism: 32 allows 32 of those to be in the work queue simultaneously. querier.max_concurrent: 16 runs 16 parallel workers per Loki instance. TSDB’s dynamic sharding further subdivides each time split based on chunk size statistics, targeting 300-600MB per shard. The net effect is that a 24h query that previously ran as 24 serial 1h chunks now fans out across 96 parallel 15m chunks.
2. Embedded caching. Loki supports in-memory caching with zero external dependencies (no memcached/Redis needed). The results cache (100MB) stores completed query responses — repeat queries and dashboard refreshes return instantly. The chunks cache (256MB) stores decompressed chunk data, speeding up first-time queries for recent data by ~30-50%. Total cost: ~356MB of RAM.
3. Local TSDB index. TSDB index directories (active_index_directory, cache_location) are moved from the NFS PVC to an emptyDir volume backed by the node’s local disk. Every index lookup (series resolution, shard planning, chunk reference) was going over NFS at 1-10ms per operation. On local disk, the same operations take 0.01-0.1ms. The index is lightweight and automatically rebuilt from chunks on pod restart, so ephemeral storage is safe.
4. Structured metadata instead of | json parsing. The security dashboard originally used | json | FieldName = "value" on every panel — decompressing and JSON-parsing every log line. After adding downstream_status and user_agent to Alloy’s structured metadata extraction, the dashboard queries were rewritten to filter on SM fields directly (e.g., | downstream_status = "403" instead of | json | DownstreamStatus = "403"). SM filtering happens before line decompression, skipping the expensive JSON parse entirely.
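Concretely, the rewrite looks like this (illustrative queries in the shape described above, not copied verbatim from the dashboard JSON):

```logql
# Before: every line is decompressed and JSON-parsed just to check one field
sum(count_over_time(
  {job="traefik-access-log"} | json | DownstreamStatus = "403" [24h]
))

# After: filter on the structured-metadata field extracted at ingest time;
# the label-style filter is evaluated without parsing the log line
sum(count_over_time(
  {job="traefik-access-log"} | downstream_status = "403" [24h]
))
```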
Expected impact (combined):
| Scenario | Before | After |
|---|---|---|
| 24h range query, first run | 17-24s | 2-5s |
| Same query, second run (cached) | 17-24s | under 1s |
| Dashboard with 17 simultaneous panels | Timeouts | 3-8s total |
Part 4: Grafana Alloy
Section titled “Part 4: Grafana Alloy”Grafana Alloy serves two roles in this stack:
- DaemonSet (`alloy/`) — runs on every node, collects pod logs and forwards OTLP traces
- Deployment (`alloy-logpush/`) — single instance, receives Cloudflare Logpush data (covered in Part 7)
DaemonSet configuration
The DaemonSet Alloy discovers pods on its node, tails their log files, and forwards to Loki. It also receives OTLP traces and batches them to Jaeger. Traefik access logs get special treatment: they are parsed as JSON and enriched with structured metadata for the access log dashboard.
```alloy
logging {
  level  = "info"
  format = "logfmt"
}

// Pod discovery and log collection
discovery.kubernetes "pods" {
  role = "pod"
  selectors {
    role  = "pod"
    field = "spec.nodeName=" + coalesce(env("HOSTNAME"), "")
  }
}

discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_phase"]
    regex         = "Pending|Succeeded|Failed|Unknown"
    action        = "drop"
  }
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
    separator     = "/"
    target_label  = "__path__"
    replacement   = "/var/log/pods/*$1/*.log"
  }
}

local.file_match "pod_logs" {
  path_targets = discovery.relabel.pod_logs.output
}

loki.source.file "pod_logs" {
  targets    = local.file_match.pod_logs.targets
  forward_to = [loki.process.pod_logs.receiver]
}

loki.process "pod_logs" {
  stage.cri {}

  // Traefik access logs: parse JSON, extract structured labels + metadata
  stage.match {
    selector = "{namespace=\"traefik\", container=\"traefik\"}"

    stage.json {
      expressions = {
        status            = "DownstreamStatus",
        downstream_status = "DownstreamStatus",
        method            = "RequestMethod",
        router            = "RouterName",
        service           = "ServiceName",
        entrypoint        = "entryPointName",
        client_ip         = "ClientHost",
        real_client_ip    = "request_X-Real-Client-Ip",
        bot_score         = "request_X-Bot-Score",
        blocked_by        = "request_X-Blocked-By",
        country           = "request_X-Geo-Country",
        cf_connecting_ip  = "request_Cf-Connecting-Ip",
        request_host      = "RequestHost",
        request_path      = "RequestPath",
        request_protocol  = "RequestProtocol",
        duration          = "Duration",
        origin_duration   = "OriginDuration",
        overhead          = "Overhead",
        downstream_size   = "DownstreamContentSize",
        tls_version       = "TLSVersion",
        user_agent        = "request_User-Agent",
      }
    }

    // Low-cardinality fields → labels (fast filtering)
    stage.labels {
      values = {
        entrypoint = "",
        method     = "",
      }
    }

    // High-cardinality fields → structured metadata (19 fields)
    // (queryable but don't create new label streams — requires Loki 3.x + TSDB v13)
    stage.structured_metadata {
      values = {
        status            = "",
        downstream_status = "",
        router            = "",
        service           = "",
        client_ip         = "",
        real_client_ip    = "",
        bot_score         = "",
        blocked_by        = "",
        country           = "",
        cf_connecting_ip  = "",
        request_host      = "",
        request_path      = "",
        request_protocol  = "",
        duration          = "",
        origin_duration   = "",
        overhead          = "",
        downstream_size   = "",
        tls_version       = "",
        user_agent        = "",
      }
    }

    // Prometheus counters generated from access logs (scraped by Prometheus as loki_process_custom_*)
    // 7 counters: 1 total + 5 per-block-type + 1 for all 403s
    stage.metrics {
      // Total access log requests (all lines in this match block)
      metric.counter {
        name        = "traefik_access_requests_total"
        description = "Total Traefik access log requests"
        match_all   = true
        action      = "inc"
      }

      // Blocked by sentinel bot scoring (X-Blocked-By: sentinel)
      metric.counter {
        name        = "traefik_access_sentinel_blocks_total"
        description = "Requests blocked by Sentinel bot scoring"
        source      = "blocked_by"
        value       = "sentinel"
        action      = "inc"
      }

      // Blocked by sentinel blocklist (X-Blocked-By: sentinel-blocklist)
      metric.counter {
        name        = "traefik_access_blocklist_blocks_total"
        description = "Requests blocked by Sentinel IP blocklist"
        source      = "blocked_by"
        value       = "sentinel-blocklist"
        action      = "inc"
      }

      // Blocked by rate limiting (X-Blocked-By: rate-limit)
      metric.counter {
        name        = "traefik_access_ratelimit_blocks_total"
        description = "Requests blocked by Sentinel rate limiting"
        source      = "blocked_by"
        value       = "rate-limit"
        action      = "inc"
      }

      // Blocked by sentinel firewall rules (X-Blocked-By: sentinel-rule)
      metric.counter {
        name        = "traefik_access_sentinel_rule_blocks_total"
        description = "Requests blocked by Sentinel firewall rules"
        source      = "blocked_by"
        value       = "sentinel-rule"
        action      = "inc"
      }

      // Tarpitted by sentinel (X-Blocked-By: sentinel-tarpit)
      metric.counter {
        name        = "traefik_access_tarpit_blocks_total"
        description = "Requests tarpitted by Sentinel"
        source      = "blocked_by"
        value       = "sentinel-tarpit"
        action      = "inc"
      }

      // 403 responses (any source)
      metric.counter {
        name        = "traefik_access_403_total"
        description = "Total 403 responses"
        source      = "downstream_status"
        value       = "403"
        action      = "inc"
      }
    }

    stage.static_labels {
      values = {
        job = "traefik-access-log",
      }
    }
  }

  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push"
  }
}

// OTLP trace receiver -> Jaeger
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  output {
    traces = [otelcol.exporter.otlp.jaeger.input]
  }
}

otelcol.exporter.otlp "jaeger" {
  client {
    endpoint = "jaeger-collector.monitoring.svc.cluster.local:4317"
    tls {
      insecure = true
    }
  }
}
```

The pipeline:
- `discovery.kubernetes` discovers pods on the current node (filtered by `HOSTNAME` env var)
- `discovery.relabel` extracts namespace/pod/container labels and constructs the log file path
- `loki.source.file` tails the CRI log files under `/var/log/pods/`
- `loki.process` applies the `stage.cri {}` pipeline to parse CRI-format log lines
- `stage.match` selectively processes Traefik container logs (see below)
- `loki.write` pushes to Loki
- `otelcol.receiver.otlp` receives traces from applications on gRPC 4317 / HTTP 4318
- `otelcol.processor.batch` batches traces for efficiency
- `otelcol.exporter.otlp` forwards to Jaeger’s collector
Traefik access log enrichment
The stage.match block targets only logs from the traefik namespace/container. Traefik writes two types of log lines: JSON access logs and logfmt debug/error logs. The stage.json parser silently skips non-JSON lines (no-op, no drop), so debug logs pass through unmodified.
Fields are split into two tiers based on cardinality:
| Tier | Fields | Mechanism | Purpose |
|---|---|---|---|
| Labels (low-cardinality) | entrypoint, method | stage.labels | Fast stream selection in LogQL |
| Structured metadata (high-cardinality, 19 fields) | status, downstream_status, router, service, client_ip, real_client_ip, bot_score, blocked_by, country, cf_connecting_ip, request_host, request_path, request_protocol, duration, origin_duration, overhead, downstream_size, tls_version, user_agent | stage.structured_metadata | Queryable without creating new label streams |
Structured metadata is a Loki 3.x feature (requires TSDB v13 schema and allow_structured_metadata: true). Unlike labels, structured metadata does not affect stream identity — adding a new metadata field does not create new streams or increase index size. This is critical for high-cardinality fields like IP addresses and request paths.
Prometheus counters from access logs:
The stage.metrics block generates 7 Prometheus counters from the extracted JSON fields. These appear at Alloy’s /metrics endpoint (scraped by Prometheus) with the loki_process_custom_ prefix:
| Counter (in Prometheus) | Source field | Match condition |
|---|---|---|
| loki_process_custom_traefik_access_requests_total | (all lines) | match_all = true |
| loki_process_custom_traefik_access_sentinel_blocks_total | blocked_by | = "sentinel" (bot scoring) |
| loki_process_custom_traefik_access_blocklist_blocks_total | blocked_by | = "sentinel-blocklist" (IPsum) |
| loki_process_custom_traefik_access_ratelimit_blocks_total | blocked_by | = "rate-limit" |
| loki_process_custom_traefik_access_sentinel_rule_blocks_total | blocked_by | = "sentinel-rule" (firewall rules) |
| loki_process_custom_traefik_access_tarpit_blocks_total | blocked_by | = "sentinel-tarpit" |
| loki_process_custom_traefik_access_403_total | downstream_status | = "403" (all sources) |
The source field in stage.metrics reads from the extracted data map populated by stage.json, NOT from structured metadata. The JSON key for the blocked-by header is blocked_by (mapped from the Traefik access log’s request_X-Blocked-By field via stage.json). These counters power the Security Dashboard’s instant-loading aggregate statistics and the Grafana Traefik Access Logs dashboard’s Sentinel Security section.
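Dashboard panels over these counters are then plain PromQL against Prometheus, with no Loki query at all. Two illustrative panel queries (shapes assumed, not copied from the dashboard JSON):

```promql
# Per-second block rate by the two commonest mechanisms over the last hour
rate(loki_process_custom_traefik_access_sentinel_blocks_total[1h])
  + rate(loki_process_custom_traefik_access_blocklist_blocks_total[1h])

# Fraction of all requests that ended in a 403
sum(rate(loki_process_custom_traefik_access_403_total[1h]))
  / sum(rate(loki_process_custom_traefik_access_requests_total[1h]))
```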
The stage.static_labels block adds job = "traefik-access-log", letting you query access logs specifically: {job="traefik-access-log"}.
Part 5: Jaeger
Jaeger v2 uses the OpenTelemetry Collector config format. It runs as an all-in-one Deployment with Badger embedded storage on an NFS PVC (10Gi). The config includes a spanmetrics connector that generates R.E.D. (Rate, Error, Duration) metrics from traces and exports them to Prometheus, plus a metric_backends config that lets Jaeger UI query those metrics for the Monitor tab.
Configuration
Section titled “Configuration”data: ui-config.json: | { "monitor": { "menuEnabled": true }, "dependencies": { "menuEnabled": true } }
config.yaml: | service: extensions: - jaeger_storage - jaeger_query - healthcheckv2 pipelines: traces: receivers: [otlp] processors: [batch] exporters: [jaeger_storage_exporter, spanmetrics] metrics/spanmetrics: receivers: [spanmetrics] exporters: [prometheus] telemetry: resource: service.name: jaeger metrics: level: detailed readers: - pull: exporter: prometheus: host: 0.0.0.0 port: 8888 logs: level: info
extensions: healthcheckv2: use_v2: true http: endpoint: 0.0.0.0:13133 jaeger_query: storage: traces: badger_main metrics: prometheus_store ui: config_file: /etc/jaeger/ui-config.json jaeger_storage: backends: badger_main: badger: directories: keys: /badger/data/keys values: /badger/data/values ephemeral: false ttl: spans: 168h metric_backends: prometheus_store: prometheus: endpoint: http://prometheus.monitoring.svc:9090 normalize_calls: true normalize_duration: true
receivers: otlp: protocols: grpc: { endpoint: 0.0.0.0:4317 } http: { endpoint: 0.0.0.0:4318 }
processors: batch: send_batch_size: 10000 timeout: 5s
connectors: spanmetrics: dimensions: - name: http.method - name: http.status_code - name: http.route aggregation_cardinality_limit: 1500 aggregation_temporality: AGGREGATION_TEMPORALITY_CUMULATIVE metrics_flush_interval: 15s metrics_expiration: 5m
exporters: jaeger_storage_exporter: trace_storage: badger_main prometheus: endpoint: 0.0.0.0:8889 resource_to_telemetry_conversion: enabled: trueThe pipeline architecture has two key features:
- Dual-export traces pipeline: The `traces` pipeline fans out to both `jaeger_storage_exporter` (Badger storage) and the `spanmetrics` connector. The connector generates R.E.D. metrics from every span.
- Spanmetrics → Prometheus pipeline: The `metrics/spanmetrics` pipeline receives metrics from the connector and exports them via the Prometheus exporter on port 8889. These metrics (call counts, duration histograms, error rates by service/operation) are scraped by Prometheus and queryable in Grafana.
- Metric backends: The `metric_backends.prometheus_store` config tells Jaeger’s query extension to read R.E.D. metrics from Prometheus. This powers the Monitor tab in Jaeger UI, showing service-level latency and error rate graphs. `normalize_calls` and `normalize_duration` ensure metric names match the OpenTelemetry semantic conventions.
The Deployment uses strategy: Recreate since Badger uses file locking and cannot run multiple instances:
```yaml
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    spec:
      containers:
        - name: jaeger
          image: docker.io/jaegertracing/jaeger:2.15.1
          args: [--config, /etc/jaeger/config.yaml]
          ports:
            - name: otlp-grpc
              containerPort: 4317
            - name: otlp-http
              containerPort: 4318
            - name: query-http
              containerPort: 16686
            - name: metrics
              containerPort: 8888
            - name: health
              containerPort: 13133
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 2Gi
```

Log-to-trace correlation
Loki’s datasource config includes derivedFields that extract trace IDs from log lines and link them to Jaeger:
```yaml
# In grafana/datasources.yaml
- name: Loki
  type: loki
  uid: loki
  url: http://loki.monitoring.svc:3100
  jsonData:
    derivedFields:
      - datasourceUid: jaeger
        matcherRegex: '"traceID":"(\w+)"'
        name: traceID
        url: "$${__value.raw}"
```

When a log line contains a traceID field, Grafana renders it as a clickable link that opens the trace in Jaeger.
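The matcherRegex can be exercised outside Grafana to sanity-check it. The log line below is a hypothetical example of the JSON shape the regex expects:

```python
import re

# Same pattern as the datasource's matcherRegex: capture the value
# of a "traceID" field inside a JSON-formatted log line.
MATCHER = re.compile(r'"traceID":"(\w+)"')

# Hypothetical log line; real app logs just need the same field shape.
line = '{"level":"info","msg":"request done","traceID":"6f2a9c4d1e8b"}'

match = MATCHER.search(line)
trace_id = match.group(1) if match else None
print(trace_id)  # -> 6f2a9c4d1e8b
```

Note the regex requires no whitespace after the colon, which matches compact JSON encoders; pretty-printed logs would need a looser pattern.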
Part 6: Grafana
Authentication with Authentik SSO
Grafana uses Authentik as an OAuth2/OIDC provider:
```ini
[auth]
oauth_allow_insecure_email_lookup = true

[auth.generic_oauth]
enabled = true
name = Authentik
allow_sign_up = true
auto_login = false
scopes = openid email profile
auth_url = https://authentik.example.io/application/o/authorize/
token_url = https://authentik.example.io/application/o/token/
api_url = https://authentik.example.io/application/o/userinfo/
signout_redirect_url = https://authentik.example.io/application/o/grafana/end-session/
role_attribute_path = contains(groups, 'Grafana Admins') && 'Admin' || contains(groups, 'Grafana Editors') && 'Editor' || 'Viewer'
groups_attribute_path = groups
login_attribute_path = preferred_username
name_attribute_path = name
email_attribute_path = email
use_pkce = true
use_refresh_token = true
```

Role mapping via Authentik groups:
| Authentik Group | Grafana Role |
|---|---|
| Grafana Admins | Admin |
| Grafana Editors | Editor |
| (everyone else) | Viewer |
Credentials (oauth-client-id, oauth-client-secret) are stored in grafana-secret and injected as env vars. The secret must be SOPS-encrypted.
Datasources
Four datasources are provisioned via a directly-mounted ConfigMap (not the sidecar):
```yaml
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus.monitoring.svc:9090
    isDefault: true
    jsonData:
      httpMethod: POST
      timeInterval: 30s

  - name: Alertmanager
    type: alertmanager
    uid: alertmanager
    url: http://alertmanager.monitoring.svc:9093
    jsonData:
      implementation: prometheus

  - name: Loki
    type: loki
    uid: loki
    url: http://loki.monitoring.svc:3100
    jsonData:
      derivedFields:
        - datasourceUid: jaeger
          matcherRegex: '"traceID":"(\w+)"'
          name: traceID
          url: "$${__value.raw}"

  - name: Jaeger
    type: jaeger
    uid: jaeger
    url: http://jaeger-query.monitoring.svc:16686
```

Deployment
The Grafana Deployment has two containers: the k8s-sidecar for dashboard provisioning and Grafana itself:
```yaml
containers:
  - name: grafana-sc-dashboard
    image: quay.io/kiwigrid/k8s-sidecar:2.5.0
    env:
      - name: LABEL
        value: grafana_dashboard
      - name: LABEL_VALUE
        value: "1"
      - name: METHOD
        value: WATCH
      - name: FOLDER
        value: /tmp/dashboards
      - name: NAMESPACE
        value: ALL
      - name: RESOURCE
        value: configmap
    resources:
      requests:
        cpu: 50m
        memory: 64Mi

  - name: grafana
    image: docker.io/grafana/grafana:12.3.3
    env:
      - name: GF_SECURITY_ADMIN_USER
        valueFrom:
          secretKeyRef:
            name: grafana-secret
            key: admin-user
      - name: GF_SECURITY_ADMIN_PASSWORD
        valueFrom:
          secretKeyRef:
            name: grafana-secret
            key: admin-password
      - name: GF_AUTH_GENERIC_OAUTH_CLIENT_ID
        valueFrom:
          secretKeyRef:
            name: grafana-secret
            key: oauth-client-id
      - name: GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET
        valueFrom:
          secretKeyRef:
            name: grafana-secret
            key: oauth-client-secret
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
```

The sidecar in WATCH mode detects ConfigMaps with `grafana_dashboard: "1"` across all namespaces and writes them to /tmp/dashboards. Grafana’s dashboard provider reads from that directory.
Dashboard management
All 16 dashboards are standalone .json files managed by kustomize configMapGenerator:
```yaml
generatorOptions:
  disableNameSuffixHash: true
  labels:
    grafana_dashboard: "1"

configMapGenerator:
  - name: alertmanager-dashboard
    files:
      - alertmanager.json
  - name: cloudflare-logpush-dashboard
    files:
      - cloudflare-logpush.json
  # ... 14 more entries (16 total)
```

This replaced the previous approach of inlining dashboard JSON inside YAML ConfigMaps. The benefits:
- JSON files get proper syntax highlighting in editors
- No YAML escaping issues with special characters in JSON
- Files can be imported/exported directly from Grafana’s UI
- Easy to diff and review in git
| Dashboard | Source | Panels |
|---|---|---|
| Alertmanager | grafana.com | ~6 |
| Alloy | grafana.com | ~30 |
| Authentik | grafana.com | ~20 |
| Blackbox Exporter | grafana.com | ~12 |
| Cloudflare Logpush | custom gen script | 135 |
| Cloudflare Tunnel | custom gen script | 67 |
| CoreDNS | grafana.com | ~15 |
| Grafana Stats | grafana.com | ~8 |
| Jaeger | grafana.com | ~20 |
| K8s Cluster | grafana.com | ~15 |
| Loki | grafana.com | ~40 |
| Node Exporter | grafana.com | ~40 |
| Prometheus | grafana.com | ~35 |
| Security | custom | ~20 |
| Traefik | grafana.com | ~25 |
| Traefik Access Logs | custom | ~15 |
Adding upstream dashboards from grafana.com:
```sh
cd monitoring/grafana/dashboards/
./add-dashboard.sh <gnet-id> <name> [revision]

# Example:
./add-dashboard.sh 1860 node-exporter 37
```

The script downloads the JSON, replaces all datasource template variables with hardcoded UIDs (`prometheus`, `loki`), strips `__inputs`/`__requires`, fixes deprecated panel types (`grafana-piechart-panel` → `piechart`), writes a standalone .json file, and adds a configMapGenerator entry.
Regenerating custom dashboards:
```sh
python3 gen-cloudflare-logpush.py           # 135 panels (123 content + 12 section rows)
python3 gen-cloudflare-logpush.py --export  # Portable export for grafana.com sharing
python3 gen-cloudflared.py                  # 67 panels (58 content + 9 section rows)
python3 gen-cloudflared.py --export         # Portable export for grafana.com sharing
```

Both generators support `--export`, which replaces hardcoded datasource UIDs with template variables (`${DS_LOKI}`, `${DS_PROMETHEUS}`) and adds `__inputs`/`__requires` arrays for Grafana.com compatibility. Export files are written with an `-export` suffix.
Python dashboard generators
Custom dashboards are generated by Python scripts rather than hand-edited JSON. A 135-panel dashboard is thousands of lines of JSON but only ~1200 lines of Python with reusable helper functions:
Both generators share the same architecture: helper functions produce Grafana panel dictionaries, which are assembled into a dashboard JSON and written to disk. All helpers accept a desc="" parameter for panel descriptions (shown as tooltips in Grafana).
gen-cloudflare-logpush.py (Loki datasource):
```python
#!/usr/bin/env python3
"""Generate the Cloudflare Logpush Grafana dashboard JSON."""
import json, sys
from country_codes import COUNTRY_NAMES  # 249 ISO 3166-1 Alpha-2 entries

EXPORT = "--export" in sys.argv
DS = {"type": "loki", "uid": "${DS_LOKI}"} if EXPORT else {"type": "loki", "uid": "loki"}

# Helper functions - all accept desc="" for panel descriptions
def stat_panel(id, title, expr, legend, x, y, w=6, unit="short", thresholds=None, instant=True, desc=""): ...
def ts_panel(id, title, targets, x, y, w=12, h=8, unit="short", stack=True, overrides=None, fill=20, legend_calcs=None, desc=""): ...
def table_panel(id, title, expr, legend, x, y, w=8, h=8, extra_overrides=None, desc=""): ...
def pie_panel(id, title, expr, legend, x, y, w=6, h=8, overrides=None, desc=""): ...  # legend.placement: "right" for 10+ slices
def bar_panel(id, title, targets, x, y, w=12, h=8, unit="short", stack=True, overrides=None, desc=""): ...
def geomap_panel(id, title, expr, lookup_field, x, y, w=16, h=10, desc=""): ...

# Selective JSON parsing - only extract the fields each query needs
def http(*fields):
    """Build LogQL selector for http_requests with template variable filters."""
    # Always includes _HTTP_FILTER_FIELDS (ClientRequestHost, ClientCountry,
    # ClientRequestPath, ClientIP, JA4, ClientASN, EdgeColoCode) for filtering

def fw(*fields):
    """Build LogQL selector for firewall_events."""

def wk(*fields):
    """Build LogQL selector for workers_trace_events."""

# Override helpers for human-readable labels
def country_name_overrides(): ...  # ISO Alpha-2 -> country name
def country_value_mappings_override(column_name): ...
```

gen-cloudflared.py (Prometheus datasource):
```python
#!/usr/bin/env python3
"""Generate the Cloudflare Tunnel (cloudflared) Grafana dashboard JSON."""
import json, os

DS = {"type": "prometheus", "uid": "prometheus"}

# Same helper pattern, plus cloudflared-specific panel types
def stat_panel(id, title, expr, legend, x, y, w=6, unit="short", thresholds=None, decimals=None, desc="", mappings=None): ...
def ts_panel(id, title, targets, x, y, w=12, h=8, unit="short", stack=False, overrides=None, fill=20, desc="", legend_calcs=None): ...
def gauge_panel(id, title, expr, legend, x, y, w=6, h=6, unit="percent", thresholds=None, desc="", min_val=0, max_val=100): ...
def table_panel(id, title, expr, legend, x, y, w=12, h=8, desc=""): ...
def text_panel(id, content, x, y, w=24, h=4, title="", desc=""): ...
```

Key design difference: the Logpush generator uses selective JSON parsing (`| json field1, field2` instead of full `| json`) because Logpush events have ~72 fields. Each query extracts only the fields it needs, plus filter fields for template variable support. The cloudflared generator uses standard PromQL since Prometheus metrics are already structured.
The ts_panel legend_calcs parameter controls which calculations appear in the legend footer. Default is ["sum", "mean"] for Logpush (count-based) and ["mean", "max"] for cloudflared (gauge-based). Ratio panels and timing panels override this to ["mean", "lastNotNull"].
Part 7: Cloudflare Logpush Pipeline
This is the most complex part of the stack. Cloudflare Logpush pushes HTTP request logs, firewall events, and Workers trace events as gzip-compressed NDJSON to an HTTPS endpoint on the cluster. The challenge: Alloy’s /loki/api/v1/raw endpoint does not handle gzip, and Traefik has no built-in request body decompression.
The compression problem
When Cloudflare Logpush sends data to an HTTP destination:
- Logpush always gzip-compresses HTTP payloads — no way to disable this
- Alloy’s `loki.source.api` `/loki/api/v1/raw` endpoint does not handle `Content-Encoding: gzip` — confirmed by reading the Alloy source. Only `/loki/api/v1/push` (protobuf/JSON) handles gzip
- Traefik’s `compress` middleware only handles response compression, not request body decompression
This means a decompression layer is needed between Cloudflare and Alloy.
The Traefik decompress plugin
I wrote a Traefik Yaegi (Go interpreter) local plugin that intercepts Content-Encoding: gzip requests, decompresses the body, and passes through to the next handler:
```go
package decompress

import (
	"bytes"
	"compress/gzip"
	"context"
	"fmt"
	"io"
	"net/http"
	"strconv"
	"strings"
)

type Config struct{}

func CreateConfig() *Config { return &Config{} }

type Decompress struct {
	next http.Handler
	name string
}

func New(ctx context.Context, next http.Handler, config *Config, name string) (http.Handler, error) {
	return &Decompress{next: next, name: name}, nil
}

func (d *Decompress) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
	encoding := strings.ToLower(req.Header.Get("Content-Encoding"))
	if encoding != "gzip" {
		d.next.ServeHTTP(rw, req)
		return
	}

	gzReader, err := gzip.NewReader(req.Body)
	if err != nil {
		http.Error(rw, fmt.Sprintf("failed to create gzip reader: %v", err), http.StatusBadRequest)
		return
	}
	defer gzReader.Close()

	decompressed, err := io.ReadAll(gzReader)
	if err != nil {
		http.Error(rw, fmt.Sprintf("failed to decompress body: %v", err), http.StatusBadRequest)
		return
	}

	req.Body = io.NopCloser(bytes.NewReader(decompressed))
	req.ContentLength = int64(len(decompressed))
	req.Header.Set("Content-Length", strconv.Itoa(len(decompressed)))
	req.Header.Del("Content-Encoding")

	d.next.ServeHTTP(rw, req)
}
```

Published at github.com/erfianugrah/decompress, tagged v0.1.0.
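The plugin's logic is easy to model outside Traefik: gzip a payload, run the same decide-and-decompress step, and confirm the round trip. This is a Python model of the behavior, not the plugin itself:

```python
import gzip

def decompress_request(body: bytes, headers: dict) -> tuple:
    """Model of the middleware: pass through unless Content-Encoding is
    gzip, otherwise decompress and fix up the entity headers."""
    if headers.get("Content-Encoding", "").lower() != "gzip":
        return body, headers
    plain = gzip.decompress(body)
    out = dict(headers)
    out.pop("Content-Encoding")
    out["Content-Length"] = str(len(plain))
    return plain, out

# Simulate what Logpush sends: a gzip-compressed NDJSON record.
payload = b'{"_dataset":"http_requests","RayID":"abc123"}\n'
compressed = gzip.compress(payload)
body, hdrs = decompress_request(compressed, {"Content-Encoding": "gzip"})
print(body == payload, hdrs["Content-Length"])
```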
Deploying the plugin on k3s
Traefik loads local plugins from /plugins-local/src/<moduleName>/. Since Traefik runs with readOnlyRootFilesystem: true, the plugin files are packaged as a ConfigMap and mounted:
Step 1: ConfigMap in traefik namespace containing decompress.go, go.mod, .traefik.yml:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-plugin-decompress
  namespace: traefik
data:
  decompress.go: |
    package decompress
    // ... (full Go source)
  go.mod: |
    module github.com/erfianugrah/decompress

    go 1.22
  .traefik.yml: |
    displayName: Decompress Request Body
    type: middleware
    import: github.com/erfianugrah/decompress
    summary: Decompresses gzip-encoded request bodies for upstream services.
    testData: {}
```

Step 2: Volume mount in the Traefik Deployment:
```yaml
volumeMounts:
  - name: plugin-decompress
    mountPath: /plugins-local/src/github.com/erfianugrah/decompress
    readOnly: true
volumes:
  - name: plugin-decompress
    configMap:
      name: traefik-plugin-decompress
```

Step 3: Traefik arg to enable the plugin:
```yaml
args:
  - "--experimental.localPlugins.decompress.moduleName=github.com/erfianugrah/decompress"
```

Step 4: Middleware CRD (must be in the same namespace as the IngressRoute):
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: decompress
  namespace: monitoring
spec:
  plugin:
    decompress: {}
```

Dataset-agnostic Alloy receiver
The Alloy Logpush receiver runs as a separate Deployment. The key design: it knows nothing about individual Logpush datasets. Each job injects a _dataset field via output_options.record_prefix, and Alloy extracts only that as a label:
```alloy
loki.source.api "cloudflare" {
  http {
    listen_address = "0.0.0.0"
    listen_port    = 3500
  }
  labels = {
    job = "cloudflare-logpush",
  }
  forward_to = [loki.process.cloudflare.receiver]
}

loki.process "cloudflare" {
  stage.json {
    expressions = { dataset = "_dataset" }
  }
  stage.labels {
    values = { dataset = "dataset" }
  }
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push"
  }
}
```
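The record_prefix contract between the Logpush jobs and this pipeline can be illustrated in a few lines of Python. The event fields are made up; only `_dataset` matters to Alloy:

```python
import json

# output_options.record_prefix is literally prepended to each NDJSON
# record by Logpush, so the wire format of one line is:
record_prefix = '{"_dataset":"http_requests",'
event_body = '"ClientIP":"1.2.3.4","RayID":"abc123"}'
line = record_prefix + event_body

# What the Alloy pipeline does conceptually: parse the JSON, promote
# _dataset to a stream label, keep the full line as the log body.
parsed = json.loads(line)
labels = {"job": "cloudflare-logpush", "dataset": parsed["_dataset"]}
print(labels["dataset"])  # http_requests
```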
DNS and tunnel routing
The Logpush endpoint needs a public HTTPS URL. This is provided by the Cloudflare Tunnel:
```hcl
resource "cloudflare_record" "logpush-k3s" {
  zone_id = var.cloudflare_secondary_zone_id
  name    = "logpush-k3s"
  type    = "CNAME"
  content = cloudflare_zero_trust_tunnel_cloudflared.k3s.cname
  proxied = true
  tags    = ["k3s", "monitoring"]
}
```
```hcl
# cloudflare-tunnel-tf/tunnel_config.tf
ingress_rule {
  hostname = "logpush-k3s.${var.secondary_domain_name}"
  service  = "https://traefik.traefik.svc.cluster.local"
  origin_request {
    origin_server_name = "logpush-k3s.${var.secondary_domain_name}"
    http2_origin       = true
    no_tls_verify      = true
  }
}
```

The IngressRoute ties hostname, middleware, and backend together:
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: alloy-logpush
  namespace: monitoring
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`logpush-k3s.example.io`)
      middlewares:
        - name: decompress
          namespace: monitoring
      services:
        - kind: Service
          name: alloy-logpush
          port: 3500
```

OpenTofu Logpush jobs
Seven Logpush jobs are managed in OpenTofu. Shared config uses locals:
```hcl
logpush_loki_dest = "https://logpush-k3s.example.io/loki/api/v1/raw?header_Content-Type=application%2Fjson&header_X-Logpush-Secret=${var.logpush_secret}"

zone_ids = {
  example_com = var.cloudflare_zone_id
  example_dev = var.secondary_cloudflare_zone_id
  example_io  = var.thirdary_cloudflare_zone_id
}
```

The destination URL uses Logpush’s header_ query parameter syntax to inject Content-Type and a shared secret header.
HTTP requests (one per zone, using for_each):
```hcl
resource "cloudflare_logpush_job" "http_loki" {
  for_each = local.zone_ids

  dataset                     = "http_requests"
  destination_conf            = local.logpush_loki_dest
  enabled                     = true
  max_upload_interval_seconds = 30

  output_options {
    output_type      = "ndjson"
    record_prefix    = "{\"_dataset\":\"http_requests\","
    field_names      = local.http_requests_fields
    timestamp_format = "rfc3339"
    cve20214428      = false
  }

  zone_id = each.value
}
```

Firewall events (same pattern, for_each over zones):
```hcl
resource "cloudflare_logpush_job" "firewall_loki" {
  for_each = local.zone_ids

  dataset          = "firewall_events"
  destination_conf = local.logpush_loki_dest
  enabled          = true

  output_options {
    output_type   = "ndjson"
    record_prefix = "{\"_dataset\":\"firewall_events\","
    field_names   = local.firewall_events_fields
  }

  zone_id = each.value
}
```

Workers trace events (account-scoped, single job):
```hcl
resource "cloudflare_logpush_job" "workers_loki" {
  dataset          = "workers_trace_events"
  destination_conf = local.logpush_loki_dest
  enabled          = true

  output_options {
    output_type   = "ndjson"
    record_prefix = "{\"_dataset\":\"workers_trace_events\","
    field_names   = local.workers_trace_events_fields
  }

  account_id = var.cloudflare_account_id
}
```

The record_prefix trick prepends {"_dataset":"http_requests", to every JSON line, producing:
```json
{"_dataset":"http_requests","ClientIP":"1.2.3.4","RayID":"abc123",...}
```

Alloy extracts _dataset as a label; everything else stays in the log line for LogQL | json.
| Dataset | Scope | Jobs | Zones |
|---|---|---|---|
| http_requests | Zone | 3 | example.com, example.dev, example.io |
| firewall_events | Zone | 3 | example.com, example.dev, example.io |
| workers_trace_events | Account | 1 | (all Workers) |
| Total | | 7 | |
Cloudflare Logpush dashboard
The custom dashboard has 135 panels (123 content + 12 section rows) across 12 sections, generated by gen-cloudflare-logpush.py. Every panel has a description tooltip explaining what it shows and how to interpret it. Published on Grafana.com as dashboard 24873.
| Section | Panels | Key visualizations |
|---|---|---|
| Overview | 8 stats | Request count, 5xx error rate, cache hit ratio, WAF attacks, bot traffic %, leaked credentials, JS detection pass rate, content scan rate |
| HTTP Requests | 22 | By host/status/method/protocol, top paths, suspicious user agents (BotScore < 30), top IPs, top ASNs, top countries, JA4 fingerprints, edge colos, device types, geomap, request lifecycle breakdown (client-edge-origin latency buckets) |
| Performance | 13 | Edge TTFB (avg/p95/p99 by host), origin timing breakdown (DNS/TCP/TLS/request/response as stacked area), client-edge RTT, request lifecycle (edge processing vs origin vs client), timing heatmaps |
| Cache Performance | 11 | Cache status distribution, hit ratio trend, tiered cache fill, cache status by host, cacheable vs uncacheable, compression ratio, content types by cache status |
| Security & Firewall | 13 | Firewall events by action/source/host/rule, top rules, firewall event timeline, top blocked IPs/paths/countries |
| API & Rate Limiting | 9 | API classification breakdown, API-matched vs unmatched, rate limit actions, API requests by host/method |
| WAF Attack Analysis | 6 | Attack score buckets (0-20 is attack), SQLi/XSS/RCE score breakdown, unmitigated attacks (high score + no action), attack source countries |
| Threat Intelligence | 9 | Leaked credential pairs, IP classification (Tor/VPN/botnet), geo anomaly on sensitive paths (login/admin/api), client IP reputation, threat score distribution |
| Bot Analysis | 8 | Bot score distribution, bot detection IDs (33 mapped IDs), JA4/JA3 fingerprints, verified bot categories, bot score vs WAF action correlation, JS detection results |
| Request Rate Analysis | 7 | Request rate by path (topk for timeseries), top paths by count (count_over_time for tables), rate by status, rate by host |
| Request & Response Size | 6 | Per-host bandwidth panels (CF→Eyeball charged, Origin→CF informational), request/response body size distributions |
| Workers | 9 | CPU/wall time by script (p50/p95/p99), outcomes (ok/exception/exceeded), subrequest count, execution duration heatmap, wall time breakdown |
Selective JSON parsing: Each LogQL query uses | json field1, field2 to extract only the fields it needs instead of parsing all ~72 Logpush fields. A set of filter fields (ClientRequestHost, ClientCountry, ClientRequestPath, ClientIP, JA4, ClientASN, EdgeColoCode) is always included by the http() helper to support template variable filtering across all panels.
High-cardinality aggregation: Tables and “top N” panels use approx_topk (Loki 3.3+) instead of topk for probabilistic aggregation via count-min sketch. This requires query_range.shard_aggregations: approx_topk and frontend.encoding: protobuf in Loki config.
ASN and country name resolution: Raw ASN numbers and ISO Alpha-2 country codes are mapped to human-readable names using Grafana value mappings. country_codes.py has 249 entries (all ISO 3166-1 countries). ASN names are resolved live from the ClientASNDescription field in Cloudflare’s firewall_events dataset — no static ASN lookup table needed. The firewall events dataset includes the ISP/organization name for every ASN, which the dashboard queries directly via LogQL.
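A miniature version of the country value-mapping override looks like the sketch below. The three-entry dict stands in for country_codes.COUNTRY_NAMES, and the override structure follows Grafana's field-override schema as I understand it; treat the exact nesting as an assumption:

```python
# Hypothetical stand-in for country_codes.COUNTRY_NAMES (249 entries).
COUNTRY_NAMES = {"NL": "Netherlands", "SG": "Singapore", "US": "United States"}

def country_value_mappings_override(column_name: str) -> dict:
    """Grafana field override that value-maps ISO Alpha-2 codes to
    country names for one table column."""
    return {
        "matcher": {"id": "byName", "options": column_name},
        "properties": [{
            "id": "mappings",
            "value": [{
                "type": "value",
                "options": {code: {"text": name}
                            for code, name in COUNTRY_NAMES.items()},
            }],
        }],
    }

override = country_value_mappings_override("ClientCountry")
mapping = override["properties"][0]["value"][0]["options"]
print(mapping["NL"]["text"])  # Netherlands
```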
Template variables are textbox type with .* default (matches everything). Grafana’s label_values() only works for indexed Loki labels, not JSON-extracted fields — since all fields are in the JSON body, textbox is the only practical option. Available filters: Host, Country, Path, Client IP, JA4 fingerprint, ASN, Edge Colo.
Cloudflared tunnel dashboard
The cloudflared dashboard has 67 panels (58 content + 9 section rows) across 9 sections, generated by gen-cloudflared.py. It covers tunnel health, capacity planning, QUIC transport internals, latency analysis, and process resource monitoring — all from cloudflared’s native Prometheus metrics endpoint. Published on Grafana.com as dashboard 24874.
| Section | Panels | Key visualizations |
|---|---|---|
| Tunnel Overview | 12 stat panels | Requests/sec, error rate %, HA connections, concurrent requests, stream errors/sec, version, config version, registrations, TCP/UDP sessions, heartbeat retries, total requests |
| Tunnel Capacity & Scaling | 7 | Two-tier model: HTTP (concurrent requests gauge, req/s gauge, throughput timeseries) + WARP/private network (TCP/UDP port capacity gauges, port capacity % over time). Scaling guidelines text panel with limitations warning |
| Traffic | 4 | Requests/sec with errors overlay, response status codes (color-coded 2xx/3xx/4xx/5xx), error rate % trend, stacked response codes |
| Connections & Sessions | 8 | HA connections per pod, concurrent requests per tunnel, TCP sessions (active gauge + new/sec rate), UDP sessions, proxy stream errors, heartbeat retries, ICMP traffic, tunnel registrations |
| Edge Locations | 2 | Active edge server locations table (conn_id → edge PoP mapping), config version over time |
| QUIC Transport | 9 | RTT to edge (smoothed/min/latest per connection), congestion window bytes, bytes sent/received (aggregate + per-connection), packet loss by reason, congestion state with value mappings (0=SlowStart, 1=CongestionAvoidance, 2=Recovery, 3=ApplicationLimited), MTU/max payload, QUIC frames sent/received by type |
| Latency | 4 | Proxy connect latency (p50/p95/p99 histogram quantiles), RPC client latency, RPC server latency, proxy connect latency heatmap |
| RPC Operations | 2 | RPC client operations by handler/method, RPC server operations by handler/method |
| Process Resources | 8 | CPU usage, memory (RSS/Go heap/idle spans), network I/O (TX/RX bytes/sec), goroutines, open file descriptors vs limit, GC duration, heap objects, memory allocation rate |
Two-tier capacity model: The Capacity & Scaling section separates HTTP and WARP/private network traffic because they have fundamentally different scaling characteristics:
- HTTP-only tunnels: requests are multiplexed over QUIC streams on 4 HA connections. No host ephemeral ports are consumed. Primary metrics: `cloudflared_tunnel_concurrent_requests_per_tunnel` and `rate(cloudflared_tunnel_total_requests)`.
- WARP/private network tunnels: TCP/UDP sessions consume host ephemeral ports. Cloudflare’s sizing calculator applies: TCP capacity = `sessions/sec ÷ available_ports`, UDP capacity = `sessions/sec × dns_timeout ÷ available_ports`.
TCP/UDP session metrics read 0 for HTTP-only tunnels — this is correct, not a bug.
Scaling limitations (documented in the dashboard’s guidelines text panel):
- cloudflared has no auto-scaling capability — replicas are HA only, not load-balanced
- Scaling down breaks active eyeball connections (no graceful drain)
- For true horizontal scaling, use multiple discrete tunnels behind a load balancer
QUIC transport: cloudflared connects to Cloudflare edge via QUIC with 4 HA connections per replica. The QUIC section surfaces connection-level metrics that are otherwise invisible: RTT per connection (smoothed EWMA used by congestion control, minimum floor, latest sample), congestion state transitions, packet loss reasons, and frame-level counters. State 3 (ApplicationLimited) is normal for low-traffic tunnels.
Metric discovery note: cloudflared_tunnel_active_streams appears in Cloudflare’s documentation but is not emitted by cloudflared 2026.2.0. The dashboard uses cloudflared_proxy_connect_streams_errors for stream error tracking instead.
Template variables:
| Variable | Type | Default | Purpose |
|---|---|---|---|
| `job` | query | cloudflared-metrics | Auto-discovered from cloudflared_tunnel_ha_connections |
| `available_ports` | custom | 50000 | Ephemeral ports per host (50000/30000/16384) for WARP capacity gauges |
| `dns_timeout` | custom | 5 | DNS UDP session timeout (5/10/30 sec) for UDP capacity calculation |
Part 8: LogQL Pitfalls
Working with high-cardinality Cloudflare Logpush data in Loki exposed twelve specific traps. These cost real debugging time — the error messages are often unhelpful.
1. count_over_time without sum() explodes series
```logql
# BAD: one series per unique log line
count_over_time({job="cloudflare-logpush"} | json [5m])

# GOOD: single aggregated count
sum(count_over_time({job="cloudflare-logpush"} | json [5m]))
```

After `| json`, every extracted field becomes a potential label. Without sum(), count_over_time returns one series per unique label combination — easily hitting max_query_series.
2. unwrap aggregations don’t support by ()
```logql
# BAD: parse error
avg_over_time(... | unwrap EdgeTimeToFirstByteMs [5m]) by (Host)

# GOOD: outer aggregation for grouping
sum by (Host) (avg_over_time(... | unwrap EdgeTimeToFirstByteMs [5m]))
```

3. Stat panels need instant: true
Without instant: true, Loki returns a range result. The stat panel picks lastNotNull which may not reflect the full window. Set "queryType": "instant", "instant": true on stat panel targets.
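In generator terms this is just two extra keys on the target dict. A hypothetical sketch; the key names follow Grafana's Loki target schema, the expr is illustrative:

```python
def stat_target(expr: str) -> dict:
    """Build a stat-panel target that asks Loki for an instant query
    and tells the panel the result is a single point, not a range."""
    return {
        "expr": expr,
        "queryType": "instant",
        "instant": True,
        "range": False,
    }

t = stat_target('sum(count_over_time({job="cloudflare-logpush"} | json [5m]))')
print(t["queryType"], t["instant"])
```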
4. $__auto vs fixed intervals
- Time series panels: `[$__auto]` — adapts to the visible time range
- Table panels: `[5m]` fixed — `$__auto` creates too many evaluation windows
- Stat panels: `[5m]` with `instant: true`
5. Cannot compare two extracted fields
```logql
# IMPOSSIBLE: compare two extracted fields
{...} | json | OriginResponseStatus != EdgeResponseStatus
```

LogQL can only compare extracted fields to literal values. Use two queries or dashboard transformations.
6. unwrap produces one series per stream
Always wrap unwrap aggregations in an outer sum() or avg by ():

```logql
# BAD: one series per label combination
avg_over_time(... | unwrap EdgeTimeToFirstByteMs [$__auto])

# GOOD: collapsed
sum(avg_over_time(... | unwrap EdgeTimeToFirstByteMs [$__auto]))
```

7. max_query_series applies to inner cardinality
```logql
topk(10, sum by (Path) (count_over_time(... | json [5m])))
```

Loki evaluates sum by (Path) first. If there are thousands of unique paths (bots/scanners), it exceeds max_query_series before topk ever runs. Reducing the time window does not help — the cardinality is inherent in the data.
8. High-cardinality topk requires high max_query_series
Even a 1-second scan window can have 1500+ unique paths due to bots. Raised max_query_series to 5000:
```yaml
limits_config:
  max_query_series: 5000
```

The memory impact on single-instance homelab Loki is negligible for instant queries.
9. Table panels with [$__auto] hit series limits
Combines pitfalls 4, 7, and 8. Over a 24h range, $__auto might resolve to 15-second intervals, creating many evaluation windows. Use [5m] fixed for all table instant queries.
10. approx_topk solves the topk cardinality problem
Loki 3.3 added approx_topk — a probabilistic alternative to topk that uses a count-min sketch instead of materializing all inner series:
```logql
# Instead of:
topk(10, sum by (ClientRequestPath) (count_over_time(... | json ClientRequestPath [$__auto])))

# Use:
approx_topk(10, sum by (ClientRequestPath) (count_over_time(... | json ClientRequestPath [$__auto])))
```

This avoids hitting max_query_series on high-cardinality fields. Requires two config settings in Loki:
```yaml
query_range:
  shard_aggregations: approx_topk  # string, NOT a YAML list
frontend:
  encoding: protobuf  # required for approx_topk
```

Drop-in replacement for topk on instant queries (table panels). Results are approximate but accurate enough for dashboard “top N” panels.
11. Selective | json reduces query cost dramatically
Full `| json` extracts all ~72 Logpush fields as labels for every log line. Most queries only need 2-3 fields:

```logql
# BAD: extracts all 72 fields
{job="cloudflare-logpush"} | json | ClientRequestHost =~ "$host"

# GOOD: extracts only needed fields
{job="cloudflare-logpush"} | json ClientRequestHost, EdgeResponseStatus | ClientRequestHost =~ "$host"
```

The Logpush generator’s http(), fw(), and wk() helpers automatically include template variable filter fields plus whatever fields the specific query needs. This reduced query latency significantly on the homelab Loki instance.
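An illustrative reimplementation of the http() helper follows. The field lists mirror the ones described in the text, but the exact selector format and filter expression are assumptions; the real helper lives in gen-cloudflare-logpush.py:

```python
# Filter fields the text says are always included for template variables.
HTTP_FILTER_FIELDS = ["ClientRequestHost", "ClientCountry", "ClientRequestPath",
                      "ClientIP", "JA4", "ClientASN", "EdgeColoCode"]

def http(*fields: str) -> str:
    """Build a LogQL selector that parses only the requested JSON fields
    plus the always-on filter fields, deduplicated in order."""
    wanted = list(dict.fromkeys(list(fields) + HTTP_FILTER_FIELDS))
    return ('{job="cloudflare-logpush", dataset="http_requests"}'
            f' | json {", ".join(wanted)}'
            ' | ClientRequestHost =~ "$host"')

q = http("EdgeResponseStatus")
print(q)
```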
12. Derived metric subtraction requires per-line computation and data quality filtering
Computing “Edge Processing = TTFB - Origin Duration” has two layered pitfalls:
Problem 1 - Aggregation ordering: Subtracting two independently aggregated unwrap queries is wrong because each operates on a potentially different sample population (cache hits vs origin-fetched requests). For percentiles it’s also mathematically invalid: p99(A) - p99(B) ≠ p99(A - B).
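The non-linearity of quantiles is easy to check numerically with toy latency samples (nearest-rank quantile, made-up data):

```python
def quantile(xs, q):
    """Nearest-rank quantile of a list of samples."""
    xs = sorted(xs)
    idx = min(int(q * len(xs)), len(xs) - 1)
    return xs[idx]

ttfb   = [120, 130, 500, 90, 4000]   # EdgeTimeToFirstByteMs (made-up)
origin = [100, 100, 480, 50, 200]    # OriginResponseDurationMs (made-up)

per_request = [a - b for a, b in zip(ttfb, origin)]
print(quantile(ttfb, 0.99) - quantile(origin, 0.99))  # 4000 - 480 = 3520
print(quantile(per_request, 0.99))                    # true p99 of diffs = 3800
```

The two numbers disagree, which is exactly why the per-line subtraction below is required before applying quantile_over_time.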
```logql
# BAD: subtracts two independent aggregations
sum(avg_over_time(... | unwrap EdgeTimeToFirstByteMs [$__auto]))
- sum(avg_over_time(... | unwrap OriginResponseDurationMs [$__auto]))
```

Problem 2 - EdgeTimeToFirstByteMs is capped at 65535 (uint16): Cloudflare’s logging truncates this field at 2^16-1. When an origin takes longer than ~65 seconds, TTFB saturates at 65535 while OriginResponseDurationMs keeps counting (observed up to 661,156ms / ~11 minutes). The per-line subtraction then produces massively negative values (e.g., -595,621ms). This affects ~0.2% of traffic, typically DNS servers or backends with long timeouts.
Fix: Use label_format with Loki’s subf template function for per-line subtraction, AND filter out requests where TTFB hit the uint16 cap:
```logql
# GOOD: per-line subtraction with uint16 overflow filter
sum(avg_over_time(
  ... | json EdgeTimeToFirstByteMs, OriginResponseDurationMs
      | EdgeTimeToFirstByteMs < 65535
      | label_format EdgeProcessingMs="{{ subf .EdgeTimeToFirstByteMs .OriginResponseDurationMs }}"
      | unwrap EdgeProcessingMs [$__auto]
))

# Percentiles now work correctly - true p99 of per-request edge processing time
sum(quantile_over_time(0.99,
  ... | EdgeTimeToFirstByteMs < 65535
      | label_format EdgeProcessingMs="{{ subf .EdgeTimeToFirstByteMs .OriginResponseDurationMs }}"
      | unwrap EdgeProcessingMs [$__auto]
))
```

Part 9: VyOS Metrics
Section titled “Part 9: VyOS Metrics”

Probe vs ScrapeConfig
Section titled “Probe vs ScrapeConfig”

The VyOS router runs node_exporter on port 9100 (HTTPS, self-signed cert). Initially we used the Prometheus Probe CRD, but it routes through blackbox exporter, producing only probe_* metrics — not the actual node_* metrics. VyOS never appeared in the Node Exporter dashboard.
The fix: ScrapeConfig CRD for direct scraping:
```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: vyos-nl
  namespace: monitoring
  labels:
    release: prometheus
spec:
  metricsPath: /metrics
  scheme: HTTPS
  tlsConfig:
    insecureSkipVerify: true
  staticConfigs:
    - targets:
        - prom-vyos.example.com
      labels:
        job: node-exporter
        instance: prom-vyos.example.com
  scrapeInterval: 30s
```

| Aspect | Probe CRD | ScrapeConfig CRD |
|---|---|---|
| Path | Prometheus -> blackbox -> target | Prometheus -> target (direct) |
| Metrics | probe_* only | All target metrics |
| Use case | Endpoint availability | Actual metric scraping |
With job: node-exporter, VyOS appears in the Node Exporter dashboard alongside cluster nodes.
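For contrast, the Probe CRD route looks roughly like this (a hypothetical sketch, not the original manifest; the prober URL and module name are assumptions). Every scrape passes through blackbox exporter, so Prometheus only ever records probe results:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: vyos-nl
  namespace: monitoring
  labels:
    release: prometheus
spec:
  prober:
    url: blackbox-exporter.monitoring.svc:9115  # scrape path: Prometheus -> blackbox -> target
  module: http_2xx                              # assumed blackbox module name
  targets:
    staticConfig:
      static:
        - https://prom-vyos.example.com:9100/metrics  # yields probe_success etc., never node_*
```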
Secrets Management
Section titled “Secrets Management”

Two Secret files require SOPS encryption before committing:
| File | Contents |
|---|---|
| monitoring/grafana/secret.yaml | admin-user, admin-password, oauth-client-id, oauth-client-secret |
| monitoring/alertmanager/secret.yaml | Alertmanager config with SMTP credentials |
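Optionally, a .sops.yaml at the repo root can pin the recipient key and encrypted-regex so they don't have to be repeated on every invocation (a sketch; the path patterns are assumptions about this repo's layout):

```yaml
# .sops.yaml -- sops reads this automatically from the working tree
creation_rules:
  - path_regex: monitoring/.*/secret\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: <YOUR_AGE_PUBLIC_KEY>
  - path_regex: secrets\.tfvars$
    age: <YOUR_AGE_PUBLIC_KEY>
```

With this in place, a plain sops --encrypt --in-place on a matching file picks up both settings.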
```sh
sops --encrypt --age <YOUR_AGE_PUBLIC_KEY> \
  --encrypted-regex '^(data|stringData)$' \
  --in-place monitoring/grafana/secret.yaml

sops --encrypt --age <YOUR_AGE_PUBLIC_KEY> \
  --encrypted-regex '^(data|stringData)$' \
  --in-place monitoring/alertmanager/secret.yaml
```

OpenTofu secrets (logpush_secret, zone IDs, API tokens) live in SOPS-encrypted secrets.tfvars:
```sh
sops -d secrets.tfvars > /tmp/secrets.tfvars
tofu plan -var-file=/tmp/secrets.tfvars
tofu apply -var-file=/tmp/secrets.tfvars
rm /tmp/secrets.tfvars
```

Deployment
Section titled “Deployment”

Full stack deploy
Section titled “Full stack deploy”

```sh
# 1. Deploy the entire monitoring stack (includes all components)
kubectl apply -k monitoring/ --server-side --force-conflicts

# 2. Deploy decompress plugin + middleware (separate from monitoring kustomization)
kubectl apply -f middleware/decompress-configmap.yaml
kubectl apply -f middleware/decompress-middleware.yaml

# 3. Deploy updated Traefik with plugin enabled
kubectl apply -f services/traefik.yaml

# 4. Deploy ingress routes
kubectl apply -f ingressroutes/grafana-ingress.yaml
kubectl apply -f ingressroutes/prometheus-ingress.yaml
kubectl apply -f ingressroutes/alertmanager-ingress.yaml
kubectl apply -f ingressroutes/jaeger-ingress.yaml
kubectl apply -f ingressroutes/alloy-logpush-ingress.yaml

# 5. Deploy KEDA autoscaling
kubectl apply -f hpa/grafana-keda-autoscaling.yaml
kubectl apply -f hpa/prom-keda-autoscaling.yaml

# 6. Apply OpenTofu for DNS + tunnel config
cd cloudflare-tunnel-tf/ && tofu apply

# 7. Apply OpenTofu for Logpush jobs
cd ../cloudflare-tf/main_zone/
tofu apply -var-file=secrets.tfvars
```

--server-side is required because the Prometheus Operator CRDs and the Node Exporter dashboard exceed the 262144-byte limit of the last-applied-configuration annotation that client-side apply writes. IngressRoutes and KEDA ScaledObjects live outside the monitoring/ kustomization directory because Kustomize cannot reference files outside its root.
Verification
Section titled “Verification”

```sh
# All pods running
kubectl get pods -n monitoring

# Prometheus targets
kubectl port-forward svc/prometheus 9090 -n monitoring
# Visit http://localhost:9090/targets -- all should be UP

# Loki receiving data
kubectl logs deploy/alloy-logpush -n monitoring --tail=20

# Logpush data flowing
# In Grafana Explore with Loki:
# {job="cloudflare-logpush"} | json

# Dashboard ConfigMaps
kubectl get cm -n monitoring -l grafana_dashboard=1
# Should show 16 ConfigMaps
```

Resource Budget
Section titled “Resource Budget”

| Component | Instances | CPU Req | Mem Req | Storage |
|---|---|---|---|---|
| Prometheus Operator | 1 | 100m | 128Mi | — |
| Prometheus | 1 | 200m | 512Mi | 10Gi NFS |
| Alertmanager | 1 | 50m | 64Mi | 1Gi NFS |
| Grafana | 1 | 100m | 128Mi | 1Gi NFS |
| kube-state-metrics | 1 | 50m | 64Mi | — |
| Node Exporter | 4 (DaemonSet) | 50m x4 | 32Mi x4 | — |
| Blackbox Exporter | 1 | 25m | 32Mi | — |
| Loki | 1 | 250m | 512Mi | 20Gi NFS |
| Grafana Alloy | 4 (DaemonSet) | 100m x4 | 128Mi x4 | — |
| Alloy Logpush | 1 | 50m | 64Mi | — |
| Jaeger | 1 | 250m | 512Mi | 10Gi NFS |
| Total | | ~1.68 cores | ~2.59Gi | ~42Gi |
File Reference
Section titled “File Reference”

```
monitoring/
  kustomization.yaml          # Top-level: composes all components
  namespace.yaml
  operator/                   # Prometheus Operator v0.89.0
    kustomization.yaml
    crd-*.yaml                # 10 CRDs (~3.7 MB)
    serviceaccount.yaml
    clusterrole.yaml / clusterrolebinding.yaml
    deployment.yaml / service.yaml / servicemonitor.yaml
    webhook.yaml              # Cert-gen Jobs + webhook configs
  prometheus/                 # Prometheus v3.9.1 (operator-managed)
    kustomization.yaml
    prometheus.yaml           # Prometheus CR
    serviceaccount.yaml / clusterrole.yaml / clusterrolebinding.yaml
    service.yaml / servicemonitor.yaml
    rules/
      general-rules.yaml          # Watchdog, TargetDown
      kubernetes-apps.yaml        # CrashLoopBackOff, restarts
      kubernetes-resources.yaml   # CPU/memory quota
      node-rules.yaml             # Filesystem, memory, CPU
      k8s-recording-rules.yaml    # Pre-computed recording rules
      traefik-rules.yaml          # Traefik alerts
  alertmanager/               # Alertmanager v0.31.1
    alertmanager.yaml         # Alertmanager CR
    secret.yaml               # SOPS-encrypted SMTP config
  grafana/                    # Grafana 12.3.3
    configmap.yaml            # grafana.ini + dashboard provider
    datasources.yaml          # Prometheus, Loki, Jaeger, Alertmanager
    deployment.yaml           # Grafana + k8s-sidecar
    secret.yaml               # SOPS-encrypted credentials
    dashboards/
      kustomization.yaml          # configMapGenerator (16 dashboards)
      add-dashboard.sh            # Download from grafana.com
      gen-cloudflare-logpush.py   # 135-panel dashboard generator (--export for grafana.com)
      gen-cloudflared.py          # 67-panel dashboard generator (--export for grafana.com)
      country_codes.py            # 249 ISO 3166-1 Alpha-2 → country name
      *.json                      # 16 dashboard files
  loki/                       # Loki 3.6.5 (monolithic)
    configmap.yaml / statefulset.yaml / service.yaml / servicemonitor.yaml
  alloy/                      # Grafana Alloy v1.13.1 (DaemonSet)
    configmap.yaml / daemonset.yaml / service.yaml / servicemonitor.yaml
  alloy-logpush/              # Alloy Logpush receiver (Deployment)
    configmap.yaml / deployment.yaml / service.yaml / servicemonitor.yaml
  jaeger/                     # Jaeger 2.15.1 (all-in-one)
    configmap.yaml / deployment.yaml / service.yaml / servicemonitor.yaml
  kube-state-metrics/         # v2.18.0
  node-exporter/              # v1.10.2 (DaemonSet)
  blackbox-exporter/          # v0.28.0
  servicemonitors/            # Cross-namespace ServiceMonitors
    apiserver.yaml / authentik.yaml / cloudflared.yaml
    coredns.yaml / kubelet.yaml / revista.yaml / traefik.yaml
  probes/
    vyos-scrape.yaml          # ScrapeConfig for VyOS node_exporter

middleware/                   # Traefik decompress plugin
  decompress-plugin/
    decompress.go / go.mod / .traefik.yml
  decompress-configmap.yaml   # ConfigMap for k8s
  decompress-middleware.yaml  # Middleware CRD

ingressroutes/
  grafana-ingress.yaml / prometheus-ingress.yaml
  alertmanager-ingress.yaml / jaeger-ingress.yaml
  alloy-logpush-ingress.yaml

hpa/
  grafana-keda-autoscaling.yaml   # maxReplicas: 1 (SQLite limitation)
  prom-keda-autoscaling.yaml      # maxReplicas: 8

cloudflare-tunnel-tf/         # OpenTofu: DNS + tunnel ingress rules
  records.tf / tunnel_config.tf

cloudflare-tf/main_zone/      # OpenTofu: Logpush jobs
  zone_logpush_job.tf / locals.tf / variables.tf
```