All guides

Reliability · ~45 min

Design multi-window SLO burn-rate alerts without paging people for noise

Created: July 11, 2026 · Published: July 11, 2026

Learn how to define SLIs and an error budget, then write the PromQL and promtool tests for burn-rate alerts that catch fast incidents without turning every short spike into a broken on-call shift.

Linux

Intermediate

Metrics · ~40 min

Build golden signals for Kubernetes HTTP APIs without cardinality sprawl

Created: July 7, 2026 · Published: July 7, 2026

Define a label contract, create RED/USE recording rules, validate cardinality with promtool, and roll out Kubernetes API dashboards without turning every route into a series factory.

Linux · Docker

Intermediate

Metrics · ~45 min

Guard Prometheus recording rules before cardinality spikes melt remote write

Created: July 3, 2026 · Published: July 3, 2026

Audit Prometheus recording rules, set a label budget, validate with promtool, and promote without melting remote write queues or dashboards.

Linux · Docker

Advanced

Logs · ~45 min

Reconstruct AI search incident timelines without making up the story

Created: July 1, 2026 · Published: July 1, 2026

Turn scattered telemetry into a verifiable timeline: incident window, falsifiable hypotheses, trace anchors, correlated logs, saturation metrics, and mitigation validation.

Linux · Docker

Intermediate

Metrics · ~35 min

Tame Kubernetes target churn in Prometheus before scrape pools stall

Created: June 29, 2026 · Published: June 29, 2026

A practical guide to reducing Kubernetes target churn in Prometheus with PromQL, canary relabeling, promtool, and rollout guardrails.

Linux · Docker

Advanced

Metrics · ~35 min

Clean up kube-state-metrics noise before device metrics trigger Prometheus cardinality spikes

Created: June 26, 2026 · Published: June 26, 2026

Find which kube-state-metrics families inflate Prometheus, prune volatile labels with allowlists, and validate alerts and dashboards before rolling the change out.

Linux

Intermediate

Reliability · ~35 min

Design burn-rate alerts with Kubernetes PSI before latency pages anyone

Created: June 24, 2026 · Published: June 24, 2026

A practical guide to turning burn-rate alerts into actionable signals: clear SLOs, multiple windows, PSI as context, promtool tests, and gradual rollout.

Linux

Intermediate

Metrics · ~35 min

Diagnose Kubernetes API WATCH/LIST pressure before blaming etcd

Created: June 23, 2026 · Published: June 23, 2026

An actionable guide for detecting Kubernetes API WATCH/LIST pressure by combining API server, etcd, Prometheus, and client validation before timeouts spread.

Linux

Advanced

Metrics · ~40 min

Enrich Prometheus metrics with info() without triggering cardinality spikes

Created: June 22, 2026 · Published: June 22, 2026

A practical guide to testing Prometheus’ experimental info() function, building a metadata allowlist, measuring series impact, and proving alerts and SLOs still work.

Linux · Docker

Intermediate

Logs · ~45 min

Control Vector cardinality before remote write and sinks backpressure

Created: June 19, 2026 · Published: June 19, 2026

A practical guide to containing explosive tags in Vector 0.56, separating cardinality from retries and buffers, and proving that Prometheus, logs, and SLOs still tell the truth.

Linux

Advanced

Metrics · ~40 min

Debug Prometheus relabeling when targets disappear without creating cardinality spikes

Created: June 14, 2026 · Published: June 14, 2026

Fix Prometheus relabeling rules that remove useful targets or keep unstable labels, without breaking alerts or inflating the TSDB.

Linux

Intermediate

Metrics · ~35 min

Use Kubernetes PSI metrics to detect real API saturation without noisy paging

Created: June 12, 2026 · Published: June 12, 2026

A practical guide to adding PSI to Kubernetes API dashboards, correlating CPU/memory/IO pressure with SLIs, and designing alerts that page only on real impact.

Linux

Intermediate

Reliability · ~35 min

Design SLO burn-rate alerts that page on real impact, not noise

Created: June 11, 2026 · Published: June 11, 2026

Learn how to define a useful SLI, combine multi-window burn alerts, control cardinality, and validate that an SLO page fires only on real impact.

Linux

Intermediate

Metrics · ~35 min

Clean up kube-state-metrics noise before your dashboards and alerts start lying

Created: June 10, 2026 · Published: June 10, 2026

kube-state-metrics can turn Kubernetes state into a storm of series, labels, and dashboards that look precise but do not help. This guide shows how to measure the noise, remove unstable labels, keep critical signals, and validate Prometheus before touching production.

Linux · Docker

Intermediate

Logs · ~40 min

Build incident timelines from logs, metrics, and traces without making up the missing parts

Created: May 2, 2026 · Published: May 2, 2026

Learn how to reconstruct a real incident from Prometheus, Loki, and distributed traces without getting lost in noise, clock skew, or confident storytelling.

Linux · Docker

Intermediate

Metrics · ~40 min

Reduce Prometheus cardinality spikes without blinding your alerts

Created: April 30, 2026 · Published: April 30, 2026

A practical guide to spotting when Prometheus is swelling because of unstable labels, cutting cardinality in the right layer, and validating that your alerts still cover the real incident.

Linux · Docker

Advanced

Metrics · ~40 min

Understand Kubernetes memory metrics without firing false OOM alerts

Created: April 26, 2026 · Published: April 26, 2026

A practical guide to diagnosing Kubernetes container memory with Prometheus and Grafana without confusing usage, working set, RSS, or reclaimable page cache.

Linux · Docker

Intermediate

Metrics · ~35 min

Build useful golden signals for Kubernetes APIs without triggering Prometheus cardinality traps

Created: April 26, 2026 · Published: April 26, 2026

A practical guide to traffic, latency, errors, and saturation for Kubernetes APIs without filling Prometheus with useless series or breaking alerts.

Linux · Docker

Intermediate

Reliability · ~36 min

Design burn rate alerts that do not wake people up for sport

Created: April 22, 2026 · Published: April 22, 2026

Recent Prometheus, OpenTelemetry Collector, Loki, and Alloy releases all point to the same uncomfortable truth: alerts wired straight to raw metrics and fragile labels become noisy or broken very easily. This guide shows how to anchor burn-rate alerts on stable recording rules, validate them with promtool, and roll them out without turning every short spike into a fake emergency.

Linux · Docker

Intermediate

Metrics · ~38 min

Clean up kube-state-metrics noise so your dashboards mean something again

Created: April 20, 2026 · Published: April 20, 2026

kube-state-metrics is still valuable, but in 2026 it exposes more surface area, more stable metrics, and recent defaults such as EndpointSlices. If your dashboards filled up with irrelevant series, fragile joins, or duplicated states, this guide shows how to reduce noise at the source, fix your queries, and validate that the cleanup does not break alerts or troubleshooting.

Linux · Docker

Intermediate

Logs · ~42 min

Build incident timelines with logs, metrics, and traces without making up the story

Created: April 19, 2026 · Published: April 19, 2026

Recent ecosystem signals point in the same direction: more telemetry does not automatically produce better diagnosis. This guide turns that into a repeatable method for building trustworthy timelines with Prometheus, Loki, and OpenTelemetry, while avoiding very current traps such as misleading memory metrics, noisy labels, and traces that lack useful request context.

Linux · Docker

Intermediate

Metrics · ~45 min

Practical golden signals for APIs on Kubernetes without inflating the stack

Created: April 15, 2026 · Published: April 15, 2026

A technical, actionable guide to mapping golden signals to Prometheus metrics and PromQL, building Grafana panels, creating alerts, and following a reproducible troubleshooting flow for Kubernetes APIs without adding unnecessary agents.

Linux · Docker

Intermediate

Metrics · ~60 min

Reducing Prometheus cardinality spikes without breaking alerts

Created: April 11, 2026 · Published: April 11, 2026

A hands-on guide to detecting high-cardinality sources, applying safe relabeling and rollups, and confirming that critical alerts remain effective.

Docker · Linux

Advanced

Metrics · ~32 min

Metric downsampling with VictoriaMetrics in the free version

Created: April 10, 2026 · Published: April 10, 2026

VictoriaMetrics Enterprise provides native downsampling in cluster mode. On the free version, you can approximate it with separate clusters, fan-out, and `-dedup.minScrapeInterval`.

Advanced

Metrics · ~16 min

Prometheus for system metrics (cross-platform)

Created: April 5, 2026 · Published: April 5, 2026

Deploy Prometheus and validate operational metrics with a reproducible workflow, independent of your operating system.

Docker

Beginner