All guides

Filter by category, difficulty, or free text to find the right material for your team.

Logs · ~35 min

Fix Loki label cardinality before incident logs disappear

Created: July 12, 2026 · Published: July 12, 2026

Learn how to detect explosive Loki labels, move volatile context out of the index, and validate that critical logs remain available during incidents.

Linux · Docker

Advanced→

Reliability · ~45 min

Design multi-window SLO burn-rate alerts without paging people for noise

Created: July 11, 2026 · Published: July 11, 2026

Learn how to define SLIs and error budgets, then write PromQL rules and promtool tests for burn-rate alerts that catch fast incidents without turning every short spike into a broken on-call shift.

Linux

Intermediate→

OpenTelemetry · ~45 min

Detect ingestion backpressure between OpenTelemetry, Data Prepper, and OpenSearch before telemetry drops

Created: July 10, 2026 · Published: July 10, 2026

Validate Collector, Data Prepper, and OpenSearch with parity, queue, and rejection metrics so traces, logs, or metrics do not disappear before you can explain the incident.

Linux · Docker

Advanced→

Logs · ~45 min

Debug Vector pipelines when logs arrive late, broken, or never arrive

Created: July 9, 2026 · Published: July 9, 2026

Validate a Vector pipeline with evidence: config, VRL, tap, internal metrics, backpressure, canaries, and rollback criteria.

Linux · Docker

Intermediate→

Tracing · ~45 min

Tune OpenTelemetry sampling without losing incident traces to broken wrappers

Created: July 8, 2026 · Published: July 8, 2026

Learn how to detect instrumentation wrappers that hide useful attributes, design tail sampling policies, and prove important traces still arrive before reducing cost.

Linux · Docker

Advanced→

Metrics · ~40 min

Build golden signals for Kubernetes HTTP APIs without cardinality sprawl

Created: July 7, 2026 · Published: July 7, 2026

Define a label contract, create RED/USE recording rules, validate cardinality with promtool, and roll out Kubernetes API dashboards without turning every route into a series factory.

Linux · Docker

Intermediate→

Logs · ~45 min

Diagnose OpenSearch vector-search hot shards before RAG latency spikes

Created: July 6, 2026 · Published: July 6, 2026

Learn how to detect hot shards in OpenSearch vector indices, prove whether they affect RAG retrieval, and apply safe mitigations with Dev Tools, metrics, and canaries.

Linux · Docker

Advanced→

Metrics · ~45 min

Guard Prometheus recording rules before cardinality spikes melt remote write

Created: July 3, 2026 · Published: July 3, 2026

Audit Prometheus recording rules, set a label budget, validate with promtool, and promote without melting remote write queues or dashboards.

Linux · Docker

Advanced→

OpenTelemetry · ~50 min

Roll out OpenTelemetry Collector declarative config without dropping telemetry

Created: July 2, 2026 · Published: July 2, 2026

Move to OpenTelemetry Collector declarative configuration with dry-runs, canaries, queue/exporter metrics, PromQL tests, and a prepared rollback.

Linux · Docker

Advanced→

Logs · ~45 min

Reconstruct AI search incident timelines without making up the story

Created: July 1, 2026 · Published: July 1, 2026

Turn scattered telemetry into a verifiable timeline: incident window, falsifiable hypotheses, trace anchors, correlated logs, saturation metrics, and mitigation validation.

Linux · Docker

Intermediate→

Logs · ~45 min

Deduplicate OpenTelemetry logs before Loki limits hide incident evidence

Created: June 30, 2026 · Published: June 30, 2026

Learn how to detect duplicates, apply a safe OpenTelemetry Collector deduplication policy, and prove Loki stops discarding samples without losing incident evidence.

Linux

Advanced→

Metrics · ~35 min

Tame Kubernetes target churn in Prometheus before scrape pools stall

Created: June 29, 2026 · Published: June 29, 2026

A practical guide to reducing Kubernetes target churn in Prometheus with PromQL, canary relabeling, promtool, and rollout guardrails.

Linux · Docker

Advanced→

OpenTelemetry · ~45 min

Diagnose OpenTelemetry Collector backpressure before signals are dropped

Created: June 28, 2026 · Published: June 28, 2026

Learn how to read OpenTelemetry Collector internal metrics, isolate slow exporters, test a safe canary, and prove queues drain without hiding critical spans, logs, or metrics.

Linux

Advanced→

Logs · ~45 min

Diagnose OpenSearch hot shards before indexing queues spiral

Created: June 27, 2026 · Published: June 27, 2026

Learn how to find OpenSearch hot shards with Dev Tools, queue metrics, and canary validation before search latency or bulk rejections ruin the morning.

Linux

Advanced→

Logs · ~45 min

Debug Vector pipeline backpressure before buffers become log loss

Created: June 25, 2026 · Published: June 25, 2026

Learn how to diagnose delayed and dropped logs in Vector with internal metrics, `vector top`, PromQL rules, disk buffers, and a safe canary before changing production.

Linux · Docker

Advanced→

Reliability · ~35 min

Design burn-rate alerts with Kubernetes PSI before latency pages anyone

Created: June 24, 2026 · Published: June 24, 2026

A practical guide to turning burn-rate alerts into actionable signals: clear SLOs, multiple windows, PSI as context, promtool tests, and gradual rollout.

Linux

Intermediate→

Metrics · ~35 min

Diagnose Kubernetes API WATCH/LIST pressure before blaming etcd

Created: June 23, 2026 · Published: June 23, 2026

An actionable guide for detecting Kubernetes API WATCH/LIST pressure by combining API server, etcd, Prometheus, and client validation before timeouts spread.

Linux

Advanced→

Metrics · ~40 min

Enrich Prometheus metrics with info() without triggering cardinality spikes

Created: June 22, 2026 · Published: June 22, 2026

A practical guide to testing Prometheus’ experimental info() function, building a metadata allowlist, measuring series impact, and proving alerts and SLOs still work.

Linux · Docker

Intermediate→

Tracing · ~45 min

Tune OTel tail sampling for GenAI traces without losing incident evidence

Created: June 21, 2026 · Published: June 21, 2026

A practical guide to rolling out OpenTelemetry Collector tail sampling when GenAI and RAG traces grow quickly: policies, canaries, queue metrics, and critical-evidence validation.

Linux

Advanced→

Logs · ~40 min

Troubleshoot Loki ingest limits and label policies before tenant rate limits drop logs

Created: June 20, 2026 · Published: June 20, 2026

Learn how to investigate Loki drops caused by label policies, rate limits, and excessive streams, then clean volatile labels and validate that critical searches still work.

Linux

Advanced→

Logs · ~45 min

Control Vector cardinality before remote write and sinks build backpressure

Created: June 19, 2026 · Published: June 19, 2026

A practical guide to containing explosive tags in Vector 0.56, separating cardinality from retries and buffers, and proving that Prometheus, logs, and SLOs still tell the truth.

Linux

Advanced→

OpenTelemetry · ~40 min

Use OTTL context inference in the Filter Processor without dropping critical telemetry

Created: June 18, 2026 · Published: June 18, 2026

A practical guide to filtering noisy logs, metrics, and traces with OTTL context inference while proving that errors, SLO signals, and incident evidence are still present.

Linux

Advanced→

OpenTelemetry · ~40 min

Validate OTel-Arrow without losing telemetry when pressure hits

Created: June 17, 2026 · Published: June 17, 2026

OTel-Arrow can make telemetry pipelines more efficient, but a safe rollout is not a benchmark victory lap. Prove that pressure is visible, queues drain, and hidden drops stay at zero.

Linux

Advanced→

Tracing · ~35 min

Use OBI header enrichment to scope incidents without leaking secrets

Created: June 16, 2026 · Published: June 16, 2026

Configure OpenTelemetry eBPF Instrumentation to enrich traces with useful headers, obfuscate credentials, and validate that incident response gains context without exposing sensitive data.

Linux

Advanced→

Logs · ~42 min

Stop log bursts from turning OpenSearch rollover into hot shards

Created: June 15, 2026 · Published: June 15, 2026

Learn how to detect when OpenSearch log indices create hot shards during ingestion bursts, fix rollover with aliases and ISM, and validate recovery with real signals.

Linux · Docker

Advanced→

Metrics · ~40 min

Debug Prometheus relabeling when targets disappear without creating cardinality spikes

Created: June 14, 2026 · Published: June 14, 2026

Fix Prometheus relabeling rules that remove useful targets or keep unstable labels, without breaking alerts or inflating the TSDB.

Linux

Intermediate→

OpenTelemetry · ~40 min

Deduplicate logs in the OpenTelemetry Collector before queues start dropping telemetry

Created: June 13, 2026 · Published: June 13, 2026

Learn how to detect log pipeline pressure in the OpenTelemetry Collector, apply logdedup with safe conditions, and validate that queues, memory, and exporters stabilize without breaking audit streams or alerts.

Linux · Docker

Advanced→

Metrics · ~35 min

Use Kubernetes PSI metrics to detect real API saturation without noisy paging

Created: June 12, 2026 · Published: June 12, 2026

A practical guide to adding PSI to Kubernetes API dashboards, correlating CPU/memory/IO pressure with SLIs, and designing alerts that page only on real impact.

Linux

Intermediate→

Reliability · ~35 min

Design SLO burn-rate alerts that page on real impact, not noise

Created: June 11, 2026 · Published: June 11, 2026

Learn how to define a useful SLI, combine multi-window burn alerts, control cardinality, and validate that an SLO page fires only on real impact.

Linux

Intermediate→

Logs · ~40 min

Build incident timelines from logs, metrics, and traces without making up the missing parts

Created: May 2, 2026 · Published: May 2, 2026

Learn how to reconstruct a real incident from Prometheus, Loki, and distributed traces without getting lost in noise, clock skew, or confident storytelling.

Linux · Docker

Intermediate→

Logs · ~40 min

Debug Vector pipelines before retries and buffers hide the real bottleneck

Created: May 1, 2026 · Published: May 1, 2026

A practical guide to using Vector internal metrics, config validation, and isolation tests when logs arrive late, get retried too often, or vanish.

Linux · Docker

Intermediate→

Metrics · ~40 min

Reduce Prometheus cardinality spikes without blinding your alerts

Created: April 30, 2026 · Published: April 30, 2026

A practical guide to spotting when Prometheus is swelling because of unstable labels, cutting cardinality in the right layer, and validating that your alerts still cover the real incident.

Linux · Docker

Advanced→

Logs · ~40 min

Fix Loki label explosion without breaking the searches that actually matter

Created: April 29, 2026 · Published: April 29, 2026

A practical guide to spotting high-cardinality labels in Loki, taking them out of the hot path, and proving your searches still work.

Linux · Docker

Advanced→

Tracing · ~45 min

Tune distributed trace sampling without going blind when the incident lands

Created: April 28, 2026 · Published: April 28, 2026

A practical guide to spotting sampling bias, applying useful tail sampling, and validating that the traces you care about still survive the full path.

Linux · Docker

Advanced→

Logs · ~40 min

Diagnose hot shards in OpenSearch before latency and indexing queues spiral

Created: April 27, 2026 · Published: April 27, 2026

Learn how to confirm an OpenSearch hot shard, find the affected index and node, fix the actual cause, and validate that recovery is real.

Linux · Docker

Advanced→

Metrics · ~40 min

Understand Kubernetes memory metrics without firing false OOM alerts

Created: April 26, 2026 · Published: April 26, 2026

A practical guide to diagnosing Kubernetes container memory with Prometheus and Grafana without confusing usage, working set, RSS, or reclaimable page cache.

Linux · Docker

Intermediate→

Metrics · ~35 min

Build useful golden signals for Kubernetes APIs without triggering Prometheus cardinality traps

Created: April 26, 2026 · Published: April 26, 2026

A practical guide to traffic, latency, errors, and saturation for Kubernetes APIs without filling Prometheus with useless series or breaking alerts.

Linux · Docker

Intermediate→

OpenTelemetry · ~35 min

Diagnosing backpressure in the OpenTelemetry Collector before you start losing telemetry

Created: April 25, 2026 · Published: April 25, 2026

An advanced troubleshooting guide to isolate whether the choke point is the exporter, the network, the backend, or the Collector process itself before telemetry starts dropping.

Docker · Linux

Advanced→

Reliability · ~36 min

Design burn-rate alerts that do not wake people up for sport

Created: April 22, 2026 · Published: April 22, 2026

Recent Prometheus, OpenTelemetry Collector, Loki, and Alloy releases all point to the same uncomfortable truth: alerts wired straight to raw metrics and fragile labels become noisy or broken very easily. This guide shows how to anchor burn-rate alerts on stable recording rules, validate them with promtool, and roll them out without turning every short spike into a fake emergency.

Linux · Docker

Intermediate→

Tracing · ~42 min

Tune distributed sampling without going blind when it hurts most

Created: April 19, 2026 · Published: April 19, 2026

Recent ecosystem signals point to the same issue: poorly designed sampling still breaks diagnosis during real incidents. Between recent OpenTelemetry Collector changes, stricter validation, and trace backends that are still sensitive to series growth, queue pressure, and exemplars, this guide shows a practical way to reduce cost without losing the traces that matter.

Linux · Docker

Advanced→

Logs · ~42 min

Build incident timelines with logs, metrics, and traces without making up the story

Created: April 19, 2026 · Published: April 19, 2026

Recent ecosystem signals point in the same direction: more telemetry does not automatically produce better diagnosis. This guide turns that into a repeatable method for building trustworthy timelines with Prometheus, Loki, and OpenTelemetry, while avoiding very current traps such as misleading memory metrics, noisy labels, and traces that lack useful request context.

Linux · Docker

Intermediate→

Logs · ~55 min

Resolve hot shards in OpenSearch before the cluster starts melting

Created: April 19, 2026 · Published: April 19, 2026

An advanced guide to isolating hot shards in OpenSearch with node, shard, and ingest signals, then applying reversible mitigations before queues, timeouts, and backlogs take over.

Linux · Docker

Advanced→

Logs · ~24 min

Debug Vector pipelines when logs arrive late, broken, or not at all

Created: April 17, 2026 · Published: April 17, 2026

When a Vector pipeline starts delaying, duplicating, or dropping events, random tuning is usually the expensive path. This guide shows how to use internal metrics, config validation, and sink-side signals to find the real bottleneck and fix it with reversible changes.

Docker · Linux

Intermediate→

Metrics · ~45 min

Practical golden signals for APIs on Kubernetes without inflating the stack

Created: April 15, 2026 · Published: April 15, 2026

A technical, actionable guide to mapping golden signals to Prometheus metrics and PromQL, building Grafana panels, creating alerts, and following a reproducible troubleshooting flow for Kubernetes APIs without adding unnecessary agents.

Linux · Docker

Intermediate→

Logs · ~60 min

What to do when Loki sinks under a label cardinality explosion

Created: April 13, 2026 · Published: April 13, 2026

An actionable guide to detecting and fixing high-cardinality labels that degrade or crash Loki: symptoms, metrics and logs to inspect, safe Promtail/ingest changes, and validation steps.

Docker · Linux

Advanced→

Metrics · ~60 min

Reducing Prometheus cardinality spikes without breaking alerts

Created: April 11, 2026 · Published: April 11, 2026

A hands-on guide to detecting high-cardinality sources, applying safe relabeling and rollups, and confirming that critical alerts remain effective.

Docker · Linux

Advanced→

OpenTelemetry · ~35 min

Diagnosing backpressure in the OpenTelemetry Collector before you start losing telemetry

Created: April 10, 2026 · Published: April 10, 2026

An advanced troubleshooting guide to isolate whether the choke point is the exporter, the network, the backend, or the Collector process itself before telemetry starts dropping.

Docker · Linux

Advanced→

Metrics · ~32 min

Metric downsampling with VictoriaMetrics in the free version

Created: April 10, 2026 · Published: April 10, 2026

VictoriaMetrics Enterprise provides native downsampling in cluster. On the free tier, you can approximate it with separate clusters, fan-out, and `-dedup.minScrapeInterval`.

Advanced→

Logs · ~28 min

Size OpenSearch shards from real ingestion

Created: April 9, 2026 · Published: April 9, 2026

An advanced guide to choosing `number_of_shards` and `max_size` from the real ingestion rate of an index.

Advanced→

Tracing · ~10 min

Detect deployment regressions with traces

Created: April 2, 2026 · Published: April 2, 2026

What to compare, which attributes to segment by, and which spans to inspect when a deployment might have introduced latency or errors.

Intermediate→

Reliability · ~12 min

SLO design for platform teams

Created: March 30, 2026 · Published: March 30, 2026

A short framework for choosing indicators and targets that help you negotiate reliability with product and engineering.

Intermediate→