All guides

Filter by category, difficulty, or free text to find the right material for your team.

Logs · ~40 min

Build incident timelines from logs, metrics, and traces without making up the missing parts

Created: May 2, 2026 · Published: May 2, 2026

Learn how to reconstruct a real incident from Prometheus, Loki, and distributed traces without getting lost in noise, clock skew, or confident storytelling.

Linux · Docker
Intermediate

Logs · ~40 min

Debug Vector pipelines before retries and buffers hide the real bottleneck

Created: May 1, 2026 · Published: May 1, 2026

A practical guide to using Vector internal metrics, config validation, and isolation tests when logs arrive late, get retried too often, or vanish.

Linux · Docker
Intermediate

Metrics · ~40 min

Reduce Prometheus cardinality spikes without blinding your alerts

Created: April 30, 2026 · Published: April 30, 2026

A practical guide to spotting when Prometheus is swelling because of unstable labels, cutting cardinality in the right layer, and validating that your alerts still cover the real incident.

Linux · Docker
Advanced

Logs · ~40 min

Fix Loki label explosion without breaking the searches that actually matter

Created: April 29, 2026 · Published: April 29, 2026

A practical guide to spotting high-cardinality labels in Loki, taking them out of the hot path, and proving your searches still work.

Linux · Docker
Advanced

Tracing · ~45 min

Tune distributed trace sampling without going blind when the incident lands

Created: April 28, 2026 · Published: April 28, 2026

A practical guide to spotting sampling bias, applying useful tail sampling, and validating that the traces you care about still survive the full path.

Linux · Docker
Advanced

Logs · ~40 min

Diagnose hot shards in OpenSearch before latency and indexing queues spiral

Created: April 27, 2026 · Published: April 27, 2026

Learn how to confirm an OpenSearch hot shard, find the affected index and node, fix the actual cause, and validate that recovery is real.

Linux · Docker
Advanced

Metrics · ~40 min

Understand Kubernetes memory metrics without firing false OOM alerts

Created: April 26, 2026 · Published: April 26, 2026

A practical guide to diagnosing Kubernetes container memory with Prometheus and Grafana without confusing usage, working set, RSS, or reclaimable page cache.

Linux · Docker
Intermediate

Metrics · ~35 min

Build useful golden signals for Kubernetes APIs without triggering Prometheus cardinality traps

Created: April 26, 2026 · Published: April 26, 2026

A practical guide to traffic, latency, errors, and saturation for Kubernetes APIs without filling Prometheus with useless series or breaking alerts.

Linux · Docker
Intermediate

OpenTelemetry · ~35 min

Diagnosing backpressure in the OpenTelemetry Collector before you start losing telemetry

Created: April 25, 2026 · Published: April 25, 2026

An advanced troubleshooting guide to isolate whether the choke point is the exporter, the network, the backend, or the Collector process itself before telemetry starts dropping.

Docker · Linux
Advanced

Reliability · ~36 min

Design burn rate alerts that do not wake people up for sport

Created: April 22, 2026 · Published: April 22, 2026

Recent Prometheus, OpenTelemetry Collector, Loki, and Alloy releases all point to the same uncomfortable truth: alerts wired straight to raw metrics and fragile labels become noisy or broken very easily. This guide shows how to anchor burn-rate alerts on stable recording rules, validate them with promtool, and roll them out without turning every short spike into a fake emergency.

Linux · Docker
Intermediate

Tracing · ~42 min

Tune distributed sampling without going blind when it hurts most

Created: April 19, 2026 · Published: April 19, 2026

Recent ecosystem signals point to the same issue: poorly designed sampling still breaks diagnosis during real incidents. Between recent OpenTelemetry Collector changes, stricter validation, and trace backends that are still sensitive to series growth, queue pressure, and exemplars, this guide shows a practical way to reduce cost without losing the traces that matter.

Linux · Docker
Advanced

Logs · ~42 min

Build incident timelines with logs, metrics, and traces without making up the story

Created: April 19, 2026 · Published: April 19, 2026

Recent ecosystem signals point in the same direction: more telemetry does not automatically produce better diagnosis. This guide turns that into a repeatable method for building trustworthy timelines with Prometheus, Loki, and OpenTelemetry, while avoiding very current traps such as misleading memory metrics, noisy labels, and traces that lack useful request context.

Linux · Docker
Intermediate

Logs · ~55 min

Resolve hot shards in OpenSearch before the cluster starts melting

Created: April 19, 2026 · Published: April 19, 2026

An advanced guide to isolating hot shards in OpenSearch with node, shard, and ingest signals, then applying reversible mitigations before queues, timeouts, and backlogs take over.

Linux · Docker
Advanced

Logs · ~24 min

Debug Vector pipelines when logs arrive late, broken, or not at all

Created: April 17, 2026 · Published: April 17, 2026

When a Vector pipeline starts delaying, duplicating, or dropping events, random tuning is usually the expensive path. This guide shows how to use internal metrics, config validation, and sink-side signals to find the real bottleneck and fix it with reversible changes.

Docker · Linux
Intermediate

Metrics · ~45 min

Practical golden signals for APIs on Kubernetes without inflating the stack

Created: April 15, 2026 · Published: April 15, 2026

A technical, actionable guide to mapping golden signals to Prometheus metrics and PromQL, building Grafana panels, creating alerts, and following a reproducible troubleshooting flow for Kubernetes APIs without adding unnecessary agents.

Linux · Docker
Intermediate

Logs · ~60 min

What to do when Loki sinks from label cardinality explosion

Created: April 13, 2026 · Published: April 13, 2026

An actionable guide to detecting and fixing high-cardinality labels that degrade or crash Loki: symptoms, metrics and logs to inspect, safe Promtail/ingest changes, and validation steps.

Docker · Linux
Advanced

Metrics · ~60 min

Reducing Prometheus cardinality spikes without breaking alerts

Created: April 11, 2026 · Published: April 11, 2026

A hands-on guide to detecting high-cardinality sources, applying safe relabeling and rollups, and confirming that critical alerts remain effective.

Docker · Linux
Advanced

OpenTelemetry · ~35 min

Diagnosing backpressure in the OpenTelemetry Collector before you start losing telemetry

Created: April 10, 2026 · Published: April 10, 2026

An advanced troubleshooting guide to isolate whether the choke point is the exporter, the network, the backend, or the Collector process itself before telemetry starts dropping.

Docker · Linux
Advanced

Metrics · ~32 min

Metric downsampling with VictoriaMetrics in the free version

Created: April 10, 2026 · Published: April 10, 2026

VictoriaMetrics Enterprise provides native downsampling in cluster mode. On the free tier, you can approximate it with separate clusters, fan-out, and `-dedup.minScrapeInterval`.

Advanced

Logs · ~28 min

Size OpenSearch shards from real ingestion

Created: April 9, 2026 · Published: April 9, 2026

An advanced guide to choosing `number_of_shards` and `max_size` from the real ingestion rate of an index.

Advanced

Tracing · ~10 min

Detect deployment regressions with traces

Created: April 2, 2026 · Published: April 2, 2026

What to compare, which attributes to segment, and which spans to inspect when a deployment might have introduced latency or errors.

Intermediate

Reliability · ~12 min

SLO design for platform teams

Created: March 30, 2026 · Published: March 30, 2026

A short framework for choosing indicators and targets that help you negotiate reliability with product and engineering.

Intermediate