All guides

Filter by category, difficulty, or free text to find the right material for your team.

Logs · ~40 min

Debug Vector pipelines before retries and buffers hide the real bottleneck

Created: May 1, 2026 · Published: May 1, 2026

A practical guide to using Vector internal metrics, config validation, and isolation tests when logs arrive late, get retried too often, or vanish.

Linux · Docker
Intermediate

Metrics · ~40 min

Reduce Prometheus cardinality spikes without blinding your alerts

Created: April 30, 2026 · Published: April 30, 2026

A practical guide to spotting when Prometheus is swelling because of unstable labels, cutting cardinality in the right layer, and validating that your alerts still cover the real incident.

Linux · Docker
Advanced

Logs · ~40 min

Fix Loki label explosion without breaking the searches that actually matter

Created: April 29, 2026 · Published: April 29, 2026

A practical guide to spotting high-cardinality labels in Loki, taking them out of the hot path, and proving your searches still work.

Linux · Docker
Advanced

Logs · ~40 min

Diagnose hot shards in OpenSearch before latency and indexing queues spiral

Created: April 27, 2026 · Published: April 27, 2026

Learn how to confirm an OpenSearch hot shard, find the affected index and node, fix the actual cause, and validate that recovery is real.

Linux · Docker
Advanced

Metrics · ~40 min

Understand Kubernetes memory metrics without firing false OOM alerts

Created: April 26, 2026 · Published: April 26, 2026

A practical guide to diagnosing Kubernetes container memory with Prometheus and Grafana without confusing usage, working set, RSS, or reclaimable page cache.

Linux · Docker
Intermediate

OpenTelemetry · ~35 min

Diagnosing backpressure in the OpenTelemetry Collector before you start losing telemetry

Created: April 25, 2026 · Published: April 25, 2026

An advanced troubleshooting guide to isolate whether the choke point is the exporter, the network, the backend, or the Collector process itself before telemetry starts dropping.

Docker · Linux
Advanced

Metrics · ~38 min

Clean up kube-state-metrics noise so your dashboards mean something again

Created: April 20, 2026 · Published: April 20, 2026

kube-state-metrics is still valuable, but in 2026 it exposes more surface area, more stable metrics, and recent defaults such as EndpointSlices. If your dashboards filled up with irrelevant series, fragile joins, or duplicated states, this guide shows how to reduce noise at the source, fix your queries, and validate that the cleanup does not break alerts or troubleshooting.

Linux · Docker
Intermediate

Logs · ~55 min

Resolve hot shards in OpenSearch before the cluster starts melting

Created: April 19, 2026 · Published: April 19, 2026

An advanced guide to isolating hot shards in OpenSearch with node, shard, and ingest signals, then applying reversible mitigations before queues, timeouts, and backlogs take over.

Linux · Docker
Advanced

Logs · ~24 min

Debug Vector pipelines when logs arrive late, broken, or not at all

Created: April 17, 2026 · Published: April 17, 2026

When a Vector pipeline starts delaying, duplicating, or dropping events, random tuning is usually the expensive path. This guide shows how to use internal metrics, config validation, and sink-side signals to find the real bottleneck and fix it with reversible changes.

Docker · Linux
Intermediate

Logs · ~60 min

What to do when Loki buckles under a label cardinality explosion

Created: April 13, 2026 · Published: April 13, 2026

An actionable guide to detecting and fixing high-cardinality labels that degrade or crash Loki: symptoms, metrics and logs to inspect, safe Promtail and ingest changes, and validation steps.

Docker · Linux
Advanced

Metrics · ~60 min

Reducing Prometheus cardinality spikes without breaking alerts

Created: April 11, 2026 · Published: April 11, 2026

A hands-on guide to detecting high-cardinality sources, applying safe relabeling and rollups, and confirming that critical alerts remain effective.

Docker · Linux
Advanced

OpenTelemetry · ~35 min

Diagnosing backpressure in the OpenTelemetry Collector before you start losing telemetry

Created: April 10, 2026 · Published: April 10, 2026

An advanced troubleshooting guide to isolate whether the choke point is the exporter, the network, the backend, or the Collector process itself before telemetry starts dropping.

Docker · Linux
Advanced

Metrics · ~32 min

Metric downsampling with VictoriaMetrics in the free version

Created: April 10, 2026 · Published: April 10, 2026

VictoriaMetrics Enterprise provides native downsampling in the cluster version. In the free version, you can approximate it with separate clusters, fan-out, and `-dedup.minScrapeInterval`.

Advanced

Logs · ~28 min

Size OpenSearch shards from real ingestion

Created: April 9, 2026 · Published: April 9, 2026

An advanced guide to choosing `number_of_shards` and `max_size` based on the real ingestion rate of an index.

Advanced

Dashboards · ~20 min

Grafana to unify metrics, logs, and traces (cross-platform)

Created: April 7, 2026 · Published: April 7, 2026

Deploy Grafana, provision datasources, and keep one place to explore metrics, logs, and trace correlation.

Docker
Intermediate

Logs · ~18 min

OpenSearch for centralized logs (cross-platform)

Created: April 6, 2026 · Published: April 6, 2026

Configure OpenSearch and Dashboards, load initial documents, and validate operational log-search workflows on any platform.

Docker
Beginner

Metrics · ~16 min

Prometheus for system metrics (cross-platform)

Created: April 5, 2026 · Published: April 5, 2026

Deploy Prometheus and validate operational metrics with a reproducible workflow, independent of your operating system.

Docker
Beginner

Reliability · ~12 min

SLO design for platform teams

Created: March 30, 2026 · Published: March 30, 2026

A short framework for choosing indicators and targets that help you negotiate reliability with product and engineering.

Intermediate

OpenTelemetry · ~18 min

Observability foundations with OpenTelemetry

Created: March 21, 2026 · Published: March 21, 2026

A practical guide for moving from instrumentation by fashion to instrumentation that answers real operational questions.

Beginner