The Four Pillars of Observability: Logs, Metrics, Traces, and Wide Events
Why the traditional three-pillar model is broken, and what wide events do that no single tool can
1. The 3 AM Debugging Nightmare
It's 3 AM. PagerDuty wakes you up. Payment success rate dropped from 99.2% to 97.8% in the last 15 minutes.
You open three tabs:
- Grafana (metrics): CPU and memory are fine. P99 latency is spiking. But you don't know which requests are slow.
- Kibana (logs): You search "ERROR" in the last hour. 200 results. Half are irrelevant. The one that matters is buried under noise from a health check logger that never got fixed.
- Jaeger (traces): You find a trace for a failed payment. But the trace says "timeout" with no detail about what timed out. The root span is just a generic HTTP 500.
Three separate tools. Three separate views of the same incident. None of them connecting the dots.
This is the "three pillars of observability" — and it's not observability. It's three silos pretending to be one.
2. The Problem: Pillars Are Tool Boundaries, Not Engineering Needs
The three-pillar model (logs, metrics, traces) wasn't designed by engineers solving debugging problems. It was inherited from the tooling market:
- Metrics came from monitoring tools (Nagios, Prometheus) that needed to track resource utilization at scale.
- Logs came from debugging tools (Splunk, ELK) built for event-driven investigation.
- Traces came from distributed systems research (Dapper, Zipkin) that needed to follow requests across services.
Each tool solved a real problem — in isolation. But production incidents don't respect tool boundaries. An incident is a multi-dimensional puzzle, and the three pillars hand you three separate puzzle pieces that weren't designed to fit together.
The question isn't "which pillar should I use?" — it's "why can't I ask arbitrary questions about my production system without switching contexts?"
3. When This Applies
This post is for you if:
- You run 10+ microservices and incident debugging takes longer than it should — especially when the incident crosses service boundaries.
- Your on-call rotation dreads pages not because of the interruption, but because it takes 45+ minutes to piece together what happened.
- You've ever said "the logs are wrong" or "the trace doesn't show the full picture."
- You operate at 100K+ requests/second and cardinality is a real cost constraint — you can't log everything, so you log too little.
- You're rebuilding your observability stack or choosing between vendors and want a framework that won't be obsolete in three years.
4. The Core Framework: The Four Pillars
Here's what each pillar actually does — honestly, including where it falls short.
4.1 Logs: The Narrative
What they are: Unstructured or semi-structured text records of discrete events. "Request received." "DB query took 45ms." "Payment failed: insufficient funds."
What they're good for: Debugging specific one-off issues. When you know what to search for ("search for 'payment failed' in the last hour"), logs are the fastest path to an answer.
Where they fail: Logs have no structure for aggregation. You can't ask "show me the P95 latency by merchant_id" from log lines. You also can't correlate logs across services without trace IDs — which most teams add inconsistently.
The dirty secret: In most teams, roughly 95% of log volume is noise and 5% is signal. The noise buries the signal, so adding more logs makes debugging harder, not easier.
4.2 Metrics: The Dashboards
What they are: Aggregated numerical measurements over time. Request count, P99 latency, error rate, CPU utilization. Sampled and rolled up into time-series databases.
What they're good for: Spotting trends and anomalies. "Latency has been climbing for 30 minutes." "Error rate just doubled." — Metrics are the best early warning system.
Where they fail: Metrics tell you that something is wrong, but rarely what. A P99 spike could mean a bad deploy, a traffic surge, a downstream dependency failing, or a noisy neighbor. The metric points at the symptom, not the cause.
Metrics also lose context through aggregation. You don't know which endpoint, which user, which request caused the spike. Downsampling means you can't zoom in beyond a certain resolution.
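To make the aggregation loss concrete, here is a small sketch. The merchant IDs and durations are made up for illustration, and the `p99` helper uses a simple nearest-rank definition rather than whatever your metrics backend computes. The point is only that the rolled-up number cannot tell you who is slow; the per-merchant answer needs the raw dimension.

```python
from collections import defaultdict

# Hypothetical per-request samples: (merchant_id, duration_ms).
samples = [("m_fast", 40)] * 95 + [("m_slow", 900)] * 5


def p99(values):
    """Nearest-rank P99: smallest value with at least 99% of samples at or below it."""
    ordered = sorted(values)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


# What a metrics pipeline keeps: one number for the whole fleet.
overall_p99 = p99([duration for _, duration in samples])

# What it threw away: the dimension needed to say who is slow.
durations_by_merchant = defaultdict(list)
for merchant_id, duration in samples:
    durations_by_merchant[merchant_id].append(duration)
p99_by_merchant = {m: p99(d) for m, d in durations_by_merchant.items()}

print(overall_p99)      # 900
print(p99_by_merchant)  # {'m_fast': 40, 'm_slow': 900}
```

The dashboard shows 900 ms either way. Only the second query says that one merchant accounts for all of it.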
4.3 Traces: The Request Journey
What they are: End-to-end records of a single request as it flows through services. Each span adds timing and metadata at a hop.
What they're good for: Finding bottlenecks in distributed request paths. "Why is this API call slow?" — Trace the request, find the span that took the longest, identify the slow service.
Where they fail: Sampling. At scale, you can't trace every request without bankrupting your storage budget. Most teams sample at 1-5%, which means for any given incident, the specific failing request probably wasn't traced.
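A quick back-of-the-envelope check on that sampling claim. This sketch assumes uniform head-based sampling where each request is kept independently with the same probability; tail-based or error-biased samplers behave differently, and the function name is mine.

```python
def chance_at_least_one_traced(sample_rate: float, failing_requests: int) -> float:
    """Probability that at least one of N failing requests was sampled,
    assuming each request is sampled independently at `sample_rate`."""
    return 1.0 - (1.0 - sample_rate) ** failing_requests


for rate in (0.01, 0.05):
    for failures in (1, 10, 100):
        p = chance_at_least_one_traced(rate, failures)
        print(f"rate={rate:.0%} failures={failures:>3} -> {p:.1%} chance of having a trace")
```

For a single failing request the chance equals the sample rate itself, so at 1-5% sampling you usually have nothing. You only get decent odds once the failure has repeated many times.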
Traces also have limited context per span. A span typically records service name, duration, status code, and maybe a few tags. That's not enough to answer questions like "was this request for a high-value merchant?" or "did this call happen during a deployment?"
4.4 Wide Events: The Unifier
What they are: High-dimensional structured events that carry every piece of context about a single request in one record. Not a log line, not a metric point, not a span — an event that combines all three.
A typical wide event looks like:
{
  "timestamp": "2026-05-19T12:34:56Z",
  "service": "payment-collector",
  "endpoint": "/api/v1/charge",
  "duration_ms": 342,
  "status_code": 200,
  "merchant_id": "m_abc123",
  "amount": 150000,
  "payment_method": "card",
  "card_bin": "411111",
  "is_retry": false,
  "trace_id": "tr_xyz789",
  "span_id": "sp_456",
  "error": null,
  "db_query_count": 4,
  "db_total_ms": 87,
  "cache_hit": true,
  "deployment_id": "dep_20260519_1200",
  "region": "ap-south-1",
  "instance_id": "i-0a1b2c3d"
}
What they're good for: Asking arbitrary questions about your production system without knowing the answer upfront.
- "Show me all slow requests (duration > 500ms) by merchant tier."
- "Do retries correlate with a specific card BIN range?"
- "Is latency higher in one region after the last deploy?"
Where they excel: Wide events make the implicit explicit. Every event carries enough context to answer almost any question, including questions you didn't think to ask when you built the system.
Where they have friction: Storage cost and cardinality. Every event has 30-50+ dimensions. Storing that at scale is expensive — think Honeycomb or a purpose-built columnar store. Traditional SQL DBs choke on this volume.
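One way to produce an event like the one above is to accumulate fields on a single record for the lifetime of the request and emit it exactly once at the end. The sketch below is a minimal illustration using only the standard library. The `wide_event` helper and its `emit` parameter are hypothetical names, not part of any vendor SDK, and a real service would hand the record to its telemetry pipeline instead of a list.

```python
import json
import time
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def wide_event(service, endpoint, emit=print):
    """Collect fields for one request and emit a single JSON record at the end."""
    fields = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "endpoint": endpoint,
        "error": None,
    }
    started = time.monotonic()
    try:
        yield fields  # handler code adds context as it learns it
    except Exception as exc:
        fields["error"] = repr(exc)
        raise
    finally:
        # Runs on success and failure, so every request yields exactly one event.
        fields["duration_ms"] = round((time.monotonic() - started) * 1000)
        emit(json.dumps(fields))


emitted = []
with wide_event("payment-collector", "/api/v1/charge", emit=emitted.append) as event:
    event["merchant_id"] = "m_abc123"
    event.update(db_query_count=4, db_total_ms=87, cache_hit=True)

print(emitted[0])
```

The design choice worth noticing is that the handler never decides when to log. It only adds context, and the wrapper guarantees one record per request, including when the handler raises.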
5. The Real Pattern: Wide Events First, Everything Else on Top
Here's the framework that actually works in production, validated across multiple orgs handling 100K+ req/s:
Instrument wide events as your primary observability signal. Derive metrics from them. Keep traces for deep-dive debugging. Use logs as the fallback for edge cases.
In practice:
- Every request produces one wide event. Not one log line, one metric point, and one trace — one event that carries everything.
- Metrics are aggregate queries on the event stream. Your P99 latency dashboard is SELECT quantile(duration_ms, 0.99) GROUP BY endpoint, region, not a separate instrumentation path.
- Traces are reconstructed from events. Each event carries trace_id and span_id. You can still visualize a trace waterfall, but you don't need a separate tracing pipeline.
- Logs are the safety valve. For truly exceptional situations (stack traces, debug dumps), emit a separate log. But 95% of what you currently log should be a field on the wide event.
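The "derive metrics, reconstruct traces" idea can be sketched in memory with plain Python. The events below are invented, `parent_span_id` is an assumed extra field that the sample event earlier does not carry, and a real deployment would run the equivalent queries in its event store rather than in application code.

```python
from collections import defaultdict

# Hypothetical events; parent_span_id is an assumed field used to order spans.
events = [
    {"trace_id": "tr_1", "span_id": "sp_1", "parent_span_id": None,
     "endpoint": "/api/v1/charge", "region": "ap-south-1", "duration_ms": 342},
    {"trace_id": "tr_1", "span_id": "sp_2", "parent_span_id": "sp_1",
     "endpoint": "/internal/charge", "region": "ap-south-1", "duration_ms": 300},
    {"trace_id": "tr_2", "span_id": "sp_3", "parent_span_id": None,
     "endpoint": "/api/v1/charge", "region": "ap-south-1", "duration_ms": 120},
    {"trace_id": "tr_3", "span_id": "sp_4", "parent_span_id": None,
     "endpoint": "/api/v1/charge", "region": "us-east-1", "duration_ms": 95},
]


def p99(values):
    """Nearest-rank P99 over a non-empty list."""
    ordered = sorted(values)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def p99_by(events, *dimensions):
    """Metric as a query: group events by any dimensions, then aggregate."""
    groups = defaultdict(list)
    for event in events:
        groups[tuple(event[d] for d in dimensions)].append(event["duration_ms"])
    return {key: p99(durations) for key, durations in groups.items()}


def trace_spans(events, trace_id):
    """Trace as a query: all events sharing a trace_id, root span first."""
    spans = [e for e in events if e["trace_id"] == trace_id]
    return sorted(spans, key=lambda e: e["parent_span_id"] is not None)


print(p99_by(events, "endpoint", "region"))
print([s["span_id"] for s in trace_spans(events, "tr_1")])  # ['sp_1', 'sp_2']
```

Both answers come from the same records. Changing the grouping from `endpoint, region` to any other field is a change to the query, not to the instrumentation.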
This eliminates the three biggest observability problems:
- No more context switching between tools: One query interface for almost everything.
- No more sampling tradeoffs: You store wide events selectively (drop high-cardinality fields for cheap requests, keep everything for anomalies) — but you can ask any question.
- No more blind spots: The deployment_id field means you can correlate every incident with the exact deploy that triggered it. The merchant_id field means you can spot trends per customer before they churn.
6. Gotchas & How to Handle Them
Gotcha #1: Cost Blowup
Problem: Each wide event is 1-5KB of JSON. At 100K req/s, that's 100-500MB/s of raw event data. Storage costs spiral fast.
Why: You're storing 30-50 dimensions per event, many of which are strings (trace_id, merchant_id, deployment_id). The cardinality is enormous.
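The arithmetic behind those numbers, as a sketch you can rerun with your own traffic. It assumes decimal units (1 MB = 1,000 KB), a flat request rate all day, and no compression, so treat the output as an upper bound on raw volume rather than a storage bill.

```python
REQUESTS_PER_SECOND = 100_000
SECONDS_PER_DAY = 86_400


def raw_volume(event_kb: float) -> tuple[float, float]:
    """Return (MB per second, TB per day) of uncompressed event data."""
    mb_per_second = REQUESTS_PER_SECOND * event_kb / 1_000
    tb_per_day = mb_per_second * SECONDS_PER_DAY / 1_000_000
    return mb_per_second, tb_per_day


for size_kb in (1, 5):
    mb_s, tb_day = raw_volume(size_kb)
    print(f"{size_kb} KB/event -> {mb_s:.0f} MB/s, {tb_day:.2f} TB/day")
# 1 KB/event -> 100 MB/s, 8.64 TB/day
# 5 KB/event -> 500 MB/s, 43.20 TB/day
```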
Fix: Three-tier retention:
- Hot (7 days): Full event — all dimensions, every request. Purpose-built columnar store (Honeycomb, ClickHouse).
- Warm (30 days): Sampled events — keep 100% of error/slow events, sample 10% of normal events. Drop low-value dimensions (instance_id, card_bin).
- Cold (1 year): Aggregated metrics only — pre-compute P50/P95/P99 by endpoint and merchant tier. No raw events.
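The warm-tier rule above can be sketched as a single function that decides whether an event survives and what it looks like afterwards. The 500 ms "slow" threshold and the 5xx-means-error rule are my assumptions (the threshold is borrowed from the slow-request query earlier), and the `rng` parameter exists only so the decision can be tested deterministically.

```python
import random

SLOW_THRESHOLD_MS = 500   # assumption: what counts as "slow"
WARM_SAMPLE_RATE = 0.10   # keep 10% of normal events
LOW_VALUE_FIELDS = {"instance_id", "card_bin"}


def to_warm_tier(event, rng=random.random):
    """Return the trimmed event to keep in warm storage, or None to drop it."""
    is_error = event.get("error") is not None or event.get("status_code", 200) >= 500
    is_slow = event.get("duration_ms", 0) > SLOW_THRESHOLD_MS
    if not (is_error or is_slow) and rng() >= WARM_SAMPLE_RATE:
        return None  # normal event that lost the 10% draw
    return {k: v for k, v in event.items() if k not in LOW_VALUE_FIELDS}


normal = {"status_code": 200, "duration_ms": 120, "error": None, "instance_id": "i-0a1b2c3d"}
failed = {"status_code": 502, "duration_ms": 80, "error": "upstream timeout", "card_bin": "411111"}

print(to_warm_tier(failed, rng=lambda: 0.99))  # always kept, card_bin removed
print(to_warm_tier(normal, rng=lambda: 0.99))  # None
```

This sketch does not record the sample rate on the events it keeps. If you want counts from the warm tier to be re-weighted back to true volumes, you would need to store that alongside each sampled event.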
Gotcha #2: Teams Ship Incomplete Events
Problem: Version 1 of the wide event has 15 fields. Version 2 adds 5 more. Version 3 renames duration_ms to latency_ms. Now queries that used to work break silently.
Why: Wide events evolve fast. There's no schema registry enforcing what a "valid" event looks like.
Fix: Define the event schema as a protobuf or Avro at the platform level. Every service emits the same schema. New fields are additive-only (never rename or delete). Run a validation pipeline that rejects events that don't match the schema — don't let bad data silently pollute the stream.
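To show what "reject events that don't match" means in practice, here is a deliberately simplified stand-in for a schema check. It is plain Python, not protobuf or Avro, and the field lists are a hypothetical subset of the sample event. A real pipeline would generate this from the shared schema definition instead of hand-writing it.

```python
# Hypothetical subset of the platform schema. New fields may be added to
# OPTIONAL_FIELDS; existing names are never renamed or removed.
REQUIRED_FIELDS = {
    "timestamp": str,
    "service": str,
    "endpoint": str,
    "duration_ms": int,
    "status_code": int,
    "trace_id": str,
}
OPTIONAL_FIELDS = {"merchant_id": str, "is_retry": bool, "region": str}


def validate_event(event):
    """Return a list of problems; an empty list means the event is accepted."""
    known = {**REQUIRED_FIELDS, **OPTIONAL_FIELDS}
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS if name not in event]
    for name, value in event.items():
        if name not in known:
            problems.append(f"unknown field: {name}")
        elif value is not None and not isinstance(value, known[name]):
            problems.append(f"wrong type for {name}: expected {known[name].__name__}")
    return problems


good = {"timestamp": "2026-05-19T12:34:56Z", "service": "payment-collector",
        "endpoint": "/api/v1/charge", "duration_ms": 342, "status_code": 200,
        "trace_id": "tr_xyz789", "region": "ap-south-1"}
renamed = {k: v for k, v in good.items() if k != "duration_ms"}
renamed["latency_ms"] = 342

print(validate_event(good))     # []
print(validate_event(renamed))  # flags the missing and the unknown field
```

The rename from the problem statement now fails loudly at ingest, with one error for the missing field and one for the unknown field, instead of quietly producing empty dashboards.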
Gotcha #3: Engineers Stop Writing Logs Altogether
Problem: The team adopts wide events, and suddenly nobody writes logs anymore. A year later, a weird edge case surfaces that needs the text narrative — and there's nothing to read.
Why: Wide events are structured data. Some debugging scenarios need unstructured text — stack traces, LLM prompts, raw SQL queries — that don't fit neatly into a key: value field.
Fix: The rule: "If you're about to console.log(something complex), put it in the wide event's debug field instead." Keep the debug field as a text field that is dropped from hot storage after 48 hours but archived separately in cold storage, so it is there for the exact moment you need it without polluting the main query path.
Gotcha #4: Event Cardinality Creates "Dimension Explosion"
Problem: Someone adds user_agent as a dimension. Suddenly you have 10,000+ unique values for one field. Queries slow down, storage costs spike.
Why: High-cardinality dimensions break columnar stores differently than they break time-series DBs, but they break them nonetheless. A column with many unique values compresses poorly, and any dictionary or index built over it grows with the number of distinct values.
Fix: Classify dimensions into tiers:
- Tier 1 (always queryable): endpoint, service, status_code, duration_ms, region, and the rest of the ten or so dimensions you query every time.
- Tier 2 (sometimes queryable): merchant_id, payment_method, deployment_id — useful for specific investigations, queried less than 10% of the time.
- Tier 3 (rarely queryable): user_agent, card_bin, raw SQL — stored but not indexed in hot storage. Only queryable after a 10-second sacrifice (the user waits while the query runs a full scan).
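The tiering above can be expressed as a small lookup that the ingest path consults before writing to hot storage. The field sets mirror the examples in the list (using duration_ms to match the sample event). The function names, and the choice to treat any unlisted field as Tier 3, are assumptions of this sketch.

```python
TIER_1 = {"endpoint", "service", "status_code", "duration_ms", "region"}
TIER_2 = {"merchant_id", "payment_method", "deployment_id"}


def tier_of(field):
    """Classify a field name; anything unlisted defaults to Tier 3."""
    if field in TIER_1:
        return 1
    if field in TIER_2:
        return 2
    return 3


def split_for_hot_storage(event):
    """Split an event into (indexed fields, stored-but-unindexed fields)."""
    indexed = {k: v for k, v in event.items() if tier_of(k) < 3}
    scan_only = {k: v for k, v in event.items() if tier_of(k) == 3}
    return indexed, scan_only


event = {"endpoint": "/api/v1/charge", "duration_ms": 342,
         "merchant_id": "m_abc123", "user_agent": "curl/8.5.0", "card_bin": "411111"}
indexed, scan_only = split_for_hot_storage(event)
print(indexed)    # endpoint, duration_ms, merchant_id
print(scan_only)  # user_agent, card_bin
```

Defaulting unknown fields to Tier 3 means a newly added dimension like user_agent costs a slow scan rather than an index, until someone deliberately promotes it.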
7. Constraints & Tradeoffs
What Wide Events Cannot Do Well
- Real-time alerting at the event level: If you need sub-second alerting on every single event (e.g., "if a payment exceeds ₹10L, page someone immediately"), wide events add too much latency. Use a lightweight metrics path for real-time thresholds, then enrich with wide events for investigation.
- Full-text search: Wide events are structured, not text. If your debugging workflow is "find me every event that mentions the word 'timeout' in any field", that's still a log use case. Ensure your strategy doesn't eliminate text search entirely.
- Tiny teams with simple systems: If you have 3 services and handle 1K req/s, wide events are overkill. You can debug with tail -f and a simple metrics dashboard. Wide events pay off at complexity scale, not traffic scale.
What You Sacrifice
- Simplicity: One instrumentation path (logs) is simpler than a structured event pipeline with schema validation, hot/warm/cold tiers, and a wide-column query engine. You pay in operational complexity for investigative power.
- Toolchain maturity: Log aggregators (ELK, Loki) and metrics stacks (Prometheus + Grafana) are battle-tested with massive ecosystems. Wide-event-native tools (Honeycomb, SigNoz) are newer and have smaller communities.
- Team training: Every engineer knows how to grep logs and read a Grafana dashboard. Fewer know how to query a wide-event store with high-dimensional filters. You'll spend time teaching the query language.
When to Avoid This Approach
- Your system has fewer than 10 microservices or handles less than 10K req/s. The complexity of a wide-event pipeline isn't justified.
- Your team has no dedicated SRE/infra role. Wide events require ongoing pipeline maintenance — schema versioning, retention tuning, cost monitoring. Without someone owning it, you'll get event drift and cost creep.
- Your primary observability need is compliance auditing, not debugging. Compliance needs immutable, timestamped, tamper-proof records — which look more like append-only logs than mutable wide events.
8. Related Work / Further Reading
- Honeycomb's "Observability 2.0" thesis: Charity Majors' writing on structured events is the canonical reference for wide events. "Observability: A Manifesto" is the starting point.
- Cindy Sridharan's "Distributed Systems Observability": A pragmatic guide to the three pillars that predates the wide events movement — useful as a historical baseline.
- OpenTelemetry: The industry standard for instrumentation. OTel's span attributes and resource attributes map directly to wide event dimensions. Most teams adopt OTel as their instrumentation layer and choose their storage backend (Honeycomb, Grafana Tempo, SigNoz) separately.
- ClickHouse as a wide-event store: Increasingly popular for teams that want to self-host. Columnar, high compression ratio for repeated strings, and SQL-compatible. The tradeoff is operational overhead vs. managed solutions.
- Stripe's observability blog posts: Stripe's approach to structured event instrumentation at scale — they've discussed their "decision log" pattern where every payment decision produces a structured event with the full reasoning context.
9. Conclusion
The "three pillars" model isn't wrong — it's incomplete. Logs, metrics, and traces were designed for different tools at different times in the industry's evolution. They don't compose. When you need to understand a production incident, they hand you three separate views that you have to mentally stitch together.
Wide events are the gap-filler that makes the three pillars obsolete as independent primitives. One event, one query interface, one tool — with the ability to ask any question about your production system, including questions you didn't anticipate.
The teams that move fastest from "something is wrong" to "here's exactly why" aren't the teams with the most sophisticated tracing setup or the best Grafana dashboards. They're the teams that can ask an arbitrary question about their system and get an answer in under 30 seconds.
The golden rule: If producing observability data takes more effort than debugging the incident it was meant to solve, your observability strategy is backwards. Instrument once with wide events. Query infinitely.
Principal Engineer at a payments platform. Previously built payment-routing infrastructure handling $XXB GMV. Writes about distributed systems, reliability, and the engineering meta-skills that actually move the needle.