What is a Percentile in Observability?

When we talk about observability, especially metrics like latency, we often hear terms such as p50, p95, or p99. These are percentiles. They give us a way to understand not just the average behavior of a system, but how it performs for the majority (or the unlucky few) of requests.


Simple Definition

A percentile tells you the value below which a given percentage of measurements fall.

  • p50 (50th percentile) — the median. Half of requests are faster, half are slower.
  • p95 (95th percentile) — 95% of requests are as fast or faster than this value; the remaining 5% are slower.
  • p99 (99th percentile) — only 1% of requests are slower than this.
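To make the definition concrete, here is a minimal sketch that computes p50, p95, and p99 from a batch of latency samples using Python's standard library. The latency distribution is simulated and purely illustrative:

```python
import random
import statistics

# Simulated API latencies in milliseconds (assumed, illustrative distribution)
random.seed(1)
latencies = [random.lognormvariate(4.8, 0.4) for _ in range(10_000)]

# statistics.quantiles with n=100 returns the 99 cut points p1..p99
cuts = statistics.quantiles(latencies, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"p50={p50:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
```

By construction the values are ordered (p50 ≤ p95 ≤ p99), which is a useful sanity check when wiring up your own metrics pipeline.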

Why Percentiles Matter

If you only look at average latency, you might think your service is “fast”.
Example:

  • Average latency: 100 ms
  • But p99 latency: 2 seconds

This means the slowest 1% of requests take two seconds or more — a really bad experience for the users behind them. That’s something averages hide, but percentiles reveal.
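A tiny numeric sketch shows how this happens. With made-up numbers — 990 requests at 100 ms and 10 requests at 2000 ms — the mean stays low while p99 (here via a simple nearest-rank-style index) lands on the slow tail:

```python
# 990 fast requests (~100 ms) and 10 slow ones (~2000 ms); numbers are illustrative
latencies = [100] * 990 + [2000] * 10

mean_ms = sum(latencies) / len(latencies)

# Simple nearest-rank-style p99: the value at the 99% position of the sorted data
ordered = sorted(latencies)
p99_ms = ordered[int(len(ordered) * 0.99)]

print(f"mean={mean_ms:.0f} ms, p99={p99_ms} ms")  # mean=119 ms, p99=2000 ms
```

The average (119 ms) looks healthy; the p99 (2000 ms) exposes the 1% of requests having a terrible time.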


Practical Example

Let’s say you’re measuring API response times:

  • p50 = 120 ms (most requests are quick)
  • p95 = 600 ms (some requests are slower)
  • p99 = 2 seconds (a few are painfully slow)

This tells your SRE/DevOps team that occasional latency spikes exist and may warrant investigation or optimization.
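One way a team might act on readings like these is a per-percentile latency budget check. The budget numbers below are hypothetical, chosen only to match the example above:

```python
# Hypothetical per-percentile latency budgets in milliseconds
budgets_ms = {"p50": 200, "p95": 800, "p99": 1500}

# Observed values from the example above
observed_ms = {"p50": 120, "p95": 600, "p99": 2000}

# Collect every percentile that blew its budget
violations = [name for name, budget in budgets_ms.items()
              if observed_ms[name] > budget]

print(violations)  # ['p99']
```

Here only p99 exceeds its budget, which matches the narrative: most requests are fine, but the tail needs attention.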


Percentiles in SRE Practice

Google’s SRE book emphasizes percentiles because they map well to user experience.

  • SLIs (Service Level Indicators) often use percentiles (e.g., “95% of requests < 300ms”).
  • SLAs and SLOs are usually built around these guarantees.
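An SLI like “95% of requests < 300 ms” can be evaluated directly as the fraction of requests under the threshold. A minimal sketch, with an assumed sample where 96 of 100 requests are fast:

```python
# Illustrative sample: 96 fast requests, 4 slow ones
latencies_ms = [120] * 96 + [600] * 4

THRESHOLD_MS = 300   # the SLI threshold from the example SLO
SLO_TARGET = 0.95    # "95% of requests < 300 ms"

# SLI: fraction of requests faster than the threshold
good = sum(1 for l in latencies_ms if l < THRESHOLD_MS)
sli = good / len(latencies_ms)

slo_met = sli >= SLO_TARGET
print(f"SLI={sli:.2%}, SLO met: {slo_met}")  # SLI=96.00%, SLO met: True
```

Framing the SLI as “fraction of good requests” rather than a single percentile value makes it easy to track error budgets over time.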

Conclusion

Percentiles are a critical tool in observability. They give a more realistic view of system performance than averages, highlighting slow outliers and ensuring you focus on the actual user experience.


✅ Next time you monitor your service, don’t just look at averages — check your p95 and p99. They tell the real story.