What is a Percentile in Observability?

When we talk about observability, especially metrics like latency, we often hear terms such as p50, p95, or p99. These are percentiles. They give us a way to understand not just the average behavior of a system, but how it performs for the majority (or the unlucky few) of requests.


Simple Definition

A percentile tells you the value below which a given percentage of measurements fall.

  • p50 (50th percentile) — the median. Half of requests are faster, half are slower.
  • p95 (95th percentile) — 95% of requests are as fast or faster than this value; the remaining 5% are slower.
  • p99 (99th percentile) — only 1% of requests are slower than this.
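To make the definition concrete, here is a minimal sketch that computes p50, p95, and p99 from a batch of latency samples using Python's standard library. The latency distribution is simulated and purely illustrative:

```python
import random
import statistics

# Simulated API latencies in milliseconds (assumed, illustrative distribution)
random.seed(1)
latencies = [random.lognormvariate(4.8, 0.4) for _ in range(10_000)]

# statistics.quantiles with n=100 returns the 99 cut points p1..p99
cuts = statistics.quantiles(latencies, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"p50={p50:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
```

By construction the values are ordered (p50 ≤ p95 ≤ p99), which is a useful sanity check when wiring up your own metrics pipeline.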

Why Percentiles Matter

If you only look at average latency, you might think your service is “fast”.
Example:

  • Average latency: 100 ms
  • But p99 latency: 2 seconds

This means the slowest 1% of requests take two seconds or more — a really bad experience for the users behind them. That’s something averages hide, but percentiles reveal.
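A tiny numeric sketch shows how this happens. With made-up numbers — 990 requests at 100 ms and 10 requests at 2000 ms — the mean stays low while p99 (here via a simple nearest-rank-style index) lands on the slow tail:

```python
# 990 fast requests (~100 ms) and 10 slow ones (~2000 ms); numbers are illustrative
latencies = [100] * 990 + [2000] * 10

mean_ms = sum(latencies) / len(latencies)

# Simple nearest-rank-style p99: the value at the 99% position of the sorted data
ordered = sorted(latencies)
p99_ms = ordered[int(len(ordered) * 0.99)]

print(f"mean={mean_ms:.0f} ms, p99={p99_ms} ms")  # mean=119 ms, p99=2000 ms
```

The average (119 ms) looks healthy; the p99 (2000 ms) exposes the 1% of requests having a terrible time.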


Practical Example

Let’s say you’re measuring API response times:

  • p50 = 120 ms (most requests are quick)
  • p95 = 600 ms (some requests are slower)
  • p99 = 2 seconds (a few are painfully slow)

This tells your SRE/DevOps team that occasional latency spikes exist and may warrant investigation or optimization.
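One way a team might act on readings like these is a per-percentile latency budget check. The budget numbers below are hypothetical, chosen only to match the example above:

```python
# Hypothetical per-percentile latency budgets in milliseconds
budgets_ms = {"p50": 200, "p95": 800, "p99": 1500}

# Observed values from the example above
observed_ms = {"p50": 120, "p95": 600, "p99": 2000}

# Collect every percentile that blew its budget
violations = [name for name, budget in budgets_ms.items()
              if observed_ms[name] > budget]

print(violations)  # ['p99']
```

Here only p99 exceeds its budget, which matches the narrative: most requests are fine, but the tail needs attention.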


Percentiles in SRE Practice

Google’s SRE book emphasizes percentiles because they map well to user experience.

  • SLIs (Service Level Indicators) often use percentiles (e.g., “95% of requests < 300ms”).
  • SLAs and SLOs are usually built around these guarantees.
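An SLI like “95% of requests < 300 ms” can be evaluated directly as the fraction of requests under the threshold. A minimal sketch, with an assumed sample where 96 of 100 requests are fast:

```python
# Illustrative sample: 96 fast requests, 4 slow ones
latencies_ms = [120] * 96 + [600] * 4

THRESHOLD_MS = 300   # the SLI threshold from the example SLO
SLO_TARGET = 0.95    # "95% of requests < 300 ms"

# SLI: fraction of requests faster than the threshold
good = sum(1 for l in latencies_ms if l < THRESHOLD_MS)
sli = good / len(latencies_ms)

slo_met = sli >= SLO_TARGET
print(f"SLI={sli:.2%}, SLO met: {slo_met}")  # SLI=96.00%, SLO met: True
```

Framing the SLI as “fraction of good requests” rather than a single percentile value makes it easy to track error budgets over time.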

Conclusion

Percentiles are a critical tool in observability. They give a more realistic view of system performance than averages, highlighting slow outliers and ensuring you focus on the actual user experience.


✅ Next time you monitor your service, don’t just look at averages — check your p95 and p99. They tell the real story.