Site Reliability Engineering (SRE) is not just about “keeping things up” — it’s about building systems that are reliable and understandable. At the heart of this idea lies a simple but powerful toolset: the four golden signals.
Let’s break them down in human terms — no jargon, just practical insights.
🚨 What Are the Golden Signals?
Golden signals are the four key metrics that Google’s SRE team recommends tracking for any user-facing service:
- Latency — how long does it take to handle a request?
- Traffic — how many requests are coming in?
- Errors — how many requests fail?
- Saturation — how close is your system to its limits?
🕒 1. Latency
This is how long your system takes to respond. A user clicks a button — how fast do they get a response?
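💡 Tip: track latency separately for successful and failed requests. A fast failure is better than a slow one, but fast errors mixed into the same number can make your overall latency look healthier than it is.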
Prometheus metric example (a common histogram naming convention; your exporter or framework may use a different name):
http_request_duration_seconds
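To make the tip concrete, here is a minimal Python sketch that splits request durations by outcome and computes a nearest-rank 95th percentile for each group. The `percentile` helper and the sample data are hypothetical and only for illustration; with Prometheus you would normally work from histogram buckets rather than raw samples.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (0 < pct <= 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical (duration_seconds, succeeded) pairs for recent requests.
requests = [(0.12, True), (0.15, True), (0.11, True), (0.90, True),
            (2.50, False), (0.02, False)]

# Keep the two populations apart so failures do not skew the success numbers.
ok = [d for d, succeeded in requests if succeeded]
failed = [d for d, succeeded in requests if not succeeded]

p95_ok = percentile(ok, 95)
p95_failed = percentile(failed, 95)
print(f"p95 success: {p95_ok}s, p95 failure: {p95_failed}s")
```

Percentiles are used here instead of an average because a handful of very slow requests can hide behind a healthy-looking mean.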
📈 2. Traffic
Traffic shows the volume of activity. It could be requests per second (RPS), active connections, user sessions, or data throughput.
Metric example:
http_requests_total
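A counter like this only ever goes up, so the useful number is how fast it grows. The sketch below shows one simplified way to turn two counter samples into requests per second. The function name and sample values are made up for illustration, and real query engines handle more edge cases than this.

```python
def per_second_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Average per-second increase of a counter between two samples."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr_value < prev_value:
        # The counter went backwards: assume the process restarted
        # and the counter began again from zero.
        increase = curr_value
    else:
        increase = curr_value - prev_value
    return increase / elapsed

# Hypothetical samples taken 60 seconds apart.
rps = per_second_rate(prev_value=12_000, prev_ts=0, curr_value=13_500, curr_ts=60)
print(rps)
```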
❌ 3. Errors
Errors are failed requests — 5xx codes, timeouts, logic exceptions, etc. Even a small percentage of errors can ruin the user experience.
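Metric example (a naming convention, not a built-in; many setups instead count failures through a status-code label on http_requests_total):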
http_requests_errors_total
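Raw error counts are hard to judge on their own, so they are usually expressed as a share of total traffic. A minimal sketch, assuming you already have the number of total and failed requests for the same time window (the function name and numbers are illustrative):

```python
def error_ratio(total_requests, failed_requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests

# Hypothetical five-minute window: 20,000 requests, 50 of them failed.
ratio = error_ratio(total_requests=20_000, failed_requests=50)
print(f"{ratio:.2%}")
```

Returning 0.0 for an empty window is a design choice; depending on the service, no traffic at all may itself be worth an alert.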
💥 4. Saturation
Saturation means how “full” your system is: CPU, memory, disk I/O, database connections. Many systems start to degrade well before they hit 100%, so if you’re constantly near the limit, you’re living dangerously.
Metric examples:
node_cpu_seconds_total
container_memory_usage_bytes
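Note that node_cpu_seconds_total is a counter of CPU seconds per mode (idle, user, system, and so on), not a percentage, so utilization has to be derived from it. The following sketch shows the arithmetic for the idle mode. The function name and sample values are assumptions for illustration.

```python
def cpu_utilization(idle_before, idle_after, elapsed_seconds, cores):
    """Fraction of CPU capacity in use, derived from an idle-seconds counter."""
    capacity = elapsed_seconds * cores  # total CPU-seconds available in the window
    if capacity <= 0:
        raise ValueError("elapsed_seconds and cores must be positive")
    idle = idle_after - idle_before
    busy_fraction = 1 - idle / capacity
    # Clamp to [0, 1] to absorb small timing jitter between samples.
    return min(1.0, max(0.0, busy_fraction))

# Hypothetical: a 4-core machine was idle for 60 CPU-seconds during a 60-second window.
util = cpu_utilization(idle_before=1000.0, idle_after=1060.0, elapsed_seconds=60, cores=4)
print(f"{util:.0%}")
```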
🛠 How to Use Golden Signals
To make the most of them:
- Collect these metrics with Prometheus, Datadog, New Relic, etc.
- Build dashboards for each signal.
- Set up alerts for threshold breaches.
- Watch for trends — rising latency, creeping saturation, error spikes.
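One way to keep threshold alerts from firing on a single noisy data point is to require the breach to hold for several samples in a row. This is a rough sketch of that idea, not how any particular monitoring tool implements it; the function name, threshold, and sample values are all illustrative.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if the most recent `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False  # not enough data yet to decide
    return all(value > threshold for value in samples[-min_consecutive:])

# Hypothetical p95 latency readings in seconds, one per minute.
rising = [0.2, 0.3, 0.6, 0.7, 0.8]
spiky = [0.2, 0.9, 0.3, 0.6, 0.7]

alert_rising = sustained_breach(rising, threshold=0.5, min_consecutive=3)
alert_spiky = sustained_breach(spiky, threshold=0.5, min_consecutive=3)
print(alert_rising, alert_spiky)
```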
🎯 Final Thoughts
If you only track four things, track these.
Golden signals give you fast feedback when your system is in trouble or about to be. They won’t tell you everything, but they cover the most common ways a user-facing service goes wrong.