Introduction

etcd is a distributed key-value store that plays a critical role in Kubernetes by storing cluster configuration and state. When etcd runs into problems, it can cause cluster instability or downtime. This article covers common etcd errors, their underlying causes, and actionable solutions.


1. etcdserver: request timed out

❓ Cause

Occurs when etcd cannot complete a request within its timeout, most often because of slow disk I/O (slow writes to the write-ahead log) or network latency between cluster members.

🛠️ Solution

  • Check disk performance:
    iostat -xz 1
    
  • Ensure etcd data is on SSD storage.
  • Check network latency and connectivity between cluster members:
    ping <etcd-member-IP>
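
If the disk and network checks look clean, etcd itself can report on request latency. A quick health and performance check, assuming etcdctl v3 with endpoints and TLS certificates already configured via the ETCDCTL_* environment variables:

    # Report health and latency of every member
    etcdctl endpoint health --cluster -w table
    # Run etcd's built-in performance check (generates test load for about a minute)
    etcdctl check perf

On the metrics side, the etcd_disk_wal_fsync_duration_seconds and etcd_network_peer_round_trip_time_seconds histograms show whether disk or network is the bottleneck.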
    

2. etcdserver: leader changed

❓ Cause

The cluster elected a new leader while a request was in flight. Frequent leader changes usually point to overloaded members, slow disks, or an unreliable network between peers.

🛠️ Solution

  • Check resource usage (CPU/memory).
  • Ensure clock synchronization (e.g., using NTP).
  • Investigate the etcd logs (journalctl applies when etcd runs as a systemd service):
    journalctl -u etcd
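
If elections are triggered by transient latency rather than genuinely failed members, relaxing etcd's raft timings can help. A sketch of the relevant startup flags (values are illustrative; the defaults are 100 ms and 1000 ms, and the election timeout should be at least 10x the heartbeat interval):

    # etcd startup flags (or their ETCD_* environment-variable equivalents), in milliseconds
    --heartbeat-interval=300
    --election-timeout=3000

Raising these makes the cluster slower to detect a real leader failure, so treat it as a mitigation for high-latency links rather than a fix for overloaded nodes.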
    

3. etcdserver: mvcc: database space exceeded

❓ Cause

The etcd backend database has grown beyond its storage quota (2 GB by default). When that happens, etcd raises a cluster-wide NOSPACE alarm and accepts only reads and deletes until the alarm is cleared.

🛠️ Solution

  • Increase the backend quota (this is the environment-variable form of etcd's --quota-backend-bytes flag; 8589934592 bytes is 8 GiB, the largest size etcd recommends):
    ETCD_QUOTA_BACKEND_BYTES=8589934592

  • Compact old revisions, defragment, and clear the NOSPACE alarm (full sequence below):
    etcdctl defrag
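
Raising the quota alone does not clear the NOSPACE alarm. A minimal recovery sequence, assuming etcdctl v3 with endpoints already configured (the revision extraction is one common shell approach):

    # Find the current revision
    rev=$(etcdctl endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]+' | head -1)
    # Compact away all revisions older than the current one
    etcdctl compaction $rev
    # Defragment to return freed space to the filesystem (defrag acts only on the targeted endpoint, so run it for each member)
    etcdctl defrag
    # Clear the NOSPACE alarm so writes are accepted again
    etcdctl alarm disarm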
    

4. Data Inconsistency Between Members

❓ Cause

Network partitions or clock drift can result in inconsistent views.

🛠️ Solution

  • Ensure proper peer communication ports are open.
  • Sync system clocks using NTP.
  • Replace the failed member, restoring its data from a snapshot if needed (see the sketch below).
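
A sketch of the snapshot-based replacement, assuming etcdctl v3; the paths are placeholders, and on kubeadm clusters the live data directory is typically /var/lib/etcd:

    # Take a snapshot from a healthy member
    etcdctl snapshot save /var/backups/etcd-snapshot.db
    # On the member being rebuilt, restore into a fresh data directory
    etcdctl snapshot restore /var/backups/etcd-snapshot.db --data-dir=/var/lib/etcd-restored
    # Point the member's etcd configuration at the restored data directory and restart it

For a multi-member cluster, the restore also needs --name, --initial-cluster, and --initial-advertise-peer-urls set to match the member being rebuilt.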

5. Cluster is Unhealthy

❓ Cause

One or more etcd members are unreachable.

🛠️ Solution

  • Check etcd pod logs:
    kubectl logs -n kube-system etcd-<node-name>
    
  • Ensure firewall rules or security groups aren’t blocking client (2379) or peer (2380) communication; the check below queries health across the cluster.
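
To see the cluster the way etcd sees it, query health directly. A hedged example using the certificate paths kubeadm generates by default (adjust paths, IPs, and ports for your environment):

    # Report health of every member
    etcdctl --endpoints=https://<etcd-member-IP>:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      endpoint health --cluster
    # Verify the client (2379) and peer (2380) ports are reachable from other members
    nc -zv <etcd-member-IP> 2379
    nc -zv <etcd-member-IP> 2380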

Best Practices

  • Use SSDs for etcd storage.
  • Enable regular snapshot backups.
  • Monitor etcd metrics via Prometheus/Grafana (a quick spot-check is sketched below).
  • Avoid colocating etcd with other high-I/O workloads.
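
For the monitoring bullet above, a quick spot-check of the most useful signals, assuming the plain-HTTP metrics endpoint kubeadm exposes on localhost port 2381 (adjust the URL for your setup):

    # Leader churn, WAL fsync latency, and database size in one grep
    curl -s http://127.0.0.1:2381/metrics | grep -E \
      'leader_changes_seen_total|wal_fsync_duration_seconds_sum|mvcc_db_total_size_in_bytes'

A steadily rising leader_changes_seen_total or WAL fsync latency above a few tens of milliseconds is usually the first sign of the timeout and leader-change errors described earlier.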

Conclusion

etcd is vital for Kubernetes stability. Knowing how to diagnose and address common issues can help prevent major outages. Use this guide to troubleshoot effectively and keep your cluster healthy.