🕸 Kubernetes Disaster Recovery: A Guide to etcd Backup and Restore

Oct 15, 2024

Kubernetes is a highly available and scalable platform for managing containerized applications, but to ensure that your cluster is resilient to failures, you need to protect the state of its core component: etcd. As the backing store for Kubernetes, etcd holds critical data that defines your cluster, such as configuration details, secrets, and metadata about all the running workloads. If etcd data is lost or corrupted, the API server loses its record of the cluster state, and every workload that depends on that state can be affected.

In this article, we’ll dive deep into how to back up and restore the etcd database, an essential task for any Kubernetes administrator. We’ll also cover where to find important paths, how to install etcdctl, and best practices for automating backups in production.

Why Backing Up etcd is Crucial

Kubernetes relies on etcd to store the state of its components: everything from nodes and pods to ConfigMaps, secrets, and policies. A failure in etcd could lead to the complete loss of your cluster’s state, resulting in service disruptions or, worse, permanent data loss.

Regular backups of etcd ensure that you can restore your cluster’s state after a failure, which limits downtime and bounds data loss to the changes made since the last snapshot.

etcd Backup and Restore: Step-by-Step Guide

Prerequisites

Before performing backups or restores, you need:

Access to the control plane: etcd runs on the control plane nodes in your Kubernetes cluster.
etcdctl tool: This command-line tool is used to interact with etcd. You’ll need it to take backups and restore from snapshots.
Certificates and Keys: Since etcd is usually secured with TLS, you'll need a CA certificate plus a certificate and key pair (ca.crt, server.crt, and server.key in the examples here) to authenticate with etcd.

Step 1: Install the etcdctl Tool

The etcdctl tool is usually not pre-installed, so you'll need to install it. Choose the etcdctl release that matches the etcd version running in your cluster (not the Kubernetes version); the example below uses v3.5.0.

Download etcd binaries: Go to the official etcd releases page and download the appropriate release for your system.

Example for Linux:

wget https://github.com/etcd-io/etcd/releases/download/v3.5.0/etcd-v3.5.0-linux-amd64.tar.gz
tar -xvf etcd-v3.5.0-linux-amd64.tar.gz
sudo mv etcd-v3.5.0-linux-amd64/etcdctl /usr/local/bin/
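To avoid editing the version in three places, the download URL can be built from a single value. This is a minimal sketch: the function name etcd_release_url is hypothetical, and it assumes the Linux tarball naming shown in the commands above (other platforms may be packaged differently).

```shell
#!/usr/bin/env bash
# Build the release tarball URL for a given etcd version.
# Follows the naming pattern of the v3.5.0 example: etcd-<version>-<os>-<arch>.tar.gz
etcd_release_url() {
  local version="$1"        # release tag, e.g. v3.5.0
  local os="${2:-linux}"    # assumption: Linux tarball naming
  local arch="${3:-amd64}"  # e.g. amd64 or arm64
  echo "https://github.com/etcd-io/etcd/releases/download/${version}/etcd-${version}-${os}-${arch}.tar.gz"
}

# Example (commented out so that pasting this block downloads nothing):
# wget "$(etcd_release_url v3.5.0)"
```

Calling etcd_release_url v3.5.0 prints the same URL used in the wget command above.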

Verify installation: After installing, check that etcdctl is available by running:

etcdctl version
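To compare the installed etcdctl against the etcd server in the cluster, the server version can be read from the image tag in the static pod manifest. The sketch below assumes a kubeadm-style manifest with a line such as "image: registry.k8s.io/etcd:3.5.15-0"; that tag is only an illustration, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print the etcd version found in the image tag of a static pod manifest.
# Assumption: the manifest contains a line like "image: <registry>/etcd:<version>-<build>".
etcd_version_from_manifest() {
  local manifest="${1:-/etc/kubernetes/manifests/etcd.yaml}"
  # Take the text after the last colon on the first "image:" line,
  # then drop a leading "v" and a trailing "-<build>" suffix if present.
  sed -n 's/^[[:space:]]*image:[[:space:]]*.*:\([^:]*\)$/\1/p' "$manifest" \
    | head -n 1 \
    | sed 's/^v//; s/-[0-9]*$//'
}

# Example (run on a control plane node):
# etcd_version_from_manifest
```

The printed value can then be compared by eye with the first line of the etcdctl version output.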

Step 2: Take a Backup of etcd

To manually take an etcd backup, you need to use the etcdctl tool. Here's how to back up the state of your Kubernetes cluster.

Get etcd Paths

You’ll need the following paths on your control plane node:

ca.crt: The certificate authority file that etcd uses for securing communications. Typically located at /etc/kubernetes/pki/etcd/ca.crt.
server.crt and server.key: The certificates used by the etcd server. They are usually stored in /etc/kubernetes/pki/etcd/server.crt and /etc/kubernetes/pki/etcd/server.key.
Endpoint: The endpoint of your etcd service, typically https://127.0.0.1:2379.

You can verify these paths by navigating to the directories or by checking your Kubernetes manifest under /etc/kubernetes/manifests/etcd.yaml.
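Rather than reading the manifest by eye, the relevant values can be pulled out of etcd's command-line arguments. The sketch below is one way to do it, assuming the manifest lists arguments in the usual "- --flag=value" form; the function name is hypothetical, while cert-file, key-file, trusted-ca-file, and listen-client-urls are etcd's own flag names.

```shell
#!/usr/bin/env bash
# Print the value of one "--flag=value" argument from the etcd manifest.
etcd_flag_from_manifest() {
  local flag="$1"   # flag name without the leading dashes, e.g. cert-file
  local manifest="${2:-/etc/kubernetes/manifests/etcd.yaml}"
  # Match YAML list items such as "    - --cert-file=/etc/kubernetes/pki/etcd/server.crt".
  # Anchoring on "- --<flag>=" keeps cert-file from matching peer-cert-file.
  sed -n "s/^[[:space:]]*-[[:space:]]*--${flag}=\(.*\)\$/\1/p" "$manifest" | head -n 1
}

# Example (run on a control plane node):
# for f in trusted-ca-file cert-file key-file listen-client-urls; do
#   printf '%s = %s\n' "$f" "$(etcd_flag_from_manifest "$f")"
# done
```

The listen-client-urls value may contain several comma-separated URLs; any one that is reachable from the node can serve as the endpoint.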

Create the Backup

With these paths in hand, you can now create a backup snapshot.

ETCDCTL_API=3 etcdctl snapshot save /path/to/backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

/path/to/backup.db: This is the location where you want to store the backup file.
Ensure that the --cacert, --cert, and --key flags point to the correct certificate files.

After running this command, the backup file (backup.db) will be saved at the specified location.
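It is worth confirming that the snapshot is readable before relying on it. The sketch below first checks that the file is non-empty and then runs etcdctl snapshot status, which reads the file locally and reports its hash, revision, key count, and size. The function name and the ETCDCTL_BIN override are hypothetical; the override only exists so the binary can be swapped (for example for etcdutl, which newer etcd releases prefer for this subcommand).

```shell
#!/usr/bin/env bash
# Sanity-check a snapshot file after "etcdctl snapshot save".
verify_snapshot() {
  local snapshot="$1"
  # ETCDCTL_BIN is a hypothetical override; it defaults to etcdctl on the PATH.
  local etcdctl_bin="${ETCDCTL_BIN:-etcdctl}"
  # A missing or zero-byte file means the save did not complete.
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi
  # Reads the file locally; no endpoint or certificates are involved.
  ETCDCTL_API=3 "$etcdctl_bin" snapshot status "$snapshot" --write-out=table
}

# Example:
# verify_snapshot /path/to/backup.db
```

A non-zero exit status from this function is a signal to retake the backup rather than keep the file.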

Step 3: Automate etcd Backups

To avoid manually creating backups, especially in production environments, it’s recommended to automate the backup process. You can achieve this by scheduling a cron job on the control plane node.

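Example of a cron job that creates a daily backup at 2 AM. A crontab entry must fit on a single line (cron does not support backslash line continuations), each % must be escaped as \%, and the full path to etcdctl is used because cron runs with a minimal PATH:

0 2 * * * /bin/bash -c 'ETCDCTL_API=3 /usr/local/bin/etcdctl snapshot save /backups/etcd-backup-$(date +\%Y-\%m-\%d).db --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key'

This cron job generates a date-stamped backup and stores it in the /backups directory every day. The directory must already exist, and the job must run as a user that can read the etcd certificates (typically root).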
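A daily job adds one file per day, so the backup directory grows until someone cleans it up. The following sketch prunes old snapshots; the function name and the 7-day default are assumptions, and it relies on the -maxdepth and -delete options found in GNU and BSD find.

```shell
#!/usr/bin/env bash
# Delete date-stamped snapshots older than a retention window and print what was removed.
prune_old_backups() {
  local backup_dir="${1:-/backups}"
  local keep_days="${2:-7}"   # assumption: one week of retention
  # -mtime +N matches files whose age, counted in whole 24-hour periods,
  # is greater than N. Only files named etcd-backup-*.db are considered.
  find "$backup_dir" -maxdepth 1 -type f -name 'etcd-backup-*.db' \
    -mtime "+${keep_days}" -print -delete
}

# Example:
# prune_old_backups /backups 14
```

Saved as a script, this could run from a second crontab entry shortly after the backup job.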

Step 4: Restore etcd from a Backup

Restoring etcd from a backup is necessary when recovering from a failure or corruption. Follow these steps to restore your etcd cluster from a snapshot.

Stop the etcd Service

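Before restoring, stop etcd on the control plane node. If etcd runs as a systemd service, use the command below. On kubeadm clusters etcd runs as a static pod, so instead move /etc/kubernetes/manifests/etcd.yaml out of the manifests directory and the kubelet will stop the pod: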

sudo systemctl stop etcd
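Where etcd runs as a static pod (the /etc/kubernetes/manifests/etcd.yaml layout mentioned earlier), there is no systemd unit to stop; the kubelet stops the pod a short while after its manifest leaves the manifests directory and starts it again when the manifest returns. A minimal sketch of that pause and resume follows. The function names and the manifests-paused parking directory are hypothetical.

```shell
#!/usr/bin/env bash
# Pause a static pod by moving its manifest out of the kubelet's manifest directory.
pause_static_pod() {
  local name="$1"   # manifest name without extension, e.g. etcd
  local manifest_dir="${2:-/etc/kubernetes/manifests}"
  local park_dir="${3:-/etc/kubernetes/manifests-paused}"   # hypothetical location
  if [ ! -f "$manifest_dir/$name.yaml" ]; then
    echo "no manifest for $name in $manifest_dir" >&2
    return 1
  fi
  mkdir -p "$park_dir" && mv "$manifest_dir/$name.yaml" "$park_dir/"
}

# Resume the pod by moving the manifest back.
resume_static_pod() {
  local name="$1"
  local manifest_dir="${2:-/etc/kubernetes/manifests}"
  local park_dir="${3:-/etc/kubernetes/manifests-paused}"
  if [ ! -f "$park_dir/$name.yaml" ]; then
    echo "no paused manifest for $name in $park_dir" >&2
    return 1
  fi
  mv "$park_dir/$name.yaml" "$manifest_dir/"
}

# Example (as root on the control plane node):
# pause_static_pod etcd      # before the restore
# resume_static_pod etcd     # after the restored data is in place
```

The parking directory should sit outside the manifests directory, since the kubelet treats every manifest inside it as a pod to run.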

Restore the Snapshot

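Use the etcdctl snapshot restore command to restore the backup. The restore reads the snapshot file locally and does not contact a running etcd, so the endpoint and certificate flags are not needed. Run it as root (or with sudo), since it writes under /var/lib/etcd. In etcd v3.5 and later, etcdctl snapshot restore is deprecated in favor of etcdutl snapshot restore, which accepts the same --data-dir argument.

ETCDCTL_API=3 etcdctl snapshot restore /path/to/backup.db \
  --data-dir /var/lib/etcd/new.etcd

--data-dir: The directory where the restored etcd data will be written. It must not already contain data.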

Replace the Original Data Directory

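Once the restore is complete, move the existing data directory aside and put the restored data in its place. Because new.etcd was created inside /var/lib/etcd, it moves along with that directory in the first step:

sudo mv /var/lib/etcd /var/lib/etcd-backup
sudo mv /var/lib/etcd-backup/new.etcd /var/lib/etcd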
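Swapping directories by hand is easy to get wrong under pressure, so the step can be wrapped in a function with a couple of guards. This is a sketch under the layout used here, where the restored data sits in a subdirectory of the live data directory; the function name is hypothetical, and the check relies on the member/ directory that a snapshot restore creates.

```shell
#!/usr/bin/env bash
# Move the live data directory aside and promote the restored data into its place.
promote_restored_data() {
  local live_dir="$1"        # e.g. /var/lib/etcd
  local restored_name="$2"   # subdirectory created by the restore, e.g. new.etcd
  local saved_dir="$3"       # where the old data is kept, e.g. /var/lib/etcd-backup
  # Refuse to touch the live data unless the restore actually produced a member/ directory.
  if [ ! -d "$live_dir/$restored_name/member" ]; then
    echo "no restored data in $live_dir/$restored_name" >&2
    return 1
  fi
  # Refuse to overwrite an earlier saved copy.
  if [ -e "$saved_dir" ]; then
    echo "refusing to overwrite existing $saved_dir" >&2
    return 1
  fi
  # The restored subdirectory travels with the first move, so it is picked up
  # from inside the saved copy in the second move.
  mv "$live_dir" "$saved_dir" && mv "$saved_dir/$restored_name" "$live_dir"
}

# Example (as root, with etcd stopped):
# promote_restored_data /var/lib/etcd new.etcd /var/lib/etcd-backup
```

Keeping the old directory until the cluster is confirmed healthy leaves a way back if the restored snapshot turns out to be the wrong one.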

Restart etcd

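After the data is restored, start etcd again. With a systemd service, run the command below; with a static pod, move the etcd manifest back into /etc/kubernetes/manifests: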

sudo systemctl start etcd

Check Cluster Health

Ensure that the etcd cluster is healthy:

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

You should see that the etcd node is healthy.
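etcd may need a few seconds after a restart before it reports healthy, so a single check can fail even when the restore went well. The sketch below retries any command until it succeeds; the function name and the retry counts in the example are assumptions, and it relies on etcdctl endpoint health returning a non-zero exit status while the endpoint is unhealthy.

```shell
#!/usr/bin/env bash
# Run a health command repeatedly until it succeeds or the attempts run out.
wait_until_healthy() {
  local attempts="$1"   # maximum number of tries
  local delay="$2"      # seconds to sleep between tries
  shift 2               # the remaining arguments are the command to run
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}

# Example: up to 12 tries, 5 seconds apart.
# wait_until_healthy 12 5 env ETCDCTL_API=3 etcdctl \
#   --endpoints=https://127.0.0.1:2379 \
#   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
#   --cert=/etc/kubernetes/pki/etcd/server.crt \
#   --key=/etc/kubernetes/pki/etcd/server.key \
#   endpoint health
```

Because the function only looks at the exit status, the same wrapper works for any other readiness check.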

Best Practices for etcd Backup and Restore

Backup Frequency: Back up etcd on a fixed schedule, and take an extra snapshot before upgrades or other significant changes to the cluster.
Storage Location: Store etcd backups in a secure, off-site location (e.g., cloud storage).
Automated Monitoring: Use monitoring tools (e.g., Prometheus or etcd alarms) to keep an eye on the health of your etcd cluster.
Backup Validation: Periodically test your backups by restoring them in a non-production environment to ensure they are usable.
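When snapshots are copied off-site, a checksum stored next to each file makes it possible to detect a truncated or corrupted copy before a restore is attempted. The following is a minimal sketch using sha256sum from GNU coreutils; the function names and the .sha256 suffix are assumptions rather than an etcd convention.

```shell
#!/usr/bin/env bash
# Write "<snapshot>.sha256" next to the snapshot.
write_checksum() {
  local snapshot="$1"
  # Work inside the snapshot's directory so the checksum file records a
  # relative name and stays valid after both files are copied elsewhere.
  ( cd "$(dirname "$snapshot")" \
      && sha256sum "$(basename "$snapshot")" > "$(basename "$snapshot").sha256" )
}

# Verify the snapshot against its checksum file; non-zero exit status on mismatch.
verify_checksum() {
  local snapshot="$1"
  ( cd "$(dirname "$snapshot")" \
      && sha256sum --check --quiet "$(basename "$snapshot").sha256" )
}

# Example:
# write_checksum /backups/etcd-backup-2024-10-15.db     # before uploading
# verify_checksum /restore/etcd-backup-2024-10-15.db    # after downloading
```

A checksum only shows that the copy matches the original file; restoring the snapshot in a non-production environment is still the real test of whether it is usable.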

Conclusion

Backing up and restoring etcd is crucial to maintaining the integrity and availability of your Kubernetes cluster. By regularly backing up etcd and automating the process in production, you can safeguard your cluster’s state and quickly recover from failures. Whether you're managing a small development cluster or a large production environment, etcd backup strategies are essential for effective disaster recovery.