# Disaster Recovery Guide

## Overview

This guide provides procedures for recovering from various failure scenarios in the homelab.

## Quick Recovery Matrix
| Scenario | Impact | Recovery Time | Procedure |
|---|---|---|---|
| Single node failure | Partial | < 5 min | Node Failure |
| Manager node down | Service disruption | < 10 min | Manager Recovery |
| Storage failure | Data risk | < 30 min | Storage Recovery |
| Network outage | Complete | < 15 min | Network Recovery |
| Complete disaster | Full rebuild | < 2 hours | Full Recovery |
## Node Failure

### Symptoms

- Node unreachable via SSH
- Docker services not running on node
- Swarm reports node as "Down"

### Recovery Steps
1. Verify node status:

   ```bash
   docker node ls  # Look for "Down" status
   ```

2. Attempt to restart the node (if accessible):

   ```bash
   ssh user@<node-ip> sudo reboot
   ```

3. If the node is unrecoverable, remove it (see the drain sketch below):

   ```bash
   # Remove from Swarm
   docker node rm <node-id> --force
   # Services will automatically reschedule to healthy nodes
   ```

4. Add a replacement node:

   ```bash
   # On the manager node, get the join token
   docker swarm join-token worker

   # On the new node, join the swarm
   docker swarm join --token <token> 192.168.1.196:2377
   ```
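If the failed node is flapping rather than fully dead, draining it before removal lets Swarm migrate tasks gracefully instead of waiting for them to fail over. A minimal sketch (the node ID is a placeholder):

```bash
# Stop scheduling new tasks on the node and migrate existing ones
docker node update --availability drain <node-id>

# Wait for tasks to reschedule, then confirm nothing is left on the node
docker node ps <node-id>

# Once empty, remove it from the swarm
docker node rm <node-id>
```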
## Manager Node Recovery

### Symptoms

- Cannot access Portainer UI
- Swarm commands fail
- DNS services disrupted

### Recovery Steps
1. Promote a worker to manager (from another manager, if available):

   ```bash
   docker node promote <worker-node-id>
   ```

2. Restore from backup:

   ```bash
   # Stop Docker on the failed manager
   sudo systemctl stop docker

   # Restore Portainer data
   restic restore latest --target /tmp/restore
   sudo cp -r /tmp/restore/portainer /var/lib/docker/volumes/portainer/_data/

   # Start Docker
   sudo systemctl start docker
   ```

3. Reconfigure DNS (if Pi-hole is affected):

   ```bash
   # Temporarily point router DNS to another Pi-hole instance
   # Update router DNS to: 192.168.1.245, 192.168.1.62
   ```
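If the failed node was the only manager (or a majority of managers is lost), the swarm has no quorum and `docker node promote` will fail. Docker provides a built-in recovery path for this; a sketch, assuming 192.168.1.196 is the surviving manager's address:

```bash
# Rebuild a single-manager cluster from this node's local swarm state.
# Existing services and worker registrations are preserved.
docker swarm init --force-new-cluster --advertise-addr 192.168.1.196
```

After the swarm is back, promote additional managers to restore fault tolerance.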
## Storage Failure

### ZFS Pool Failure

#### Symptoms

- `zpool status` shows DEGRADED or FAULTED
- I/O errors in logs

#### Recovery Steps
1. Check pool status:

   ```bash
   zpool status tank
   ```

2. If a disk failed:

   ```bash
   # Replace the failed disk
   zpool replace tank /dev/old-disk /dev/new-disk

   # Monitor resilver progress
   watch zpool status tank
   ```

3. If the pool is destroyed:

   ```bash
   # Recreate the pool
   bash /workspace/homelab/scripts/zfs_setup.sh

   # Restore from backup
   restic restore latest --target /tank/docker
   ```
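Pool degradation is easy to miss until I/O starts failing. A minimal cron-able health check; the `mail` command and recipient are assumptions, so swap in whatever alerting the homelab actually uses:

```bash
#!/usr/bin/env bash
# zpool status -x prints "all pools are healthy" when nothing is wrong,
# and a full status report otherwise.
STATUS=$(zpool status -x)
if [ "$STATUS" != "all pools are healthy" ]; then
    # Alert channel is an assumption; replace with your own notifier
    echo "$STATUS" | mail -s "ZFS pool problem on $(hostname)" admin@example.com
fi
```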
### NAS Failure

#### Recovery Steps
1. Check NAS connectivity:

   ```bash
   ping 192.168.1.200
   mount | grep /mnt/nas
   ```

2. Remount the NAS (see the fstab sketch below):

   ```bash
   sudo umount /mnt/nas
   sudo mount -a
   ```

3. If the NAS hardware failed:
   - Services using NAS volumes will fail
   - Redeploy services to use local storage temporarily
   - Restore NAS from Time Capsule backup
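For `sudo mount -a` to work on a rebuilt node, the NAS needs an entry in `/etc/fstab`. A sketch, assuming an NFS export; the export path `/volume1/docker` is a placeholder, so check the actual share name on the NAS:

```
# /etc/fstab -- NAS mount (export path is an assumption)
192.168.1.200:/volume1/docker  /mnt/nas  nfs  defaults,_netdev,soft,timeo=50  0  0
```

The `_netdev` option delays the mount until the network is up, and `soft` lets I/O fail rather than hang forever if the NAS goes away again.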
## Network Recovery

### Complete Network Outage

#### Recovery Steps
1. Check physical connections:
   - Verify all cables are connected
   - Check switch power and status LEDs
   - Restart the switch

2. Verify the router:

   ```bash
   ping 192.168.1.1
   # If no response, restart the router
   ```

3. Check VLAN configuration:

   ```bash
   ip -d link show

   # Reapply if needed
   bash /workspace/homelab/scripts/vlan_firewall.sh
   ```

4. Restart networking (see the connectivity sweep below to confirm recovery):

   ```bash
   sudo systemctl restart networking
   # Or on each node:
   sudo reboot
   ```
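Once the network is back, a quick sweep of the critical hosts confirms recovery faster than checking each one by hand. A minimal sketch using the addresses from this guide (router, Swarm manager, NAS, both Pi-holes):

```bash
#!/usr/bin/env bash
# Ping each critical host once and report reachability
HOSTS="192.168.1.1 192.168.1.196 192.168.1.200 192.168.1.245 192.168.1.62"
for host in $HOSTS; do
    if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
        echo "OK    $host"
    else
        echo "FAIL  $host"
    fi
done
```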
### Partial Network Issues

#### DNS Not Resolving

```bash
# Check Pi-hole status
docker ps | grep pihole

# Restart Pi-hole
docker restart <pihole-container>

# Temporarily use public DNS (sudo does not apply to shell
# redirection, so pipe through tee instead of "sudo echo ... >")
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```
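To tell whether the problem is Pi-hole itself or the resolver configuration on the client, query a Pi-hole instance directly; a quick check (the test domain is arbitrary):

```bash
# Ask the primary Pi-hole directly, bypassing /etc/resolv.conf
dig @192.168.1.245 example.com +short

# Compare against a public resolver
dig @8.8.8.8 example.com +short
```

If the direct query works but normal resolution fails, the issue is client-side resolver config rather than Pi-hole. Note that on hosts running systemd-resolved or NetworkManager, `/etc/resolv.conf` may be regenerated automatically, so the edit above is only a stopgap.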
#### Traefik Not Routing

```bash
# Check the Traefik service
docker service ls | grep traefik
docker service ps traefik_traefik

# Check logs
docker service logs traefik_traefik

# Force a rolling restart
docker service update --force traefik_traefik
```
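If the service is running but routes still fail, test routing directly against a node while bypassing DNS; a sketch, where the hostname is a placeholder for one of your Traefik router rules:

```bash
# Pin the domain to the manager node's IP, bypassing DNS entirely,
# and print only the HTTP status code
curl -sk --resolve your-domain.com:443:192.168.1.196 \
    https://your-domain.com/ -o /dev/null -w "%{http_code}\n"
```

A 404 here usually means Traefik is up but no router matched the host; a connection error points at the ingress network or the Traefik service itself.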
## Complete Disaster Recovery

### Scenario: Total Infrastructure Loss

### Prerequisites

- Restic backups to Backblaze B2 (off-site)
- Hardware replacement available
- Network infrastructure functional

### Recovery Steps
1. Rebuild core infrastructure (2-4 hours):

   ```bash
   # Install base OS on all nodes
   # Configure network (static IPs, hostnames)

   # Install Docker on all nodes
   curl -fsSL https://get.docker.com | sh
   sudo usermod -aG docker $USER

   # Initialize Swarm on the manager
   docker swarm init --advertise-addr 192.168.1.196

   # Join workers
   docker swarm join-token worker  # Get token
   # Run on each worker with the token
   ```

2. Restore storage:

   ```bash
   # Recreate the ZFS pool
   bash /workspace/homelab/scripts/zfs_setup.sh

   # Mount the NAS
   # Follow: /workspace/homelab/docs/guides/NAS_Mount_Guide.md
   ```

3. Restore from backups:

   ```bash
   # Install restic
   sudo apt-get install restic

   # Configure credentials
   export B2_ACCOUNT_ID="..."
   export B2_ACCOUNT_KEY="..."
   export RESTIC_REPOSITORY="b2:bucket:/backups"
   export RESTIC_PASSWORD="..."

   # List snapshots
   restic snapshots

   # Restore latest
   restic restore latest --target /tmp/restore

   # Copy to Docker volumes
   sudo cp -r /tmp/restore/* /var/lib/docker/volumes/
   ```

4. Redeploy services:

   ```bash
   # Deploy all stacks
   bash /workspace/homelab/scripts/deploy_all.sh

   # Verify deployment
   bash /workspace/homelab/scripts/validate_deployment.sh
   ```
5. Verify recovery (see the replica check below):
   - Check all services: `docker service ls`
   - Test Traefik routing: `curl https://your-domain.com`
   - Verify Portainer UI access
   - Check Grafana dashboards
   - Test Home Assistant
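A quick way to spot services that came back with fewer replicas than desired; a minimal sketch using `docker service ls` output formatting:

```bash
#!/usr/bin/env bash
# Print any service whose running replica count differs from its
# desired count (Replicas is formatted as "current/desired")
docker service ls --format '{{.Name}} {{.Replicas}}' | while read -r name replicas; do
    current=${replicas%%/*}
    desired=${replicas##*/}
    if [ "$current" != "$desired" ]; then
        echo "DEGRADED: $name ($replicas)"
    fi
done
```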
## Backup Verification

### Monthly Backup Test
```bash
# List snapshots
restic snapshots

# Verify repository integrity, reading a random 10% of the data
restic check --read-data-subset=10%

# Test restore
mkdir /tmp/restore-test
restic restore <snapshot-id> --target /tmp/restore-test --include /path/to/critical/file

# Compare with the original
diff -r /tmp/restore-test /original/path
```
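To make the monthly test harder to forget, the same checks can run from a timer and fail loudly; a minimal sketch, assuming the repository credentials are already exported as in the disaster-recovery steps above:

```bash
#!/usr/bin/env bash
set -euo pipefail  # abort on the first failed check

# Structural check plus a random 10% read of pack data
restic check --read-data-subset=10%

# List the newest snapshot; restic exits non-zero on errors
# (snapshot freshness still needs a human glance or extra logic)
restic snapshots --latest 1
```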
Emergency Contacts & Resources
Critical Information
- Backblaze B2 Login: Store credentials in password manager
- restic Password: Store securely (CANNOT be recovered)
- Router Admin: Keep credentials accessible
- ISP Support: Keep contact info handy
### Documentation URLs
- Docker Swarm: https://docs.docker.com/engine/swarm/
- Traefik: https://doc.traefik.io/traefik/
- Restic: https://restic.readthedocs.io/
- ZFS: https://openzfs.github.io/openzfs-docs/
## Recovery Checklists

### Pre-Disaster Preparation

- [ ] Verify backups running daily
- [ ] Test restore procedure monthly
- [ ] Document all credentials
- [ ] Keep hardware spares (cables, drives)
- [ ] Maintain off-site config copies
### Post-Recovery Validation

- [ ] All nodes online: `docker node ls`
- [ ] All services running: `docker service ls`
- [ ] Health checks passing: `docker ps --filter health=healthy`
- [ ] DNS resolving correctly
- [ ] Monitoring active (Grafana accessible)
- [ ] Backups resumed: `systemctl status restic-backup.timer`
- [ ] fail2ban protecting: `fail2ban-client status`
- [ ] Network performance normal: `bash network_performance_test.sh`
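Several of these items can be scripted into a single pass/fail run; a minimal sketch covering the command-based checks (the timer and jail names are taken from the checklist above and may differ on your nodes):

```bash
#!/usr/bin/env bash
# Run each validation command and report pass/fail without aborting
fail=0
check() {
    if "$@" > /dev/null 2>&1; then
        echo "PASS: $*"
    else
        echo "FAIL: $*"
        fail=1
    fi
}

check docker node ls
check docker service ls
check systemctl is-active restic-backup.timer
check sudo fail2ban-client status
exit $fail
```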
## Automation for Faster Recovery

### Create Recovery USB Drive

```bash
# Copy all scripts and configs
mkdir /mnt/usb/homelab-recovery
cp -r /workspace/homelab/* /mnt/usb/homelab-recovery/

# Include documentation
cp /workspace/homelab/docs/guides/* /mnt/usb/homelab-recovery/docs/

# Store credentials (encrypted)
# Use GPG or similar to encrypt sensitive files (see the sketch below)
```
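One way to keep credentials on the drive without storing them in plaintext; a sketch using symmetric GPG encryption (the `secrets/` directory name is a placeholder):

```bash
# Encrypt: bundle the secrets directory and encrypt with a passphrase
tar czf - secrets/ | gpg --symmetric --cipher-algo AES256 \
    -o /mnt/usb/homelab-recovery/secrets.tar.gz.gpg

# Decrypt during recovery (prompts for the passphrase)
gpg --decrypt /mnt/usb/homelab-recovery/secrets.tar.gz.gpg | tar xzf -
```

The passphrase itself belongs in the password manager noted under Critical Information, not on the drive.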
### Quick Deploy Script

```bash
# Run from the recovery USB
sudo bash /mnt/usb/homelab-recovery/scripts/deploy_all.sh
```
*This guide should be reviewed and updated quarterly to ensure accuracy.*