# Disaster Recovery Guide
## Overview
This guide provides procedures for recovering from various failure scenarios in the homelab.
## Quick Recovery Matrix
| Scenario | Impact | Recovery Time | Procedure |
|----------|--------|---------------|-----------|
| Single node failure | Partial | < 5 min | [Node Failure](#node-failure) |
| Manager node down | Service disruption | < 10 min | [Manager Recovery](#manager-node-recovery) |
| Storage failure | Data risk | < 30 min | [Storage Recovery](#storage-failure) |
| Network outage | Complete | < 15 min | [Network Recovery](#network-recovery) |
| Complete disaster | Full rebuild | 2-4 hours | [Full Recovery](#complete-disaster-recovery) |
---
## Node Failure
### Symptoms
- Node unreachable via SSH
- Docker services not running on node
- Swarm reports node as "Down"
### Recovery Steps
1. **Verify node status**:
```bash
docker node ls
# Look for "Down" status
```
2. **Attempt to restart node** (if accessible):
```bash
ssh user@<node-ip>
sudo reboot
```
3. **If node is unrecoverable**:
```bash
# Remove from Swarm
docker node rm <node-id> --force
# Services will automatically reschedule to healthy nodes; see the verification sketch below
```
4. **Add replacement node**:
```bash
# On manager node, get join token
docker swarm join-token worker
# On new node, join swarm
docker swarm join --token <token> 192.168.1.196:2377
```
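After the replacement joins, confirm that tasks actually landed back on healthy nodes, and re-apply any placement labels the old node carried; services constrained to a missing label will sit in `Pending`. The `storage=zfs` label below is only an illustration of the pattern, not a label this setup necessarily uses:
```bash
# Confirm tasks are running and see where they were placed (per service)
docker service ps --filter "desired-state=running" <service-name>

# Re-apply placement labels on the replacement node (hypothetical example label)
docker node update --label-add storage=zfs <new-node-id>
```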
---
## Manager Node Recovery
### Symptoms
- Cannot access Portainer UI
- Swarm commands fail
- DNS services disrupted
### Recovery Steps
1. **Promote a worker to manager** (from another manager if available):
```bash
docker node promote <worker-node-id>
```
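If no healthy manager is left to run the promotion (quorum lost), promotion is impossible; Docker's documented last resort is to re-initialize the cluster from a node that still has Swarm state on disk. This discards the old manager set, so treat it strictly as a fallback:
```bash
# Run on a surviving node whose /var/lib/docker/swarm directory is intact;
# this rebuilds a single-manager cluster from the local state
docker swarm init --force-new-cluster --advertise-addr <node-ip>
```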
2. **Restore from backup**:
```bash
# Stop Docker on failed manager
sudo systemctl stop docker
# Restore Portainer data
restic restore latest --target /tmp/restore
sudo cp -r /tmp/restore/portainer/. /var/lib/docker/volumes/portainer/_data/
# Start Docker
sudo systemctl start docker
```
3. **Reconfigure DNS** (if Pi-hole affected):
```bash
# Temporarily point router DNS to another Pi-hole instance
# Update router DNS to: 192.168.1.245, 192.168.1.62
```
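Before repointing the router, confirm the fallback Pi-hole instances actually answer queries; `dig` ships in the dnsutils package on Debian/Ubuntu:
```bash
# Each command should return an answer, not a timeout or SERVFAIL
dig @192.168.1.245 example.com +short
dig @192.168.1.62 example.com +short
```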
---
## Storage Failure
### ZFS Pool Failure
#### Symptoms
- `zpool status` shows DEGRADED or FAULTED
- I/O errors in logs
#### Recovery Steps
1. **Check pool status**:
```bash
zpool status tank
```
2. **If disk failed**:
```bash
# Replace failed disk
zpool replace tank /dev/old-disk /dev/new-disk
# Monitor resilver progress
watch zpool status tank
```
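Kernel device names like `/dev/sdX` can shuffle between boots, so it is safer to hand `zpool replace` a persistent identifier; a quick way to map them:
```bash
# Map stable identifiers to kernel names before replacing
ls -l /dev/disk/by-id/ | grep -v part
# Then reference the by-id path in the replace, e.g.:
# zpool replace tank /dev/old-disk /dev/disk/by-id/ata-<model>_<serial>
```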
3. **If pool is destroyed**:
```bash
# Recreate pool
bash /workspace/homelab/scripts/zfs_setup.sh
# Restore from backup
restic restore latest --target /tank/docker
```
### NAS Failure
#### Recovery Steps
1. **Check NAS connectivity**:
```bash
ping 192.168.1.200
mount | grep /mnt/nas
```
2. **Remount NAS**:
```bash
sudo umount /mnt/nas
sudo mount -a
```
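If the remount fails, check that the share is still being exported before suspecting the client side. This sketch assumes the NAS exports over NFS (`showmount` comes with nfs-common on Debian/Ubuntu); adjust for SMB if that is what you mount:
```bash
# List exports the NAS is offering
showmount -e 192.168.1.200
# Confirm the client's fstab entry survived
grep /mnt/nas /etc/fstab
```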
3. **If NAS hardware failed**:
- Services using NAS volumes will fail
- Redeploy services to use local storage temporarily
- Restore NAS from Time Capsule backup
---
## Network Recovery
### Complete Network Outage
#### Recovery Steps
1. **Check physical connections**:
- Verify all cables connected
- Check switch power and status LEDs
- Restart switch
2. **Verify router**:
```bash
ping 192.168.1.1
# If no response, restart router
```
3. **Check VLAN configuration**:
```bash
ip -d link show
# Reapply if needed
bash /workspace/homelab/scripts/vlan_firewall.sh
```
4. **Restart networking**:
```bash
sudo systemctl restart networking
# Or on each node:
sudo reboot
```
### Partial Network Issues
#### DNS Not Resolving
```bash
# Check Pi-hole status
docker ps | grep pihole
# Restart Pi-hole
docker restart <pihole-container>
# Temporarily use public DNS
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```
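After the restart, verify Pi-hole answers locally before reverting `/etc/resolv.conf` (this assumes port 53 is published on the host):
```bash
# A quick answer here means Pi-hole is resolving again
dig @127.0.0.1 example.com +short
```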
#### Traefik Not Routing
```bash
# Check Traefik service
docker service ls | grep traefik
docker service ps traefik_traefik
# Check logs
docker service logs traefik_traefik
# Force update
docker service update --force traefik_traefik
```
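If the service is running but routes still fail, and the dashboard/API is enabled (`api.insecure=true` in this sketch; 8080 is Traefik's default API port and may differ in your setup), the router list shows whether Traefik picked up the service labels at all:
```bash
# List routers Traefik has loaded; missing entries point at label/config problems
curl -s http://localhost:8080/api/http/routers | head
```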
---
## Complete Disaster Recovery
### Scenario: Total Infrastructure Loss
#### Prerequisites
- Restic backups to Backblaze B2 (off-site)
- Hardware replacement available
- Network infrastructure functional
#### Recovery Steps
1. **Rebuild Core Infrastructure** (the bulk of the 2-4 hour window):
```bash
# Install base OS on all nodes
# Configure network (static IPs, hostnames)
# Install Docker on all nodes
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER
# Initialize Swarm on manager
docker swarm init --advertise-addr 192.168.1.196
# Join workers
docker swarm join-token worker # Get token
# Run on each worker with token
```
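With several workers to rejoin, a small loop over SSH saves repetition; a sketch assuming passwordless SSH, with placeholder worker addresses:
```bash
# Fetch the worker join token quietly, then join each worker over SSH
TOKEN=$(docker swarm join-token -q worker)
for ip in <worker-ip-1> <worker-ip-2>; do
  ssh user@"$ip" "docker swarm join --token $TOKEN 192.168.1.196:2377"
done
```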
2. **Restore Storage**:
```bash
# Recreate ZFS pool
bash /workspace/homelab/scripts/zfs_setup.sh
# Mount NAS
# Follow: /workspace/homelab/docs/guides/NAS_Mount_Guide.md
```
3. **Restore from Backups**:
```bash
# Install restic
sudo apt-get install restic
# Configure credentials
export B2_ACCOUNT_ID="..."
export B2_ACCOUNT_KEY="..."
export RESTIC_REPOSITORY="b2:bucket:/backups"
export RESTIC_PASSWORD="..."
# List snapshots
restic snapshots
# Restore latest
restic restore latest --target /tmp/restore
# Copy to Docker volumes
sudo cp -r /tmp/restore/* /var/lib/docker/volumes/
```
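Before redeploying, sanity-check the restored volumes; empty or unexpectedly small directories here usually mean the wrong snapshot, or a path mismatch between the backup layout and `/var/lib/docker/volumes/`:
```bash
# Rough per-volume size check; compare against what the old cluster held
sudo du -sh /var/lib/docker/volumes/*/ | sort -h
```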
4. **Redeploy Services**:
```bash
# Deploy all stacks
bash /workspace/homelab/scripts/deploy_all.sh
# Verify deployment
bash /workspace/homelab/scripts/validate_deployment.sh
```
5. **Verify Recovery**:
- Check all services: `docker service ls`
- Test Traefik routing: `curl https://your-domain.com`
- Verify Portainer UI access
- Check Grafana dashboards
- Test Home Assistant
---
## Backup Verification
### Monthly Backup Test
```bash
# List snapshots
restic snapshots
# Verify repository integrity (reads a random 10% of the pack data)
restic check --read-data-subset=10%
# Test restore
mkdir /tmp/restore-test
restic restore <snapshot-id> --target /tmp/restore-test --include /path/to/critical/file
# Compare with the original (restic recreates the full source path under the target)
diff -r /tmp/restore-test/path/to/critical/file /path/to/critical/file
```
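To keep this test from being forgotten, the integrity check can be scheduled. A minimal cron sketch; `/usr/local/bin/restic-env.sh` is a hypothetical wrapper that exports the B2/RESTIC_* variables before calling restic:
```bash
# /etc/cron.d/restic-check — monthly 10% read check, 03:00 on the 1st
# restic-env.sh is a hypothetical wrapper exporting the restic credentials
0 3 1 * * root /usr/local/bin/restic-env.sh check --read-data-subset=10% >> /var/log/restic-check.log 2>&1
```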
---
## Emergency Contacts & Resources
### Critical Information
- **Backblaze B2 Login**: Store credentials in password manager
- **restic Password**: Store securely (CANNOT be recovered)
- **Router Admin**: Keep credentials accessible
- **ISP Support**: Keep contact info handy
### Documentation URLs
- Docker Swarm: https://docs.docker.com/engine/swarm/
- Traefik: https://doc.traefik.io/traefik/
- Restic: https://restic.readthedocs.io/
- ZFS: https://openzfs.github.io/openzfs-docs/
---
## Recovery Checklists
### Pre-Disaster Preparation
- [ ] Verify backups running daily
- [ ] Test restore procedure monthly
- [ ] Document all credentials
- [ ] Keep hardware spares (cables, drives)
- [ ] Maintain off-site config copies
### Post-Recovery Validation
- [ ] All nodes online: `docker node ls`
- [ ] All services running: `docker service ls`
- [ ] Health checks passing: `docker ps --filter health=healthy`
- [ ] DNS resolving correctly
- [ ] Monitoring active (Grafana accessible)
- [ ] Backups resumed: `systemctl status restic-backup.timer`
- [ ] fail2ban protecting: `fail2ban-client status`
- [ ] Network performance normal: `bash network_performance_test.sh`
---
## Automation for Faster Recovery
### Create Recovery USB Drive
```bash
# Copy all scripts, configs, and documentation
mkdir -p /mnt/usb/homelab-recovery
cp -r /workspace/homelab/* /mnt/usb/homelab-recovery/
# Store credentials (encrypted)
# Use GPG or similar to encrypt sensitive files
```
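For the credentials file itself, a symmetric GPG archive keeps it usable from any machine with a passphrase; `credentials.txt` is a stand-in for whatever file holds your secrets:
```bash
# Encrypt with a passphrase (prompted); writes credentials.txt.gpg alongside
gpg --symmetric --cipher-algo AES256 /mnt/usb/homelab-recovery/credentials.txt
# Securely remove the plaintext copy
shred -u /mnt/usb/homelab-recovery/credentials.txt
```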
### Quick Deploy Script
```bash
# Run from recovery USB
sudo bash /mnt/usb/homelab-recovery/scripts/deploy_all.sh
```
---
This guide should be reviewed and updated quarterly to ensure accuracy.