# Disaster Recovery Guide

## Overview

This guide provides procedures for recovering from various failure scenarios in the homelab.
## Quick Recovery Matrix

| Scenario | Impact | Recovery Time | Procedure |
|----------|--------|---------------|-----------|
| Single node failure | Partial | < 5 min | [Node Failure](#node-failure) |
| Manager node down | Service disruption | < 10 min | [Manager Recovery](#manager-node-recovery) |
| Storage failure | Data risk | < 30 min | [Storage Recovery](#storage-failure) |
| Network outage | Complete | < 15 min | [Network Recovery](#network-recovery) |
| Complete disaster | Full rebuild | 2-4 hours | [Full Recovery](#complete-disaster-recovery) |

---
## Node Failure

### Symptoms

- Node unreachable via SSH
- Docker services not running on the node
- Swarm reports the node as "Down"

### Recovery Steps

1. **Verify node status**:

   ```bash
   docker node ls
   # Look for "Down" status
   ```

2. **Attempt to restart the node** (if accessible):

   ```bash
   ssh user@<node-ip>
   sudo reboot
   ```

3. **If the node is unrecoverable**, remove it (a verification sketch follows these steps):

   ```bash
   # Remove from Swarm
   docker node rm <node-id> --force

   # Services will automatically reschedule to healthy nodes
   ```
4. **Add replacement node**:

   ```bash
   # On manager node, get join token
   docker swarm join-token worker

   # On new node, join swarm
   docker swarm join --token <token> 192.168.1.196:2377
   ```
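After removing or replacing a node, confirm that the Swarm has converged before moving on. A minimal verification sketch (service names and replica counts will vary with your stacks):

```bash
# Remaining nodes should all report Ready / Active
docker node ls

# Every service should show its full replica count (e.g., 2/2)
docker service ls

# Watch until rescheduling settles if replicas are still moving
watch -n 5 'docker service ls'
```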
---

## Manager Node Recovery

### Symptoms

- Cannot access Portainer UI
- Swarm commands fail
- DNS services disrupted

### Recovery Steps

1. **Promote a worker to manager** (from another manager, if available):

   ```bash
   docker node promote <worker-node-id>
   ```

2. **Restore from backup**:

   ```bash
   # Stop Docker on the failed manager
   sudo systemctl stop docker

   # Restore Portainer data (copy the directory contents, not the directory itself)
   restic restore latest --target /tmp/restore
   sudo cp -r /tmp/restore/portainer/. /var/lib/docker/volumes/portainer/_data/

   # Start Docker
   sudo systemctl start docker
   ```
3. **Reconfigure DNS** (if Pi-hole is affected; see the check after these steps):

   ```bash
   # Temporarily point router DNS to another Pi-hole instance
   # Update router DNS to: 192.168.1.245, 192.168.1.62
   ```
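With the router repointed, confirm the fallback Pi-hole instances actually answer queries. A minimal check, assuming `dig` (from the dnsutils package) is installed:

```bash
# Query each fallback Pi-hole directly; both should return an address
dig @192.168.1.245 example.com +short
dig @192.168.1.62 example.com +short
```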
---

## Storage Failure

### ZFS Pool Failure

#### Symptoms

- `zpool status` shows DEGRADED or FAULTED
- I/O errors in logs

#### Recovery Steps

1. **Check pool status**:

   ```bash
   zpool status tank
   ```

2. **If a disk failed**, replace it (see the disk-identification sketch after these steps):

   ```bash
   # Replace failed disk
   zpool replace tank /dev/old-disk /dev/new-disk

   # Monitor resilver progress
   watch zpool status tank
   ```

3. **If the pool is destroyed**:

   ```bash
   # Recreate pool
   bash /workspace/homelab/scripts/zfs_setup.sh

   # Restore from backup
   restic restore latest --target /tank/docker
   ```
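Before running `zpool replace`, make sure the right physical drive is identified. A small sketch using standard ZFS and udev tooling; the pool name `tank` follows the steps above, and the serial-bearing device names are illustrative:

```bash
# Show which device is faulted, with per-device error counters
zpool status -v tank

# Map stable by-id names (which embed the drive serial number) to /dev entries
ls -l /dev/disk/by-id/ | grep -v part

# Prefer by-id paths in zpool replace to avoid /dev/sdX renumbering, e.g.:
# zpool replace tank /dev/disk/by-id/ata-<old-serial> /dev/disk/by-id/ata-<new-serial>
```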
### NAS Failure

#### Recovery Steps

1. **Check NAS connectivity**:

   ```bash
   ping 192.168.1.200
   mount | grep /mnt/nas
   ```

2. **Remount the NAS** (see the read/write check after these steps):

   ```bash
   sudo umount /mnt/nas
   sudo mount -a
   ```

3. **If the NAS hardware failed**:

   - Services using NAS volumes will fail
   - Redeploy services to use local storage temporarily
   - Restore the NAS from the Time Capsule backup
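After `mount -a` succeeds, confirm the share is actually usable. A minimal check; the mount point matches the paths above:

```bash
# Confirm the share is mounted with the expected size and free space
df -h /mnt/nas

# Verify the mount is writable, then clean up the marker file
touch /mnt/nas/.recovery-test && rm /mnt/nas/.recovery-test && echo "NAS is writable"
```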
---

## Network Recovery

### Complete Network Outage

#### Recovery Steps

1. **Check physical connections**:

   - Verify all cables are connected
   - Check switch power and status LEDs
   - Restart the switch

2. **Verify the router**:

   ```bash
   ping 192.168.1.1
   # If no response, restart the router
   ```

3. **Check VLAN configuration**:

   ```bash
   ip -d link show
   # Reapply if needed
   bash /workspace/homelab/scripts/vlan_firewall.sh
   ```

4. **Restart networking** (then run the reachability sweep after these steps):

   ```bash
   sudo systemctl restart networking
   # Or on each node:
   sudo reboot
   ```
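Once connectivity returns, a quick sweep confirms every host is reachable again. A sketch; the IP list reuses addresses that appear elsewhere in this guide, so substitute your own host list:

```bash
# Ping each host once and report up/down
for ip in 192.168.1.1 192.168.1.196 192.168.1.200 192.168.1.245 192.168.1.62; do
    ping -c 1 -W 2 "$ip" >/dev/null 2>&1 && echo "$ip up" || echo "$ip DOWN"
done
```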
### Partial Network Issues

#### DNS Not Resolving

```bash
# Check Pi-hole status
docker ps | grep pihole

# Restart Pi-hole
docker restart <pihole-container>

# Temporarily use public DNS (sudo must apply to the redirect, hence tee)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```
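To confirm resolution is back, query a Pi-hole directly and then through the system resolver (assuming `dig` is installed; addresses as listed earlier in this guide):

```bash
dig @192.168.1.245 example.com +short   # direct query to Pi-hole
dig example.com +short                  # via /etc/resolv.conf
```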
#### Traefik Not Routing

```bash
# Check Traefik service
docker service ls | grep traefik
docker service ps traefik_traefik

# Check logs
docker service logs traefik_traefik

# Force update
docker service update --force traefik_traefik
```
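After the forced update, a quick end-to-end check (the hostname is a placeholder, as in the verification steps later in this guide):

```bash
# Expect a 200 or a redirect status code once routing recovers
curl -sk -o /dev/null -w "%{http_code}\n" https://your-domain.com
```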
---

## Complete Disaster Recovery

### Scenario: Total Infrastructure Loss

#### Prerequisites

- Restic backups to Backblaze B2 (off-site)
- Replacement hardware available
- Network infrastructure functional

#### Recovery Steps

1. **Rebuild Core Infrastructure** (2-4 hours):

   ```bash
   # Install base OS on all nodes
   # Configure network (static IPs, hostnames)

   # Install Docker on all nodes
   curl -fsSL https://get.docker.com | sh
   sudo usermod -aG docker $USER

   # Initialize Swarm on manager
   docker swarm init --advertise-addr 192.168.1.196

   # Join workers
   docker swarm join-token worker  # Get token
   # Run on each worker with token
   ```
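   Joining each worker by hand is tedious; the join can also be pushed out over SSH. A sketch, assuming key-based SSH access (worker IPs and the `user` account are illustrative):

   ```bash
   # Capture just the worker join token from the manager
   TOKEN=$(docker swarm join-token worker -q)

   # Join each worker over SSH
   for ip in 192.168.1.197 192.168.1.198; do
       ssh "user@$ip" "docker swarm join --token $TOKEN 192.168.1.196:2377"
   done
   ```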
2. **Restore Storage**:

   ```bash
   # Recreate ZFS pool
   bash /workspace/homelab/scripts/zfs_setup.sh

   # Mount NAS
   # Follow: /workspace/homelab/docs/guides/NAS_Mount_Guide.md
   ```
3. **Restore from Backups**:

   ```bash
   # Install restic
   sudo apt-get install restic

   # Configure credentials
   export B2_ACCOUNT_ID="..."
   export B2_ACCOUNT_KEY="..."
   export RESTIC_REPOSITORY="b2:bucket:/backups"
   export RESTIC_PASSWORD="..."

   # List snapshots
   restic snapshots

   # Restore latest
   restic restore latest --target /tmp/restore

   # Copy to Docker volumes
   sudo cp -r /tmp/restore/* /var/lib/docker/volumes/
   ```
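   If anything looks off after the restore, the snapshot contents and repository health can be inspected directly (standard restic commands, using the credentials exported above):

   ```bash
   # List the contents of the latest snapshot
   restic ls latest | head -50

   # Check repository integrity (metadata only; fast)
   restic check
   ```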
4. **Redeploy Services**:

   ```bash
   # Deploy all stacks
   bash /workspace/homelab/scripts/deploy_all.sh

   # Verify deployment
   bash /workspace/homelab/scripts/validate_deployment.sh
   ```
5. **Verify Recovery**:

   - Check all services: `docker service ls`
   - Test Traefik routing: `curl https://your-domain.com`
   - Verify Portainer UI access
   - Check Grafana dashboards
   - Test Home Assistant
---

## Backup Verification

### Monthly Backup Test

```bash
# List snapshots
restic snapshots

# Verify a random subset of the repository data
restic check --read-data-subset=10%

# Test restore
mkdir -p /tmp/restore-test
restic restore <snapshot-id> --target /tmp/restore-test --include /path/to/critical/file

# Compare with original
diff -r /tmp/restore-test /original/path
```
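The same test can be wrapped in a small script and scheduled monthly. A sketch; the log path is illustrative, and restic credentials are assumed to be exported as in the recovery steps above:

```bash
#!/usr/bin/env bash
# Monthly backup drill: record snapshot state and verify 10% of pack data
set -euo pipefail

LOG=/var/log/restic-verify.log

{
    date
    restic snapshots --latest 1
    restic check --read-data-subset=10%
} >> "$LOG" 2>&1
```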
---

## Emergency Contacts & Resources

### Critical Information

- **Backblaze B2 Login**: Store credentials in a password manager
- **restic Password**: Store securely (it CANNOT be recovered)
- **Router Admin**: Keep credentials accessible
- **ISP Support**: Keep contact info handy

### Documentation URLs

- Docker Swarm: https://docs.docker.com/engine/swarm/
- Traefik: https://doc.traefik.io/traefik/
- Restic: https://restic.readthedocs.io/
- ZFS: https://openzfs.github.io/openzfs-docs/
---

## Recovery Checklists

### Pre-Disaster Preparation

- [ ] Verify backups are running daily (see the freshness check below)
- [ ] Test the restore procedure monthly
- [ ] Document all credentials
- [ ] Keep hardware spares (cables, drives)
- [ ] Maintain off-site config copies
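Backup freshness for the first item can be checked mechanically. A sketch, assuming restic credentials are exported, `jq` is installed, and GNU `date` is available to parse the timestamp:

```bash
# Warn if the newest snapshot is older than ~25 hours
LATEST=$(restic snapshots --json | jq -r 'max_by(.time).time')
AGE=$(( $(date +%s) - $(date -d "$LATEST" +%s) ))
(( AGE < 90000 )) && echo "Backups are fresh" || echo "WARNING: last snapshot is ${AGE}s old"
```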
### Post-Recovery Validation

- [ ] All nodes online: `docker node ls`
- [ ] All services running: `docker service ls`
- [ ] Health checks passing: `docker ps --filter health=healthy`
- [ ] DNS resolving correctly
- [ ] Monitoring active (Grafana accessible)
- [ ] Backups resumed: `systemctl status restic-backup.timer`
- [ ] fail2ban protecting: `fail2ban-client status`
- [ ] Network performance normal: `bash network_performance_test.sh`
---

## Automation for Faster Recovery

### Create Recovery USB Drive

```bash
# Copy all scripts and configs
mkdir -p /mnt/usb/homelab-recovery
cp -r /workspace/homelab/* /mnt/usb/homelab-recovery/

# Include documentation
cp /workspace/homelab/docs/guides/* /mnt/usb/homelab-recovery/docs/

# Store credentials (encrypted)
# Use GPG or similar to encrypt sensitive files (see the sketch below)
```
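Symmetric GPG encryption is one straightforward option for the credentials step (a sketch; the filename is illustrative):

```bash
# Encrypt with a passphrase (decrypt later with: gpg -d credentials.txt.gpg)
gpg --symmetric --cipher-algo AES256 credentials.txt

# Move the encrypted copy to the USB drive and delete the plaintext
# (shred is best-effort on journaling filesystems)
mv credentials.txt.gpg /mnt/usb/homelab-recovery/
shred -u credentials.txt
```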
### Quick Deploy Script

```bash
# Run from recovery USB
sudo bash /mnt/usb/homelab-recovery/scripts/deploy_all.sh
```
---

This guide should be reviewed and updated quarterly to ensure accuracy.