# Disaster Recovery Guide

## Overview

This guide provides procedures for recovering from various failure scenarios in the homelab.

## Quick Recovery Matrix

| Scenario | Impact | Recovery Time | Procedure |
|----------|--------|---------------|-----------|
| Single node failure | Partial | < 5 min | [Node Failure](#node-failure) |
| Manager node down | Service disruption | < 10 min | [Manager Recovery](#manager-node-recovery) |
| Storage failure | Data risk | < 30 min | [Storage Recovery](#storage-failure) |
| Network outage | Complete | < 15 min | [Network Recovery](#network-recovery) |
| Complete disaster | Full rebuild | 2-4+ hours | [Full Recovery](#complete-disaster-recovery) |

---

## Node Failure

### Symptoms
- Node unreachable via SSH
- Docker services not running on the node
- Swarm reports the node as "Down"

### Recovery Steps

1. **Verify node status**:
   ```bash
   docker node ls
   # Look for "Down" status
   ```

2. **Attempt to restart the node** (if accessible):
   ```bash
   ssh user@<node-ip>
   sudo reboot
   ```

3. **If the node is unrecoverable**:
   ```bash
   # Remove it from the Swarm
   docker node rm <node-id> --force

   # Services will automatically reschedule to healthy nodes
   ```

4. **Add a replacement node**:
   ```bash
   # On a manager node, get the join token
   docker swarm join-token worker

   # On the new node, join the swarm
   docker swarm join --token <token> 192.168.1.196:2377
   ```

---

## Manager Node Recovery

### Symptoms
- Cannot access the Portainer UI
- Swarm commands fail
- DNS services disrupted

### Recovery Steps

1. **Promote a worker to manager** (from another manager, if available):
   ```bash
   docker node promote <worker-node-id>
   ```

2. **Restore from backup**:
   ```bash
   # Stop Docker on the failed manager
   sudo systemctl stop docker

   # Restore Portainer data (copy the directory contents, not the directory itself)
   restic restore latest --target /tmp/restore
   sudo cp -a /tmp/restore/portainer/. /var/lib/docker/volumes/portainer/_data/

   # Start Docker
   sudo systemctl start docker
   ```

3. **Reconfigure DNS** (if Pi-hole is affected):
   ```bash
   # Temporarily point the router's DNS at the surviving Pi-hole instances
   # Update router DNS to: 192.168.1.245, 192.168.1.62
   ```
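
   Before repointing the router, it is worth confirming that the fallback resolvers actually answer. A minimal check, assuming `dig` is installed (the IPs are the Pi-hole instances listed above):

   ```bash
   # Each query should return an answer promptly if the Pi-hole is healthy
   dig @192.168.1.245 example.com +short
   dig @192.168.1.62 example.com +short
   ```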

---

## Storage Failure

### ZFS Pool Failure

#### Symptoms
- `zpool status` shows DEGRADED or FAULTED
- I/O errors in logs

#### Recovery Steps

1. **Check pool status**:
   ```bash
   zpool status tank
   ```

2. **If a disk failed**:
   ```bash
   # Replace the failed disk
   zpool replace tank /dev/old-disk /dev/new-disk

   # Monitor resilver progress
   watch zpool status tank
   ```

3. **If the pool is destroyed**:
   ```bash
   # Recreate the pool
   bash /workspace/homelab/scripts/zfs_setup.sh

   # Restore from backup
   restic restore latest --target /tank/docker
   ```

### NAS Failure

#### Recovery Steps

1. **Check NAS connectivity**:
   ```bash
   ping 192.168.1.200
   mount | grep /mnt/nas
   ```

2. **Remount the NAS**:
   ```bash
   sudo umount /mnt/nas
   sudo mount -a
   ```

3. **If the NAS hardware failed**:
   - Services using NAS volumes will fail
   - Redeploy affected services to local storage temporarily (see the sketch below)
   - Restore the NAS from its Time Capsule backup
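
   One way to fail over temporarily is to swap a service's NAS-backed mount for local storage in place. A sketch only; the service name, the `/data` mount target, and the local path are hypothetical placeholders:

   ```bash
   # Replace the NAS-backed mount with a local bind mount until the NAS returns
   docker service update \
     --mount-rm /data \
     --mount-add type=bind,source=/tank/docker/app-data,target=/data \
     <service-name>
   ```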

---

## Network Recovery

### Complete Network Outage

#### Recovery Steps

1. **Check physical connections**:
   - Verify all cables are connected
   - Check switch power and status LEDs
   - Restart the switch

2. **Verify the router**:
   ```bash
   ping 192.168.1.1
   # If no response, restart the router
   ```

3. **Check VLAN configuration**:
   ```bash
   ip -d link show
   # Reapply if needed
   bash /workspace/homelab/scripts/vlan_firewall.sh
   ```

4. **Restart networking**:
   ```bash
   sudo systemctl restart networking
   # Or on each node:
   sudo reboot
   ```

### Partial Network Issues

#### DNS Not Resolving

```bash
# Check Pi-hole status
docker ps | grep pihole

# Restart Pi-hole
docker restart <pihole-container>

# Temporarily use public DNS
# (note: `sudo echo ... > /etc/resolv.conf` would fail, because the
# redirect runs as the unprivileged user; tee runs under sudo instead)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```

#### Traefik Not Routing

```bash
# Check the Traefik service
docker service ls | grep traefik
docker service ps traefik_traefik

# Check logs
docker service logs traefik_traefik

# Force update
docker service update --force traefik_traefik
```

---

## Complete Disaster Recovery

### Scenario: Total Infrastructure Loss

#### Prerequisites
- Restic backups to Backblaze B2 (off-site)
- Replacement hardware available
- Network infrastructure functional

#### Recovery Steps

1. **Rebuild Core Infrastructure** (2-4 hours):

   ```bash
   # Install the base OS on all nodes
   # Configure the network (static IPs, hostnames)

   # Install Docker on all nodes
   curl -fsSL https://get.docker.com | sh
   sudo usermod -aG docker $USER

   # Initialize the Swarm on the manager
   docker swarm init --advertise-addr 192.168.1.196

   # Join workers
   docker swarm join-token worker  # Get the token
   # Run the join command on each worker with that token (see the loop sketch below)
   ```
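
   Joining several workers by hand is tedious, so the join can be scripted from the manager. A sketch assuming passwordless SSH; the worker IPs and the `user` login are placeholders:

   ```bash
   # Fetch the worker join token, then run the join on each worker over SSH
   TOKEN=$(docker swarm join-token -q worker)
   for ip in 192.168.1.197 192.168.1.198; do  # placeholder worker IPs
     ssh user@"$ip" "docker swarm join --token $TOKEN 192.168.1.196:2377"
   done
   ```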

2. **Restore Storage**:

   ```bash
   # Recreate the ZFS pool
   bash /workspace/homelab/scripts/zfs_setup.sh

   # Mount the NAS
   # Follow: /workspace/homelab/docs/guides/NAS_Mount_Guide.md
   ```

3. **Restore from Backups**:

   ```bash
   # Install restic
   sudo apt-get install restic

   # Configure credentials
   export B2_ACCOUNT_ID="..."
   export B2_ACCOUNT_KEY="..."
   export RESTIC_REPOSITORY="b2:bucket:/backups"
   export RESTIC_PASSWORD="..."

   # List snapshots
   restic snapshots

   # Restore the latest snapshot
   restic restore latest --target /tmp/restore

   # Copy to Docker volumes (stop any services still using them first)
   sudo cp -r /tmp/restore/* /var/lib/docker/volumes/
   ```

4. **Redeploy Services**:

   ```bash
   # Deploy all stacks
   bash /workspace/homelab/scripts/deploy_all.sh

   # Verify deployment
   bash /workspace/homelab/scripts/validate_deployment.sh
   ```

5. **Verify Recovery**:

   - Check all services: `docker service ls`
   - Test Traefik routing: `curl https://your-domain.com`
   - Verify Portainer UI access
   - Check Grafana dashboards
   - Test Home Assistant
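
   Scanning `docker service ls` by eye is easy to get wrong after a full rebuild. A minimal sketch that flags any service whose running replica count is below its desired count:

   ```bash
   # Print any Swarm service that is missing replicas
   docker service ls --format '{{.Name}} {{.Replicas}}' | while read -r name replicas _; do
     running=${replicas%%/*}
     desired=${replicas##*/}
     if [ "$running" != "$desired" ]; then
       echo "DEGRADED: $name ($replicas)"
     fi
   done
   ```

   Anything it prints deserves a `docker service ps <name>` to see why tasks are not starting.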

---

## Backup Verification

### Monthly Backup Test

```bash
# List snapshots
restic snapshots

# Verify repository integrity by reading a random 10% of the data
restic check --read-data-subset=10%

# Test restore
mkdir -p /tmp/restore-test
restic restore <snapshot-id> --target /tmp/restore-test --include /path/to/critical/file

# Compare with the original
diff -r /tmp/restore-test /original/path
```

---

## Emergency Contacts & Resources

### Critical Information
- **Backblaze B2 Login**: Store credentials in a password manager
- **restic Password**: Store securely (it CANNOT be recovered)
- **Router Admin**: Keep credentials accessible
- **ISP Support**: Keep contact info handy

### Documentation URLs
- Docker Swarm: https://docs.docker.com/engine/swarm/
- Traefik: https://doc.traefik.io/traefik/
- Restic: https://restic.readthedocs.io/
- ZFS: https://openzfs.github.io/openzfs-docs/

---

## Recovery Checklists

### Pre-Disaster Preparation
- [ ] Verify backups are running daily (see the spot-check below)
- [ ] Test the restore procedure monthly
- [ ] Document all credentials
- [ ] Keep hardware spares (cables, drives)
- [ ] Maintain off-site config copies
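
One way to spot-check the first item, assuming the `restic-backup.timer` unit named in the post-recovery checklist and the restic environment variables from the restore section:

```bash
# Confirm the backup timer fired recently and is scheduled to fire again
systemctl list-timers restic-backup.timer

# Confirm the newest snapshot is actually recent
restic snapshots --latest 1
```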

### Post-Recovery Validation
- [ ] All nodes online: `docker node ls`
- [ ] All services running: `docker service ls`
- [ ] Health checks passing: `docker ps --filter health=healthy`
- [ ] DNS resolving correctly
- [ ] Monitoring active (Grafana accessible)
- [ ] Backups resumed: `systemctl status restic-backup.timer`
- [ ] fail2ban protecting: `fail2ban-client status`
- [ ] Network performance normal: `bash network_performance_test.sh`

---

## Automation for Faster Recovery

### Create Recovery USB Drive

```bash
# Copy all scripts and configs
mkdir -p /mnt/usb/homelab-recovery/docs
cp -r /workspace/homelab/* /mnt/usb/homelab-recovery/

# Include documentation
cp /workspace/homelab/docs/guides/* /mnt/usb/homelab-recovery/docs/

# Store credentials (encrypted)
# Use GPG or similar to encrypt sensitive files
```
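
For the encrypted credentials, symmetric GPG is one straightforward option. A sketch; the `credentials.env` filename is a placeholder:

```bash
# Encrypt a credentials file onto the USB drive (prompts for a passphrase)
gpg --symmetric --cipher-algo AES256 \
    --output /mnt/usb/homelab-recovery/credentials.env.gpg \
    ~/credentials.env

# To recover it later:
# gpg --decrypt /mnt/usb/homelab-recovery/credentials.env.gpg > credentials.env
```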

### Quick Deploy Script

```bash
# Run from the recovery USB
sudo bash /mnt/usb/homelab-recovery/scripts/deploy_all.sh
```

---

This guide should be reviewed and updated quarterly to ensure accuracy.