# Disaster Recovery Guide

## Overview

This guide provides procedures for recovering from various failure scenarios in the homelab.

## Quick Recovery Matrix

| Scenario | Impact | Recovery Time | Procedure |
|----------|--------|---------------|-----------|
| Single node failure | Partial | < 5 min | [Node Failure](#node-failure) |
| Manager node down | Service disruption | < 10 min | [Manager Recovery](#manager-node-recovery) |
| Storage failure | Data risk | < 30 min | [Storage Recovery](#storage-failure) |
| Network outage | Complete | < 15 min | [Network Recovery](#network-recovery) |
| Complete disaster | Full rebuild | 2-4 hours | [Full Recovery](#complete-disaster-recovery) |

---

## Node Failure

### Symptoms

- Node unreachable via SSH
- Docker services not running on the node
- Swarm reports the node as "Down"

### Recovery Steps

1. **Verify node status**:

   ```bash
   docker node ls
   # Look for "Down" status
   ```

2. **Attempt to restart the node** (if accessible):

   ```bash
   ssh user@<node-ip>
   sudo reboot
   ```

3. **If the node is unrecoverable**:

   ```bash
   # Remove it from the Swarm
   docker node rm --force <node-name>

   # Services will automatically reschedule to healthy nodes
   ```

4. **Add a replacement node**:

   ```bash
   # On a manager node, get the join token
   docker swarm join-token worker

   # On the new node, join the swarm
   docker swarm join --token <worker-token> 192.168.1.196:2377
   ```

---

## Manager Node Recovery

### Symptoms

- Cannot access the Portainer UI
- Swarm commands fail
- DNS services disrupted

### Recovery Steps

1. **Promote a worker to manager** (from another manager, if available):

   ```bash
   docker node promote <node-name>
   ```

2. **Restore from backup**:

   ```bash
   # Stop Docker on the failed manager
   sudo systemctl stop docker

   # Restore Portainer data
   restic restore latest --target /tmp/restore
   sudo cp -r /tmp/restore/portainer /var/lib/docker/volumes/portainer/_data/

   # Start Docker
   sudo systemctl start docker
   ```
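When restoring Portainer data as in step 2, a small guard helps avoid overwriting a live volume with an empty or failed restore. This is a sketch: the `restore_ok` helper is not part of the existing scripts, and the commented usage reuses the paths from the step above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Returns 0 if the directory exists and contains at least one entry,
# i.e. the restic restore actually produced files.
restore_ok() {
    local dir="$1"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Illustrative usage in the restore step:
#   restic restore latest --target /tmp/restore
#   if restore_ok /tmp/restore/portainer; then
#       sudo cp -r /tmp/restore/portainer /var/lib/docker/volumes/portainer/_data/
#   else
#       echo "restore directory missing or empty; aborting copy" >&2
#       exit 1
#   fi
```

The check is deliberately crude (non-empty directory); a stricter variant could compare file counts against the snapshot listing.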
3. **Reconfigure DNS** (if Pi-hole is affected):

   ```bash
   # Temporarily point router DNS to another Pi-hole instance
   # Update router DNS to: 192.168.1.245, 192.168.1.62
   ```

---

## Storage Failure

### ZFS Pool Failure

#### Symptoms

- `zpool status` shows DEGRADED or FAULTED
- I/O errors in logs

#### Recovery Steps

1. **Check pool status**:

   ```bash
   zpool status tank
   ```

2. **If a disk failed**:

   ```bash
   # Replace the failed disk
   zpool replace tank /dev/old-disk /dev/new-disk

   # Monitor resilver progress
   watch zpool status tank
   ```

3. **If the pool is destroyed**:

   ```bash
   # Recreate the pool
   bash /workspace/homelab/scripts/zfs_setup.sh

   # Restore from backup
   restic restore latest --target /tank/docker
   ```

### NAS Failure

#### Recovery Steps

1. **Check NAS connectivity**:

   ```bash
   ping 192.168.1.200
   mount | grep /mnt/nas
   ```

2. **Remount the NAS**:

   ```bash
   sudo umount /mnt/nas
   sudo mount -a
   ```

3. **If NAS hardware failed**:

   - Services using NAS volumes will fail
   - Redeploy services to use local storage temporarily
   - Restore the NAS from the Time Capsule backup

---

## Network Recovery

### Complete Network Outage

#### Recovery Steps

1. **Check physical connections**:

   - Verify all cables are connected
   - Check switch power and status LEDs
   - Restart the switch

2. **Verify the router**:

   ```bash
   ping 192.168.1.1
   # If no response, restart the router
   ```

3. **Check VLAN configuration**:

   ```bash
   ip -d link show

   # Reapply if needed
   bash /workspace/homelab/scripts/vlan_firewall.sh
   ```
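The reachability checks above can be strung into a quick triage. This is a sketch: the `triage` helper is hypothetical, and the gateway and Pi-hole addresses in the commented wiring (192.168.1.1, 192.168.1.245) are the ones used elsewhere in this guide.

```shell
#!/usr/bin/env bash
# Quick network triage: classify where the outage most likely is,
# using ping exit codes as boolean reachability signals.
set -uo pipefail

# Maps reachability of gateway and DNS host to a triage hint.
# Args: <gateway_status> <dns_status>, each 0 (reachable) or non-zero.
triage() {
    local gw="$1" dns="$2"
    if [ "$gw" -ne 0 ]; then
        echo "gateway unreachable: check cables, switch, router"
    elif [ "$dns" -ne 0 ]; then
        echo "gateway up but DNS host down: check Pi-hole"
    else
        echo "L3 connectivity looks fine: check VLANs and services"
    fi
}

# Illustrative wiring:
#   ping -c1 -W2 192.168.1.1 >/dev/null; gw=$?
#   ping -c1 -W2 192.168.1.245 >/dev/null; dns=$?
#   triage "$gw" "$dns"
```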
4. **Restart networking**:

   ```bash
   sudo systemctl restart networking

   # Or on each node:
   sudo reboot
   ```

### Partial Network Issues

#### DNS Not Resolving

```bash
# Check Pi-hole status
docker ps | grep pihole

# Restart Pi-hole
docker restart <pihole-container>

# Temporarily use public DNS
# (note: `sudo echo ... > /etc/resolv.conf` would fail because the
# redirection runs as the unprivileged user, so use tee)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```

#### Traefik Not Routing

```bash
# Check the Traefik service
docker service ls | grep traefik
docker service ps traefik_traefik

# Check logs
docker service logs traefik_traefik

# Force an update
docker service update --force traefik_traefik
```

---

## Complete Disaster Recovery

### Scenario: Total Infrastructure Loss

#### Prerequisites

- Restic backups to Backblaze B2 (off-site)
- Replacement hardware available
- Network infrastructure functional

#### Recovery Steps

1. **Rebuild core infrastructure** (2-4 hours):

   ```bash
   # Install the base OS on all nodes
   # Configure the network (static IPs, hostnames)

   # Install Docker on all nodes
   curl -fsSL https://get.docker.com | sh
   sudo usermod -aG docker $USER

   # Initialize the Swarm on the manager
   docker swarm init --advertise-addr 192.168.1.196

   # Join the workers
   docker swarm join-token worker  # Get the token
   # Run the printed join command on each worker
   ```

2. **Restore storage**:

   ```bash
   # Recreate the ZFS pool
   bash /workspace/homelab/scripts/zfs_setup.sh

   # Mount the NAS
   # Follow: /workspace/homelab/docs/guides/NAS_Mount_Guide.md
   ```

3. **Restore from backups**:

   ```bash
   # Install restic
   sudo apt-get install restic

   # Configure credentials
   export B2_ACCOUNT_ID="..."
   export B2_ACCOUNT_KEY="..."
   export RESTIC_REPOSITORY="b2:bucket:/backups"
   export RESTIC_PASSWORD="..."

   # List snapshots
   restic snapshots

   # Restore the latest snapshot
   restic restore latest --target /tmp/restore

   # Copy to Docker volumes
   sudo cp -r /tmp/restore/* /var/lib/docker/volumes/
   ```

4. **Redeploy services**:

   ```bash
   # Deploy all stacks
   bash /workspace/homelab/scripts/deploy_all.sh

   # Verify the deployment
   bash /workspace/homelab/scripts/validate_deployment.sh
   ```
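The restore in step 3 fails partway through if any of the four restic/B2 variables is unset. A pre-flight check like the following fails fast instead; the `check_restic_env` name is illustrative, not part of the existing scripts.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Prints any required restic/B2 variables that are unset or empty,
# and returns non-zero if anything is missing.
check_restic_env() {
    local missing=0 var
    for var in B2_ACCOUNT_ID B2_ACCOUNT_KEY RESTIC_REPOSITORY RESTIC_PASSWORD; do
        # ${!var} is bash indirect expansion: the value of the variable
        # whose name is stored in $var.
        if [ -z "${!var:-}" ]; then
            echo "missing: $var"
            missing=1
        fi
    done
    return "$missing"
}

# Run before any restic command, e.g.:
#   check_restic_env || exit 1
```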
5. **Verify recovery**:

   - Check all services: `docker service ls`
   - Test Traefik routing: `curl https://your-domain.com`
   - Verify Portainer UI access
   - Check Grafana dashboards
   - Test Home Assistant

---

## Backup Verification

### Monthly Backup Test

```bash
# List snapshots
restic snapshots

# Verify repository integrity, reading a 10% sample of the data
restic check --read-data-subset=10%

# Test a restore
mkdir /tmp/restore-test
restic restore latest --target /tmp/restore-test --include /path/to/critical/file

# Compare with the original
diff -r /tmp/restore-test /original/path
```

---

## Emergency Contacts & Resources

### Critical Information

- **Backblaze B2 login**: Store credentials in a password manager
- **restic password**: Store securely (it CANNOT be recovered)
- **Router admin**: Keep credentials accessible
- **ISP support**: Keep contact info handy

### Documentation URLs

- Docker Swarm: https://docs.docker.com/engine/swarm/
- Traefik: https://doc.traefik.io/traefik/
- Restic: https://restic.readthedocs.io/
- ZFS: https://openzfs.github.io/openzfs-docs/

---

## Recovery Checklists

### Pre-Disaster Preparation

- [ ] Verify backups are running daily
- [ ] Test the restore procedure monthly
- [ ] Document all credentials
- [ ] Keep hardware spares (cables, drives)
- [ ] Maintain off-site config copies

### Post-Recovery Validation

- [ ] All nodes online: `docker node ls`
- [ ] All services running: `docker service ls`
- [ ] Health checks passing: `docker ps --filter health=healthy`
- [ ] DNS resolving correctly
- [ ] Monitoring active (Grafana accessible)
- [ ] Backups resumed: `systemctl status restic-backup.timer`
- [ ] fail2ban protecting: `fail2ban-client status`
- [ ] Network performance normal: `bash network_performance_test.sh`

---

## Automation for Faster Recovery

### Create a Recovery USB Drive

```bash
# Copy all scripts and configs
mkdir -p /mnt/usb/homelab-recovery
cp -r /workspace/homelab/* /mnt/usb/homelab-recovery/

# Include documentation
cp /workspace/homelab/docs/guides/* /mnt/usb/homelab-recovery/docs/

# Store credentials (encrypted)
# Use GPG or similar to encrypt sensitive files
```

### Quick Deploy Script

```bash
# Run from the recovery USB
sudo bash /mnt/usb/homelab-recovery/scripts/deploy_all.sh
```

---

This guide should be reviewed and updated quarterly to ensure accuracy.
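As an appendix, much of the Post-Recovery Validation checklist lends itself to automation. The sketch below is illustrative: the `run_check` and `summary` helpers are not part of the existing scripts, and the commented checks reuse commands from the checklist.

```shell
#!/usr/bin/env bash
# Runs named checks and prints a pass/fail summary; returns non-zero
# if any check failed.
set -uo pipefail

declare -i pass=0 fail=0

# run_check <name> <command...> : runs the command, records the result.
run_check() {
    local name="$1"; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS $name"; pass+=1
    else
        echo "FAIL $name"; fail+=1
    fi
}

# Prints totals; exit status reflects whether everything passed.
summary() {
    echo "passed=$pass failed=$fail"
    [ "$fail" -eq 0 ]
}

# Illustrative checks, taken from the checklist above:
#   run_check "nodes online"     docker node ls
#   run_check "services running" docker service ls
#   run_check "traefik routing"  curl -fsS https://your-domain.com
#   run_check "backup timer"     systemctl is-active restic-backup.timer
#   run_check "fail2ban"         fail2ban-client status
#   summary
```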