# Disaster Recovery Guide
## Overview
This guide provides procedures for recovering from various failure scenarios in the homelab.
## Quick Recovery Matrix
| Scenario | Impact | Recovery Time | Procedure |
|----------|--------|---------------|-----------|
| Single node failure | Partial | < 5 min | [Node Failure](#node-failure) |
| Manager node down | Service disruption | < 10 min | [Manager Recovery](#manager-node-recovery) |
| Storage failure | Data risk | < 30 min | [Storage Recovery](#storage-failure) |
| Network outage | Complete | < 15 min | [Network Recovery](#network-recovery) |
| Complete disaster | Full rebuild | 2-4 hours | [Full Recovery](#complete-disaster-recovery) |
---
## Node Failure
### Symptoms
- Node unreachable via SSH
- Docker services not running on node
- Swarm reports node as "Down"
### Recovery Steps
1. **Verify node status**:
```bash
docker node ls
# Look for "Down" status
```
2. **Attempt to restart node** (if accessible):
```bash
ssh user@<node-ip>
sudo reboot
```
3. **If node is unrecoverable**:
```bash
# Remove from Swarm
docker node rm <node-id> --force
# Services will automatically reschedule to healthy nodes; see the verification sketch below
```
4. **Add replacement node**:
```bash
# On manager node, get join token
docker swarm join-token worker
# On new node, join swarm
docker swarm join --token <token> 192.168.1.196:2377
```
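After the replacement joins, confirm that tasks actually landed back on healthy nodes, and re-apply any placement labels the old node carried; services constrained to a missing label will sit in `Pending`. The `storage=zfs` label below is only an illustration of the pattern, not a label this setup necessarily uses:
```bash
# Confirm tasks are running and see where they were placed (per service)
docker service ps --filter "desired-state=running" <service-name>

# Re-apply placement labels on the replacement node (hypothetical example label)
docker node update --label-add storage=zfs <new-node-id>
```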
---
## Manager Node Recovery
### Symptoms
- Cannot access Portainer UI
- Swarm commands fail
- DNS services disrupted
### Recovery Steps
1. **Promote a worker to manager** (from another manager if available):
```bash
docker node promote <worker-node-id>
```
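If no healthy manager is left to run the promotion (quorum lost), promotion is impossible; Docker's documented last resort is to re-initialize the cluster from a node that still has Swarm state on disk. This discards the old manager set, so treat it strictly as a fallback:
```bash
# Run on a surviving node whose /var/lib/docker/swarm directory is intact;
# this rebuilds a single-manager cluster from the local state
docker swarm init --force-new-cluster --advertise-addr <node-ip>
```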
2. **Restore from backup**:
```bash
# Stop Docker on failed manager
sudo systemctl stop docker
# Restore Portainer data
restic restore latest --target /tmp/restore
sudo cp -r /tmp/restore/portainer/. /var/lib/docker/volumes/portainer/_data/
# Start Docker
sudo systemctl start docker
```
3. **Reconfigure DNS** (if Pi-hole affected):
```bash
# Temporarily point router DNS to another Pi-hole instance
# Update router DNS to: 192.168.1.245, 192.168.1.62
```
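Before repointing the router, confirm the fallback Pi-hole instances actually answer queries; `dig` ships in the dnsutils package on Debian/Ubuntu:
```bash
# Each command should return an answer, not a timeout or SERVFAIL
dig @192.168.1.245 example.com +short
dig @192.168.1.62 example.com +short
```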
---
## Storage Failure
### ZFS Pool Failure
#### Symptoms
- `zpool status` shows DEGRADED or FAULTED
- I/O errors in logs
#### Recovery Steps
1. **Check pool status**:
```bash
zpool status tank
```
2. **If disk failed**:
```bash
# Replace failed disk
zpool replace tank /dev/old-disk /dev/new-disk
# Monitor resilver progress
watch zpool status tank
```
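Kernel device names like `/dev/sdX` can shuffle between boots, so it is safer to hand `zpool replace` a persistent identifier; a quick way to map them:
```bash
# Map stable identifiers to kernel names before replacing
ls -l /dev/disk/by-id/ | grep -v part
# Then reference the by-id path in the replace, e.g.:
# zpool replace tank /dev/old-disk /dev/disk/by-id/ata-<model>_<serial>
```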
3. **If pool is destroyed**:
```bash
# Recreate pool
bash /workspace/homelab/scripts/zfs_setup.sh
# Restore from backup
restic restore latest --target /tank/docker
```
### NAS Failure
#### Recovery Steps
1. **Check NAS connectivity**:
```bash
ping 192.168.1.200
mount | grep /mnt/nas
```
2. **Remount NAS**:
```bash
sudo umount /mnt/nas
sudo mount -a
```
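If the remount fails, check that the share is still being exported before suspecting the client side. This sketch assumes the NAS exports over NFS (`showmount` comes with nfs-common on Debian/Ubuntu); adjust for SMB if that is what you mount:
```bash
# List exports the NAS is offering
showmount -e 192.168.1.200
# Confirm the client's fstab entry survived
grep /mnt/nas /etc/fstab
```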
3. **If NAS hardware failed**:
- Services using NAS volumes will fail
- Redeploy services to use local storage temporarily
- Restore NAS from Time Capsule backup
---
## Network Recovery
### Complete Network Outage
#### Recovery Steps
1. **Check physical connections**:
- Verify all cables connected
- Check switch power and status LEDs
- Restart switch
2. **Verify router**:
```bash
ping 192.168.1.1
# If no response, restart router
```
3. **Check VLAN configuration**:
```bash
ip -d link show
# Reapply if needed
bash /workspace/homelab/scripts/vlan_firewall.sh
```
4. **Restart networking**:
```bash
sudo systemctl restart networking
# Or on each node:
sudo reboot
```
### Partial Network Issues
#### DNS Not Resolving
```bash
# Check Pi-hole status
docker ps | grep pihole
# Restart Pi-hole
docker restart <pihole-container>
# Temporarily use public DNS
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```
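After the restart, verify Pi-hole answers locally before reverting `/etc/resolv.conf` (this assumes port 53 is published on the host):
```bash
# A quick answer here means Pi-hole is resolving again
dig @127.0.0.1 example.com +short
```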
#### Traefik Not Routing
```bash
# Check Traefik service
docker service ls | grep traefik
docker service ps traefik_traefik
# Check logs
docker service logs traefik_traefik
# Force update
docker service update --force traefik_traefik
```
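If the service is running but routes still fail, and the dashboard/API is enabled (`api.insecure=true` in this sketch; 8080 is Traefik's default API port and may differ in your setup), the router list shows whether Traefik picked up the service labels at all:
```bash
# List routers Traefik has loaded; missing entries point at label/config problems
curl -s http://localhost:8080/api/http/routers | head
```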
---
## Complete Disaster Recovery
### Scenario: Total Infrastructure Loss
#### Prerequisites
- Restic backups to Backblaze B2 (off-site)
- Hardware replacement available
- Network infrastructure functional
#### Recovery Steps
1. **Rebuild Core Infrastructure** (the bulk of the 2-4 hour window):
```bash
# Install base OS on all nodes
# Configure network (static IPs, hostnames)
# Install Docker on all nodes
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER
# Initialize Swarm on manager
docker swarm init --advertise-addr 192.168.1.196
# Join workers
docker swarm join-token worker # Get token
# Run on each worker with token
```
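With several workers to rejoin, a small loop over SSH saves repetition; a sketch assuming passwordless SSH, with placeholder worker addresses:
```bash
# Fetch the worker join token quietly, then join each worker over SSH
TOKEN=$(docker swarm join-token -q worker)
for ip in <worker-ip-1> <worker-ip-2>; do
  ssh user@"$ip" "docker swarm join --token $TOKEN 192.168.1.196:2377"
done
```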
2. **Restore Storage**:
```bash
# Recreate ZFS pool
bash /workspace/homelab/scripts/zfs_setup.sh
# Mount NAS
# Follow: /workspace/homelab/docs/guides/NAS_Mount_Guide.md
```
3. **Restore from Backups**:
```bash
# Install restic
sudo apt-get install restic
# Configure credentials
export B2_ACCOUNT_ID="..."
export B2_ACCOUNT_KEY="..."
export RESTIC_REPOSITORY="b2:bucket:/backups"
export RESTIC_PASSWORD="..."
# List snapshots
restic snapshots
# Restore latest
restic restore latest --target /tmp/restore
# Copy to Docker volumes
sudo cp -r /tmp/restore/* /var/lib/docker/volumes/
```
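Before redeploying, sanity-check the restored volumes; empty or unexpectedly small directories here usually mean the wrong snapshot, or a path mismatch between the backup layout and `/var/lib/docker/volumes/`:
```bash
# Rough per-volume size check; compare against what the old cluster held
sudo du -sh /var/lib/docker/volumes/*/ | sort -h
```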
4. **Redeploy Services**:
```bash
# Deploy all stacks
bash /workspace/homelab/scripts/deploy_all.sh
# Verify deployment
bash /workspace/homelab/scripts/validate_deployment.sh
```
5. **Verify Recovery**:
- Check all services: `docker service ls`
- Test Traefik routing: `curl https://your-domain.com`
- Verify Portainer UI access
- Check Grafana dashboards
- Test Home Assistant
---
## Backup Verification
### Monthly Backup Test
```bash
# List snapshots
restic snapshots
# Verify repository integrity (reads a random 10% of the pack data)
restic check --read-data-subset=10%
# Test restore
mkdir /tmp/restore-test
restic restore <snapshot-id> --target /tmp/restore-test --include /path/to/critical/file
# Compare with the original (restic recreates the full source path under the target)
diff -r /tmp/restore-test/path/to/critical/file /path/to/critical/file
```
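To keep this test from being forgotten, the integrity check can be scheduled. A minimal cron sketch; `/usr/local/bin/restic-env.sh` is a hypothetical wrapper that exports the B2/RESTIC_* variables before calling restic:
```bash
# /etc/cron.d/restic-check — monthly 10% read check, 03:00 on the 1st
# restic-env.sh is a hypothetical wrapper exporting the restic credentials
0 3 1 * * root /usr/local/bin/restic-env.sh check --read-data-subset=10% >> /var/log/restic-check.log 2>&1
```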
---
## Emergency Contacts & Resources
### Critical Information
- **Backblaze B2 Login**: Store credentials in password manager
- **restic Password**: Store securely (CANNOT be recovered)
- **Router Admin**: Keep credentials accessible
- **ISP Support**: Keep contact info handy
### Documentation URLs
- Docker Swarm: https://docs.docker.com/engine/swarm/
- Traefik: https://doc.traefik.io/traefik/
- Restic: https://restic.readthedocs.io/
- ZFS: https://openzfs.github.io/openzfs-docs/
---
## Recovery Checklists
### Pre-Disaster Preparation
- [ ] Verify backups running daily
- [ ] Test restore procedure monthly
- [ ] Document all credentials
- [ ] Keep hardware spares (cables, drives)
- [ ] Maintain off-site config copies
### Post-Recovery Validation
- [ ] All nodes online: `docker node ls`
- [ ] All services running: `docker service ls`
- [ ] Health checks passing: `docker ps --filter health=healthy`
- [ ] DNS resolving correctly
- [ ] Monitoring active (Grafana accessible)
- [ ] Backups resumed: `systemctl status restic-backup.timer`
- [ ] fail2ban protecting: `fail2ban-client status`
- [ ] Network performance normal: `bash network_performance_test.sh`
---
## Automation for Faster Recovery
### Create Recovery USB Drive
```bash
# Copy all scripts, configs, and documentation
mkdir -p /mnt/usb/homelab-recovery
cp -r /workspace/homelab/* /mnt/usb/homelab-recovery/
# Store credentials (encrypted)
# Use GPG or similar to encrypt sensitive files
```
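For the credentials file itself, a symmetric GPG archive keeps it usable from any machine with a passphrase; `credentials.txt` is a stand-in for whatever file holds your secrets:
```bash
# Encrypt with a passphrase (prompted); writes credentials.txt.gpg alongside
gpg --symmetric --cipher-algo AES256 /mnt/usb/homelab-recovery/credentials.txt
# Securely remove the plaintext copy
shred -u /mnt/usb/homelab-recovery/credentials.txt
```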
### Quick Deploy Script
```bash
# Run from recovery USB
sudo bash /mnt/usb/homelab-recovery/scripts/deploy_all.sh
```
---
This guide should be reviewed and updated quarterly to ensure accuracy.