Disaster Recovery Guide

Overview

This guide covers recovery procedures for node, manager, storage, and network failures in the homelab, as well as a full rebuild after total infrastructure loss.

Quick Recovery Matrix

| Scenario | Impact | Recovery Time | Procedure |
|---|---|---|---|
| Single node failure | Partial | < 5 min | Node Failure |
| Manager node down | Service disruption | < 10 min | Manager Recovery |
| Storage failure | Data risk | < 30 min | Storage Recovery |
| Network outage | Complete | < 15 min | Network Recovery |
| Complete disaster | Full rebuild | < 2 hours | Full Recovery |

Node Failure

Symptoms

  • Node unreachable via SSH
  • Docker services not running on node
  • Swarm reports node as "Down"

Recovery Steps

  1. Verify node status:

    docker node ls
    # Look for "Down" status
    
  2. Attempt to restart node (if accessible):

    ssh user@<node-ip>
    sudo reboot
    
  3. If node is unrecoverable:

    # Remove from Swarm
    docker node rm <node-id> --force
    
    # Services will automatically reschedule to healthy nodes
    
  4. Add replacement node:

    # On manager node, get join token
    docker swarm join-token worker
    
    # On new node, join swarm
    docker swarm join --token <token> 192.168.1.196:2377
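The status check in step 1 can be scripted by filtering `docker node ls` output. The following is a minimal sketch, not part of the existing homelab scripts: the function name `down_nodes` is hypothetical, and it assumes the node list is printed with the format string `{{.Hostname}} {{.Status}}`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the homelab scripts): print the hostname of
# every Swarm node whose status is "Down".
# Expects "<hostname> <status>" pairs on stdin, one node per line.
down_nodes() {
  awk '$2 == "Down" { print $1 }'
}

# Usage on a manager node:
#   docker node ls --format '{{.Hostname}} {{.Status}}' | down_nodes
```

An empty result means no node is currently reported as "Down".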
    

Manager Node Recovery

Symptoms

  • Cannot access Portainer UI
  • Swarm commands fail
  • DNS services disrupted

Recovery Steps

  1. Promote a worker to manager (run this from another healthy manager; it cannot be done if the failed node was the only manager):

    docker node promote <worker-node-id>
    
  2. Restore from backup:

    # Stop Docker on failed manager
    sudo systemctl stop docker
    
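    # Restore Portainer data. restic recreates the original absolute paths
    # under --target, so adjust the source path to match your backup layout.
    restic restore latest --target /tmp/restore
    # -a preserves ownership and permissions; the trailing /. copies the
    # directory contents instead of nesting a second "portainer" directory.
    sudo cp -a /tmp/restore/portainer/. /var/lib/docker/volumes/portainer/_data/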
    
    # Start Docker
    sudo systemctl start docker
    
  3. Reconfigure DNS (if Pi-hole affected):

    # Temporarily point router DNS to another Pi-hole instance
    # Update router DNS to: 192.168.1.245, 192.168.1.62
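Whether promotion is possible depends on manager quorum: Swarm managers use Raft, which needs a majority of managers to be reachable before it accepts changes. The arithmetic can be sketched as below; the function names are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating Raft quorum arithmetic for Swarm managers.

# Managers that must be reachable for the swarm to accept changes: floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Manager failures the swarm can absorb without losing quorum: floor((N-1)/2).
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

# Usage:
#   quorum_size 3          # prints 2
#   tolerated_failures 3   # prints 1
#   tolerated_failures 1   # prints 0: a single-manager swarm cannot lose its manager
```

If quorum is already lost, `docker node promote` fails; Docker's documented way out is to run `docker swarm init --force-new-cluster` on a surviving manager.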
    

Storage Failure

ZFS Pool Failure

Symptoms

  • zpool status shows DEGRADED or FAULTED
  • I/O errors in logs

Recovery Steps

  1. Check pool status:

    zpool status tank
    
  2. If disk failed:

    # Replace failed disk
    zpool replace tank /dev/old-disk /dev/new-disk
    
    # Monitor resilver progress
    watch zpool status tank
    
  3. If pool is destroyed:

    # Recreate pool
    bash /workspace/homelab/scripts/zfs_setup.sh
    
    # Restore from backup
    restic restore latest --target /tank/docker
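For unattended monitoring, the pool status check in step 1 can be reduced to a filter over `zpool list`. This is a sketch under assumptions: `unhealthy_pools` is a hypothetical name, and the input is expected in the tab-separated form printed by `zpool list -H -o name,health`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every pool whose health is anything other than
# ONLINE. Expects "<pool><TAB><health>" lines on stdin.
unhealthy_pools() {
  awk '$2 != "ONLINE" { print $1 ": " $2 }'
}

# Usage:
#   zpool list -H -o name,health | unhealthy_pools
```

Any output is a signal to run `zpool status tank` and follow the steps above.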
    

NAS Failure

Recovery Steps

  1. Check NAS connectivity:

    ping 192.168.1.200
    mount | grep /mnt/nas
    
  2. Remount NAS:

    sudo umount /mnt/nas
    sudo mount -a
    
  3. If NAS hardware failed:

    • Services using NAS volumes will fail
    • Redeploy services to use local storage temporarily
    • Restore NAS from Time Capsule backup
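Steps 1 and 2 can be combined into a conditional remount. The sketch below assumes a Linux host with `/proc/mounts`; the function name `is_mounted` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given mount point appears in a mount table.
# Reads /proc/mounts-style lines on stdin: "<device> <mountpoint> <fstype> ...".
is_mounted() {
  awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }'
}

# Usage:
#   if ! is_mounted /mnt/nas < /proc/mounts; then
#     sudo mount -a
#   fi
```

This only confirms that the mount is registered. A stale network mount can still appear in the table, so the `ping` check from step 1 remains useful.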

Network Recovery

Complete Network Outage

Recovery Steps

  1. Check physical connections:

    • Verify all cables connected
    • Check switch power and status LEDs
    • Restart switch
  2. Verify router:

    ping 192.168.1.1
    # If no response, restart router
    
  3. Check VLAN configuration:

    ip -d link show
    # Reapply if needed
    bash /workspace/homelab/scripts/vlan_firewall.sh
    
  4. Restart networking:

    sudo systemctl restart networking
    # Or on each node:
    sudo reboot
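To narrow down where an outage starts, hosts can be probed in order from the router outward. This is a rough sketch, not an existing homelab script: `first_unreachable` and the `PROBE` variable are hypothetical, and the default probe assumes the Linux `ping` options `-c` (count) and `-W` (reply timeout in seconds).

```shell
#!/usr/bin/env bash
# Hypothetical helper: probe a list of hosts in order and print the first one
# that does not answer. PROBE can be overridden; the default sends one ICMP
# echo request and waits up to 2 seconds for a reply.
PROBE=${PROBE:-"ping -c 1 -W 2"}

first_unreachable() {
  local host
  for host in "$@"; do
    # $PROBE is intentionally unquoted so its options split into words.
    if ! $PROBE "$host" >/dev/null 2>&1; then
      echo "$host"
      return 1
    fi
  done
  return 0
}

# Usage, working outward from the router (addresses taken from this guide):
#   first_unreachable 192.168.1.1 192.168.1.196 192.168.1.200
```

If the router itself is reported, start with the physical checks in step 1 before touching VLAN configuration.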
    

Partial Network Issues

DNS Not Resolving

# Check Pi-hole status
docker ps | grep pihole

# Restart Pi-hole
docker restart <pihole-container>

# Temporarily use public DNS
# (the redirect must run as root, so pipe through sudo tee instead of "sudo echo ... >")
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
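A temporary resolver override is easy to forget once Pi-hole is back. One way to make it reversible is to keep a copy of the original file, as in this sketch. The function names are hypothetical, and it assumes `/etc/resolv.conf` is a plain file rather than a symlink managed by systemd-resolved or NetworkManager.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: swap in a temporary resolver and undo it afterwards.
# Both take the resolver file as the first argument (/etc/resolv.conf in
# practice, which requires root).

set_temp_dns() {
  local file="$1" server="$2"
  cp -p "$file" "$file.bak"                 # keep the original for restore
  printf 'nameserver %s\n' "$server" > "$file"
}

restore_dns() {
  local file="$1"
  mv "$file.bak" "$file"                    # put the saved copy back
}

# Usage (as root):
#   set_temp_dns /etc/resolv.conf 8.8.8.8
#   ... repair Pi-hole ...
#   restore_dns /etc/resolv.conf
```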

Traefik Not Routing

# Check Traefik service
docker service ls | grep traefik
docker service ps traefik_traefik

# Check logs
docker service logs traefik_traefik

# Force update
docker service update --force traefik_traefik

Complete Disaster Recovery

Scenario: Total Infrastructure Loss

Prerequisites

  • Restic backups to Backblaze B2 (off-site)
  • Hardware replacement available
  • Network infrastructure functional

Recovery Steps

  1. Rebuild Core Infrastructure (2-4 hours):

    # Install base OS on all nodes
    # Configure network (static IPs, hostnames)
    
    # Install Docker on all nodes
    curl -fsSL https://get.docker.com | sh
    # Log out and back in afterwards for the group change to take effect
    sudo usermod -aG docker "$USER"
    
    # Initialize Swarm on manager
    docker swarm init --advertise-addr 192.168.1.196
    
    # Join workers
    docker swarm join-token worker  # Get token
    # Run on each worker with token
    
  2. Restore Storage:

    # Recreate ZFS pool
    bash /workspace/homelab/scripts/zfs_setup.sh
    
    # Mount NAS
    # Follow: /workspace/homelab/docs/guides/NAS_Mount_Guide.md
    
  3. Restore from Backups:

    # Install restic
    sudo apt-get install restic
    
    # Configure credentials
    export B2_ACCOUNT_ID="..."
    export B2_ACCOUNT_KEY="..."
    export RESTIC_REPOSITORY="b2:bucket:/backups"
    export RESTIC_PASSWORD="..."
    
    # List snapshots
    restic snapshots
    
    # Restore latest
    restic restore latest --target /tmp/restore
    
    # Copy to Docker volumes (-a preserves ownership, permissions, and timestamps)
    sudo cp -a /tmp/restore/* /var/lib/docker/volumes/
    
  4. Redeploy Services:

    # Deploy all stacks
    bash /workspace/homelab/scripts/deploy_all.sh
    
    # Verify deployment
    bash /workspace/homelab/scripts/validate_deployment.sh
    
  5. Verify Recovery:

    • Check all services: docker service ls
    • Test Traefik routing: curl https://your-domain.com
    • Verify Portainer UI access
    • Check Grafana dashboards
    • Test Home Assistant
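During a rebuild it is easy to start step 3 with one of the restic credentials missing. A small pre-flight check can catch that before any restic command runs. This is a sketch only: `missing_restic_env` is a hypothetical name, and the variable list is taken from the exports in step 3.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: print the name of every required variable
# that is unset or empty, and return non-zero if any is missing.
missing_restic_env() {
  local var value missing=0
  for var in B2_ACCOUNT_ID B2_ACCOUNT_KEY RESTIC_REPOSITORY RESTIC_PASSWORD; do
    eval "value=\${$var:-}"     # read the variable whose name is in $var
    if [ -z "$value" ]; then
      echo "$var"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   missing_restic_env || { echo "set the variables listed above first" >&2; exit 1; }
#   restic snapshots
```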

Backup Verification

Monthly Backup Test

# List snapshots
restic snapshots

# Verify repository integrity (reads a random 10% of the stored data)
restic check --read-data-subset=10%

# Test restore
mkdir -p /tmp/restore-test
restic restore <snapshot-id> --target /tmp/restore-test --include /path/to/critical/file

# Compare with original (restic recreates the full source path under --target)
diff -r /tmp/restore-test/path/to/critical/file /path/to/critical/file
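Because restic recreates the absolute source path underneath the restore target, the comparison step can be wrapped so the path mapping is not done by hand. This is a sketch under that assumption; `compare_restore` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a restored file or directory with the live copy.
# The restored copy of /some/path is expected at <target>/some/path.
# Returns 0 when both sides match, non-zero when they differ.
compare_restore() {
  local target="$1" original="$2"
  diff -r -q "$target$original" "$original"
}

# Usage:
#   compare_restore /tmp/restore-test /path/to/critical/file
```

Files that changed after the snapshot was taken will be reported as different, so this works best on data that is static between backup and test.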

Emergency Contacts & Resources

Critical Information

  • Backblaze B2 Login: Store credentials in password manager
  • restic Password: Store securely (CANNOT be recovered)
  • Router Admin: Keep credentials accessible
  • ISP Support: Keep contact info handy

Recovery Checklists

Pre-Disaster Preparation

  • Verify backups running daily
  • Test restore procedure monthly
  • Document all credentials
  • Keep hardware spares (cables, drives)
  • Maintain off-site config copies

Post-Recovery Validation

  • All nodes online: docker node ls
  • All services running: docker service ls
  • Health checks passing: docker ps --filter health=healthy
  • DNS resolving correctly
  • Monitoring active (Grafana accessible)
  • Backups resumed: systemctl status restic-backup.timer
  • fail2ban protecting: fail2ban-client status
  • Network performance normal: bash network_performance_test.sh
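The "all services running" item can be checked mechanically by comparing running and desired replica counts. This is a minimal sketch, assuming `docker service ls` is printed with the format string `{{.Name}} {{.Replicas}}`; the function name `under_replicated` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print services whose running replica count is below the
# desired count. Expects "<name> <running>/<desired>" lines on stdin.
under_replicated() {
  awk '{
    split($2, r, "/")                       # r[1] = running, r[2] = desired
    if (r[1] + 0 < r[2] + 0) print $1 " " $2
  }'
}

# Usage on a manager node:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | under_replicated
```

No output means every listed service has reached its desired replica count.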

Automation for Faster Recovery

Create Recovery USB Drive

# Copy all scripts and configs
mkdir -p /mnt/usb/homelab-recovery
cp -r /workspace/homelab/* /mnt/usb/homelab-recovery/

# Include documentation
cp /workspace/homelab/docs/guides/* /mnt/usb/homelab-recovery/docs/

# Store credentials (encrypted)
# Use GPG or similar to encrypt sensitive files

Quick Deploy Script

# Run from recovery USB
sudo bash /mnt/usb/homelab-recovery/scripts/deploy_all.sh

This guide should be reviewed and updated quarterly to ensure accuracy.