sterl/Homelab

sterlenjohnson 0769ca6888 Initial commit: homelab configuration and documentation

2025-11-29 19:03:14 +00:00

Disaster Recovery Guide

Overview

This guide provides procedures for recovering from various failure scenarios in the homelab.

Quick Recovery Matrix

| Scenario | Impact | Recovery Time | Procedure |
|---|---|---|---|
| Single node failure | Partial | < 5 min | Node Failure |
| Manager node down | Service disruption | < 10 min | Manager Recovery |
| Storage failure | Data risk | < 30 min | Storage Recovery |
| Network outage | Complete | < 15 min | Network Recovery |
| Complete disaster | Full rebuild | < 2 hours | Full Recovery |

Node Failure

Symptoms

Node unreachable via SSH
Docker services not running on node
Swarm reports node as "Down"

Recovery Steps

Verify node status:

docker node ls
# Look for "Down" status
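
With more than a few nodes, the "Down" entries can be filtered out mechanically. A minimal sketch, assuming the standard `.Hostname` and `.Status` placeholders of `docker node ls --format`; the `down_nodes` helper is hypothetical and not part of the homelab scripts:

```shell
# down_nodes: print the hostname of every node whose status is "Down".
# Input: "hostname status" pairs on stdin, one node per line.
# (Hypothetical helper, not part of the homelab scripts.)
down_nodes() {
  awk '$2 == "Down" { print $1 }'
}

# On a manager node:
#   docker node ls --format '{{.Hostname}} {{.Status}}' | down_nodes
```

Reading from stdin keeps the filter independent of Docker, so it can be tried against saved output first.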

Attempt to restart node (if accessible):
```
ssh user@<node-ip>
sudo reboot
```

If node is unrecoverable:

# Remove from Swarm
docker node rm <node-id> --force

# Services will automatically reschedule to healthy nodes

Add replacement node:

# On manager node, get join token
docker swarm join-token worker

# On new node, join swarm
docker swarm join --token <token> 192.168.1.196:2377

Manager Node Recovery

Symptoms

Cannot access Portainer UI
Swarm commands fail
DNS services disrupted

Recovery Steps

Promote a worker to manager. This must be run on a surviving manager that still has quorum; a worker cannot promote itself:
```
docker node promote <worker-node-id>
```

Restore from backup:

# Stop Docker on failed manager
sudo systemctl stop docker

# Restore Portainer data (-a preserves ownership and permissions)
restic restore latest --target /tmp/restore
sudo cp -a /tmp/restore/portainer /var/lib/docker/volumes/portainer/_data/

# Start Docker
sudo systemctl start docker

Reconfigure DNS (if Pi-hole affected):

# Temporarily point router DNS to another Pi-hole instance
# Update router DNS to: 192.168.1.245, 192.168.1.62

Storage Failure

ZFS Pool Failure

Symptoms

zpool status shows DEGRADED or FAULTED
I/O errors in logs

Recovery Steps

Check pool status:
```
zpool status tank
```
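
For a quick yes/no answer across all pools, the health column can be filtered. A minimal sketch, assuming the tab-separated output of `zpool list -H -o name,health`; the `unhealthy_pools` helper name is hypothetical:

```shell
# unhealthy_pools: print "pool health" for every pool that is not ONLINE.
# Input: output of `zpool list -H -o name,health` on stdin.
# (Hypothetical helper, not part of the homelab scripts.)
unhealthy_pools() {
  awk '$2 != "ONLINE" { print $1, $2 }'
}

# Usage:
#   zpool list -H -o name,health | unhealthy_pools
```

Empty output means every pool reports ONLINE; anything else warrants a full `zpool status` on the named pool.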

If disk failed:

# Replace failed disk
zpool replace tank /dev/old-disk /dev/new-disk

# Monitor resilver progress
watch zpool status tank

If pool is destroyed:

# Recreate pool
bash /workspace/homelab/scripts/zfs_setup.sh

# Restore from backup
restic restore latest --target /tank/docker

NAS Failure

Recovery Steps

Check NAS connectivity:

ping 192.168.1.200
mount | grep /mnt/nas
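
Because `grep /mnt/nas` also matches paths that merely contain that string, an exact match on the mount-point column can be more reliable. A minimal sketch, assuming a mount table in `/proc/mounts` format; the `is_mounted` helper name is hypothetical:

```shell
# is_mounted: succeed only if the given path is listed as a mount point.
# $1 = mount point to look for
# Input: mount table in /proc/mounts format on stdin
#        (device, mount point, filesystem type, ...).
# (Hypothetical helper, not part of the homelab scripts.)
is_mounted() {
  awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }'
}

# Usage:
#   if is_mounted /mnt/nas < /proc/mounts; then
#     echo "NAS mounted"
#   else
#     echo "NAS not mounted"
#   fi
```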

Remount NAS:
```
sudo umount /mnt/nas
sudo mount -a
```
If NAS hardware failed:
- Services using NAS volumes will fail
- Redeploy services to use local storage temporarily
- Restore NAS from Time Capsule backup

Network Recovery

Complete Network Outage

Recovery Steps

Check physical connections:
- Verify all cables connected
- Check switch power and status LEDs
- Restart switch

Verify router:

ping 192.168.1.1
# If no response, restart router

Check VLAN configuration:

ip -d link show
# Reapply if needed
bash /workspace/homelab/scripts/vlan_firewall.sh

Restart networking:

sudo systemctl restart networking
# If that does not restore connectivity, reboot each node:
sudo reboot
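
Once the network is back, the key addresses used in this guide can be probed in one pass. A minimal sketch; `check_hosts` and `ping_once` are hypothetical helpers, and the `-W 2` timeout flag assumes the Linux iputils version of `ping`:

```shell
# check_hosts: run a probe command against each host and report the result.
# $1 = name of a probe command that takes a host as its only argument
# remaining args = hosts to probe
# Returns non-zero if any host failed.
# (Hypothetical helper, not part of the homelab scripts.)
check_hosts() {
  probe=$1; shift
  failed=0
  for host in "$@"; do
    if "$probe" "$host" >/dev/null 2>&1; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}

# One ICMP echo request with a 2-second timeout (Linux iputils flags).
ping_once() { ping -c 1 -W 2 "$1"; }

# Usage (router, Swarm manager, NAS):
#   check_hosts ping_once 192.168.1.1 192.168.1.196 192.168.1.200
```

Passing the probe as an argument keeps the loop reusable, for example with a TCP check instead of ICMP.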

Partial Network Issues

DNS Not Resolving

# Check Pi-hole status
docker ps | grep pihole

# Restart Pi-hole
docker restart <pihole-container>

# Temporarily use public DNS
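# (the redirect must run as root, so pipe through sudo tee rather than using "sudo echo ... >")
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf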
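
Overwriting the resolver configuration is easier to undo if the previous file is kept. A minimal sketch; the `set_fallback_dns` helper and the `.bak` suffix are hypothetical, and on hosts where a resolver service manages `/etc/resolv.conf` the change may be overwritten again:

```shell
# set_fallback_dns: back up a resolv.conf file, then replace its contents
# with the given nameservers.
# $1 = path to the resolv.conf file; remaining args = nameserver IPs
# (Hypothetical helper, not part of the homelab scripts.)
set_fallback_dns() {
  conf=$1; shift
  # Keep a copy of the current file so the change can be reverted.
  if [ -f "$conf" ]; then
    cp -p "$conf" "$conf.bak"
  fi
  for ns in "$@"; do
    echo "nameserver $ns"
  done > "$conf"
}

# Usage (as root):
#   set_fallback_dns /etc/resolv.conf 8.8.8.8
# Revert once Pi-hole is healthy again:
#   mv /etc/resolv.conf.bak /etc/resolv.conf
```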

Traefik Not Routing

# Check Traefik service
docker service ls | grep traefik
docker service ps traefik_traefik

# Check logs
docker service logs traefik_traefik

# Force update
docker service update --force traefik_traefik

Complete Disaster Recovery

Scenario: Total Infrastructure Loss

Prerequisites

Restic backups to Backblaze B2 (off-site)
Hardware replacement available
Network infrastructure functional

Recovery Steps

Rebuild Core Infrastructure (2-4 hours):

# Install base OS on all nodes
# Configure network (static IPs, hostnames)

# Install Docker on all nodes
curl -fsSL https://get.docker.com | sh
# Log out and back in afterwards for the group change to take effect
sudo usermod -aG docker "$USER"

# Initialize Swarm on manager
docker swarm init --advertise-addr 192.168.1.196

# Join workers
docker swarm join-token worker  # Get token
# Run on each worker with token

Restore Storage:

# Recreate ZFS pool
bash /workspace/homelab/scripts/zfs_setup.sh

# Mount NAS
# Follow: /workspace/homelab/docs/guides/NAS_Mount_Guide.md

Restore from Backups:

# Install restic (refresh the package index first on a fresh OS install)
sudo apt-get update
sudo apt-get install -y restic

# Configure credentials
export B2_ACCOUNT_ID="..."
export B2_ACCOUNT_KEY="..."
export RESTIC_REPOSITORY="b2:bucket:/backups"
export RESTIC_PASSWORD="..."

# List snapshots
restic snapshots

# Restore latest
restic restore latest --target /tmp/restore

# Copy to Docker volumes (-a preserves ownership and permissions)
sudo cp -a /tmp/restore/* /var/lib/docker/volumes/
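
A restore attempt with a missing credential fails late and with a less obvious error, so it can help to check the four variables up front. A minimal sketch; the `missing_restic_env` helper name is hypothetical, and the variable list is taken from the exports above:

```shell
# missing_restic_env: print the name of each required variable that is
# unset or empty. Returns non-zero if anything is missing.
# Values are never printed, only names.
# (Hypothetical helper, not part of the homelab scripts.)
missing_restic_env() {
  missing=0
  for name in B2_ACCOUNT_ID B2_ACCOUNT_KEY RESTIC_REPOSITORY RESTIC_PASSWORD; do
    # Indirect lookup of the variable called $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, before running any restic command:
#   missing_restic_env || echo "set the variables listed above first" >&2
```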

Redeploy Services:

# Deploy all stacks
bash /workspace/homelab/scripts/deploy_all.sh

# Verify deployment
bash /workspace/homelab/scripts/validate_deployment.sh

Verify Recovery:
- Check all services: docker service ls
- Test Traefik routing: curl https://your-domain.com
- Verify Portainer UI access
- Check Grafana dashboards
- Test Home Assistant
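
The first check can be narrowed to only the services that are not fully up. A minimal sketch, assuming the standard `.Name` and `.Replicas` placeholders of `docker service ls --format`, where replicas are reported as `running/desired`; the `degraded_services` helper name is hypothetical:

```shell
# degraded_services: print services whose running replica count differs
# from the desired count.
# Input: "name running/desired" on stdin, one service per line.
# (Hypothetical helper, not part of the homelab scripts.)
degraded_services() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $2 }'
}

# Usage on a manager node:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | degraded_services
```

Empty output means every service has reached its desired replica count.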

Backup Verification

Monthly Backup Test

# List snapshots
restic snapshots

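# Verify repository integrity, reading a random 10% of the pack data
restic check --read-data-subset=10%

# Test restore
mkdir -p /tmp/restore-test
restic restore <snapshot-id> --target /tmp/restore-test --include /path/to/critical/file

# Compare with original (restic recreates the full original path under the target)
diff -r /tmp/restore-test/path/to/critical/file /path/to/critical/file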
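
The comparison step can be wrapped so that a monthly test produces a clear pass or fail. A minimal sketch, assuming the snapshot was taken with absolute paths so that restic recreates them beneath the restore target; the `verify_restore` helper name is hypothetical:

```shell
# verify_restore: compare a restored path with the live original.
# $1 = restore target directory (as passed to `restic restore --target`)
# $2 = absolute path of the original file or directory
# The restored copy is expected at "$1$2" (target + original path).
# (Hypothetical helper, not part of the homelab scripts.)
verify_restore() {
  if diff -r "$1$2" "$2" >/dev/null 2>&1; then
    echo "MATCH $2"
  else
    echo "DIFFER $2"
    return 1
  fi
}

# Usage:
#   verify_restore /tmp/restore-test /path/to/critical/file
```

A "DIFFER" result is expected for files that changed since the snapshot was taken, so the test is most meaningful on data that rarely changes.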

Emergency Contacts & Resources

Critical Information

Backblaze B2 Login: Store credentials in password manager
restic Password: Store securely (CANNOT be recovered)
Router Admin: Keep credentials accessible
ISP Support: Keep contact info handy

Documentation URLs

Docker Swarm: https://docs.docker.com/engine/swarm/
Traefik: https://doc.traefik.io/traefik/
Restic: https://restic.readthedocs.io/
ZFS: https://openzfs.github.io/openzfs-docs/

Recovery Checklists

Pre-Disaster Preparation

Verify backups running daily
Test restore procedure monthly
Document all credentials
Keep hardware spares (cables, drives)
Maintain off-site config copies

Post-Recovery Validation

All nodes online: docker node ls
All services running: docker service ls
Health checks passing: docker ps --filter health=unhealthy (run on each node; expect no containers listed)
DNS resolving correctly
Monitoring active (Grafana accessible)
Backups resumed: systemctl status restic-backup.timer
fail2ban protecting: fail2ban-client status
Network performance normal: bash network_performance_test.sh
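
The command-based items in this checklist can be run in sequence with a uniform result line. A minimal sketch; the `run_check` helper name is hypothetical, and the commands in the usage comment are the ones listed above:

```shell
# run_check: run a command quietly and print PASS or FAIL with a label.
# $1 = label; remaining args = command (and its arguments) to run
# Returns non-zero when the command fails.
# (Hypothetical helper, not part of the homelab scripts.)
run_check() {
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Usage:
#   run_check "swarm reachable" docker node ls
#   run_check "backup timer"    systemctl status restic-backup.timer
#   run_check "fail2ban"        fail2ban-client status
```

This only reports whether each command exited successfully; items such as DNS resolution and Grafana access still need a manual look.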

Automation for Faster Recovery

Create Recovery USB Drive

# Copy all scripts and configs (the trailing /. also picks up hidden files)
mkdir -p /mnt/usb/homelab-recovery
cp -r /workspace/homelab/. /mnt/usb/homelab-recovery/

# Include documentation
cp /workspace/homelab/docs/guides/* /mnt/usb/homelab-recovery/docs/

# Store credentials (encrypted)
# Use GPG or similar to encrypt sensitive files

Quick Deploy Script

# Run from recovery USB
sudo bash /mnt/usb/homelab-recovery/scripts/deploy_all.sh

This guide should be reviewed and updated quarterly to ensure accuracy.
