# Disaster Recovery Guide

## Overview

This guide provides procedures for recovering from various failure scenarios in the homelab.

## Quick Recovery Matrix

| Scenario | Impact | Recovery Time | Procedure |
|----------|--------|---------------|-----------|
| Single node failure | Partial | < 5 min | [Node Failure](#node-failure) |
| Manager node down | Service disruption | < 10 min | [Manager Recovery](#manager-node-recovery) |
| Storage failure | Data risk | < 30 min | [Storage Recovery](#storage-failure) |
| Network outage | Complete | < 15 min | [Network Recovery](#network-recovery) |
| Complete disaster | Full rebuild | 2-4+ hours | [Full Recovery](#complete-disaster-recovery) |

---

## Node Failure

### Symptoms
- Node unreachable via SSH
- Docker services not running on the node
- Swarm reports the node as "Down"

### Recovery Steps

1. **Verify node status**:
   ```bash
   docker node ls
   # Look for "Down" status
   ```

2. **Attempt to restart the node** (if accessible):
   ```bash
   ssh user@<node-ip>
   sudo reboot
   ```

3. **If the node is unrecoverable**:
   ```bash
   # Remove it from the Swarm
   docker node rm <node-id> --force

   # Services will automatically reschedule to healthy nodes
   ```

4. **Add a replacement node**:
   ```bash
   # On a manager node, get the join token
   docker swarm join-token worker

   # On the new node, join the swarm
   docker swarm join --token <token> 192.168.1.196:2377
   ```

---

## Manager Node Recovery

### Symptoms
- Cannot access the Portainer UI
- Swarm commands fail
- DNS services disrupted

### Recovery Steps

1. **Promote a worker to manager** (from another manager, if available):
   ```bash
   docker node promote <worker-node-id>
   ```

2. **Restore from backup**:
   ```bash
   # Stop Docker on the failed manager
   sudo systemctl stop docker

   # Restore Portainer data (copy the directory contents, not the directory itself)
   restic restore latest --target /tmp/restore
   sudo cp -a /tmp/restore/portainer/. /var/lib/docker/volumes/portainer/_data/

   # Start Docker
   sudo systemctl start docker
   ```

3. **Reconfigure DNS** (if Pi-hole is affected):
   ```bash
   # Temporarily point the router's DNS at the surviving Pi-hole instances
   # Update router DNS to: 192.168.1.245, 192.168.1.62
   ```
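
   Before repointing the router, it is worth confirming that the fallback resolvers actually answer. A minimal check, assuming `dig` is installed (the IPs are the Pi-hole instances listed above):

   ```bash
   # Each query should return an answer promptly if the Pi-hole is healthy
   dig @192.168.1.245 example.com +short
   dig @192.168.1.62 example.com +short
   ```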

---

## Storage Failure

### ZFS Pool Failure

#### Symptoms
- `zpool status` shows DEGRADED or FAULTED
- I/O errors in logs

#### Recovery Steps

1. **Check pool status**:
   ```bash
   zpool status tank
   ```

2. **If a disk failed**:
   ```bash
   # Replace the failed disk
   zpool replace tank /dev/old-disk /dev/new-disk

   # Monitor resilver progress
   watch zpool status tank
   ```

3. **If the pool is destroyed**:
   ```bash
   # Recreate the pool
   bash /workspace/homelab/scripts/zfs_setup.sh

   # Restore from backup
   restic restore latest --target /tank/docker
   ```

### NAS Failure

#### Recovery Steps

1. **Check NAS connectivity**:
   ```bash
   ping 192.168.1.200
   mount | grep /mnt/nas
   ```

2. **Remount the NAS**:
   ```bash
   sudo umount /mnt/nas
   sudo mount -a
   ```

3. **If the NAS hardware failed**:
   - Services using NAS volumes will fail
   - Redeploy affected services to local storage temporarily (see the sketch below)
   - Restore the NAS from its Time Capsule backup
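
   One way to fail over temporarily is to swap a service's NAS-backed mount for local storage in place. A sketch only; the service name, the `/data` mount target, and the local path are hypothetical placeholders:

   ```bash
   # Replace the NAS-backed mount with a local bind mount until the NAS returns
   docker service update \
     --mount-rm /data \
     --mount-add type=bind,source=/tank/docker/app-data,target=/data \
     <service-name>
   ```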

---

## Network Recovery

### Complete Network Outage

#### Recovery Steps

1. **Check physical connections**:
   - Verify all cables are connected
   - Check switch power and status LEDs
   - Restart the switch

2. **Verify the router**:
   ```bash
   ping 192.168.1.1
   # If no response, restart the router
   ```

3. **Check VLAN configuration**:
   ```bash
   ip -d link show
   # Reapply if needed
   bash /workspace/homelab/scripts/vlan_firewall.sh
   ```

4. **Restart networking**:
   ```bash
   sudo systemctl restart networking
   # Or on each node:
   sudo reboot
   ```

### Partial Network Issues

#### DNS Not Resolving

```bash
# Check Pi-hole status
docker ps | grep pihole

# Restart Pi-hole
docker restart <pihole-container>

# Temporarily use public DNS
# (note: `sudo echo ... > /etc/resolv.conf` would fail, because the
# redirect runs as the unprivileged user; tee runs under sudo instead)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```

#### Traefik Not Routing

```bash
# Check the Traefik service
docker service ls | grep traefik
docker service ps traefik_traefik

# Check logs
docker service logs traefik_traefik

# Force update
docker service update --force traefik_traefik
```

---

## Complete Disaster Recovery

### Scenario: Total Infrastructure Loss

#### Prerequisites
- Restic backups to Backblaze B2 (off-site)
- Replacement hardware available
- Network infrastructure functional

#### Recovery Steps

1. **Rebuild Core Infrastructure** (2-4 hours):

   ```bash
   # Install the base OS on all nodes
   # Configure the network (static IPs, hostnames)

   # Install Docker on all nodes
   curl -fsSL https://get.docker.com | sh
   sudo usermod -aG docker $USER

   # Initialize the Swarm on the manager
   docker swarm init --advertise-addr 192.168.1.196

   # Join workers
   docker swarm join-token worker  # Get the token
   # Run the join command on each worker with that token (see the loop sketch below)
   ```
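
   Joining several workers by hand is tedious, so the join can be scripted from the manager. A sketch assuming passwordless SSH; the worker IPs and the `user` login are placeholders:

   ```bash
   # Fetch the worker join token, then run the join on each worker over SSH
   TOKEN=$(docker swarm join-token -q worker)
   for ip in 192.168.1.197 192.168.1.198; do  # placeholder worker IPs
     ssh user@"$ip" "docker swarm join --token $TOKEN 192.168.1.196:2377"
   done
   ```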

2. **Restore Storage**:

   ```bash
   # Recreate the ZFS pool
   bash /workspace/homelab/scripts/zfs_setup.sh

   # Mount the NAS
   # Follow: /workspace/homelab/docs/guides/NAS_Mount_Guide.md
   ```

3. **Restore from Backups**:

   ```bash
   # Install restic
   sudo apt-get install restic

   # Configure credentials
   export B2_ACCOUNT_ID="..."
   export B2_ACCOUNT_KEY="..."
   export RESTIC_REPOSITORY="b2:bucket:/backups"
   export RESTIC_PASSWORD="..."

   # List snapshots
   restic snapshots

   # Restore the latest snapshot
   restic restore latest --target /tmp/restore

   # Copy to Docker volumes (stop any services still using them first)
   sudo cp -r /tmp/restore/* /var/lib/docker/volumes/
   ```

4. **Redeploy Services**:

   ```bash
   # Deploy all stacks
   bash /workspace/homelab/scripts/deploy_all.sh

   # Verify deployment
   bash /workspace/homelab/scripts/validate_deployment.sh
   ```

5. **Verify Recovery**:

   - Check all services: `docker service ls`
   - Test Traefik routing: `curl https://your-domain.com`
   - Verify Portainer UI access
   - Check Grafana dashboards
   - Test Home Assistant
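
   Scanning `docker service ls` by eye is easy to get wrong after a full rebuild. A minimal sketch that flags any service whose running replica count is below its desired count:

   ```bash
   # Print any Swarm service that is missing replicas
   docker service ls --format '{{.Name}} {{.Replicas}}' | while read -r name replicas _; do
     running=${replicas%%/*}
     desired=${replicas##*/}
     if [ "$running" != "$desired" ]; then
       echo "DEGRADED: $name ($replicas)"
     fi
   done
   ```

   Anything it prints deserves a `docker service ps <name>` to see why tasks are not starting.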

---

## Backup Verification

### Monthly Backup Test

```bash
# List snapshots
restic snapshots

# Verify repository integrity by reading a random 10% of the data
restic check --read-data-subset=10%

# Test restore
mkdir -p /tmp/restore-test
restic restore <snapshot-id> --target /tmp/restore-test --include /path/to/critical/file

# Compare with the original
diff -r /tmp/restore-test /original/path
```

---

## Emergency Contacts & Resources

### Critical Information
- **Backblaze B2 Login**: Store credentials in a password manager
- **restic Password**: Store securely (it CANNOT be recovered)
- **Router Admin**: Keep credentials accessible
- **ISP Support**: Keep contact info handy

### Documentation URLs
- Docker Swarm: https://docs.docker.com/engine/swarm/
- Traefik: https://doc.traefik.io/traefik/
- Restic: https://restic.readthedocs.io/
- ZFS: https://openzfs.github.io/openzfs-docs/

---

## Recovery Checklists

### Pre-Disaster Preparation
- [ ] Verify backups are running daily (see the spot-check below)
- [ ] Test the restore procedure monthly
- [ ] Document all credentials
- [ ] Keep hardware spares (cables, drives)
- [ ] Maintain off-site config copies
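
One way to spot-check the first item, assuming the `restic-backup.timer` unit named in the post-recovery checklist and the restic environment variables from the restore section:

```bash
# Confirm the backup timer fired recently and is scheduled to fire again
systemctl list-timers restic-backup.timer

# Confirm the newest snapshot is actually recent
restic snapshots --latest 1
```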

### Post-Recovery Validation
- [ ] All nodes online: `docker node ls`
- [ ] All services running: `docker service ls`
- [ ] Health checks passing: `docker ps --filter health=healthy`
- [ ] DNS resolving correctly
- [ ] Monitoring active (Grafana accessible)
- [ ] Backups resumed: `systemctl status restic-backup.timer`
- [ ] fail2ban protecting: `fail2ban-client status`
- [ ] Network performance normal: `bash network_performance_test.sh`

---

## Automation for Faster Recovery

### Create Recovery USB Drive

```bash
# Copy all scripts and configs
mkdir -p /mnt/usb/homelab-recovery/docs
cp -r /workspace/homelab/* /mnt/usb/homelab-recovery/

# Include documentation
cp /workspace/homelab/docs/guides/* /mnt/usb/homelab-recovery/docs/

# Store credentials (encrypted)
# Use GPG or similar to encrypt sensitive files
```
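
For the encrypted credentials, symmetric GPG is one straightforward option. A sketch; the `credentials.env` filename is a placeholder:

```bash
# Encrypt a credentials file onto the USB drive (prompts for a passphrase)
gpg --symmetric --cipher-algo AES256 \
    --output /mnt/usb/homelab-recovery/credentials.env.gpg \
    ~/credentials.env

# To recover it later:
# gpg --decrypt /mnt/usb/homelab-recovery/credentials.env.gpg > credentials.env
```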

### Quick Deploy Script

```bash
# Run from the recovery USB
sudo bash /mnt/usb/homelab-recovery/scripts/deploy_all.sh
```

---

This guide should be reviewed and updated quarterly to ensure accuracy.