# Docker Swarm Stack Files - Review & Recommendations
## Overview
Reviewed 9 Docker Swarm stack files totaling ~24KB of configuration. Found **critical security issues**, configuration inconsistencies, and optimization opportunities.
---
## 🔴 Critical Issues
### 1. **Hardcoded Secrets in Plain Text**
**Files Affected**: [`full-stack-complete.yml`](file:///workspace/homelab/services/swarm/stacks/full-stack-complete.yml), [`monitoring-stack.yml`](file:///workspace/homelab/services/swarm/stacks/monitoring-stack.yml)
**Problems**:
```yaml
# Line 96: Paperless DB password in plain text
- PAPERLESS_DBPASS=paperless
# Line 98: Hardcoded secret key
- PAPERLESS_SECRET_KEY=change-me-please-to-something-secure
# Line 52: Grafana admin password exposed
- GF_SECURITY_ADMIN_PASSWORD=change-me-please
```
**Risk**: Anyone with access to the repo can read the credentials. Values passed as environment variables also appear in `docker service inspect` output and can leak into logs.
**Fix**: Use Docker secrets:
```yaml
secrets:
  paperless_db_password:
    external: true
  paperless_secret_key:
    external: true
  grafana_admin_password:
    external: true
services:
  paperless:
    secrets:
      - paperless_db_password
      - paperless_secret_key
    environment:
      - PAPERLESS_DBPASS_FILE=/run/secrets/paperless_db_password
      - PAPERLESS_SECRET_KEY_FILE=/run/secrets/paperless_secret_key
```
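The `grafana_admin_password` secret also has to be wired into its service, and the database side needs the same password. A minimal sketch, assuming the Grafana service is named `grafana` and that `paperless-db` runs the official `postgres` image (neither is confirmed by the reviewed files). Grafana reads a setting from a file when the variable name ends in `__FILE`, and the official Postgres image accepts `POSTGRES_PASSWORD_FILE`:

```yaml
services:
  grafana: # assumed service name
    secrets:
      - grafana_admin_password
    environment:
      # Note the double underscore before FILE (Grafana's convention)
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
  paperless-db: # assumed to be the official postgres image
    secrets:
      - paperless_db_password
    environment:
      - POSTGRES_PASSWORD_FILE=/run/secrets/paperless_db_password
```

External secrets must exist before the stack is deployed. They can be created on a manager node with `docker secret create <name> -`, which reads the value from stdin.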
### 2. **Missing Health Checks**
**Files Affected**: All stack files
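**Problem**: No services have health checks configured, meaning:
- Swarm can't detect containers that are running but unhealthy
- Swarm only replaces a task when its container exits, so a hung process is never restarted
- Traefik and the routing mesh may keep sending traffic to failing instances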
**Fix**: Add health checks to critical services:
```yaml
services:
  paperless:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
```
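The same pattern extends to the backing services. A sketch, assuming `paperless-db` is Postgres with a `paperless` database user and `paperless-redis` is the official Redis image (both assumptions; the review does not show these service definitions). Both official images ship the probe binaries used here:

```yaml
services:
  paperless-db:
    healthcheck:
      # pg_isready exits non-zero until Postgres accepts connections
      test: ["CMD-SHELL", "pg_isready -U paperless"]
      interval: 30s
      timeout: 5s
      retries: 5
  paperless-redis:
    healthcheck:
      # redis-cli ping succeeds once the server answers PONG
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 5s
      retries: 5
```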
### 3. **Incorrect node-exporter Command**
**File**: [`monitoring-stack.yml:111-114`](file:///workspace/homelab/services/swarm/stacks/monitoring-stack.yml#L111-L114)
**Problem**:
```yaml
command:
  - '--config.file=/etc/prometheus/prometheus.yml' # Wrong! This is for Prometheus
  - '--storage.tsdb.path=/prometheus' # Wrong!
```
**Fix**:
```yaml
command:
  - '--path.procfs=/host/proc'
  - '--path.rootfs=/rootfs'
  - '--path.sysfs=/host/sys'
  - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
```
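These `--path.*` flags only work if the matching host paths are mounted into the container. A sketch of the accompanying volumes, with mount targets chosen to match the flags; the actual mounts in `monitoring-stack.yml` are not shown in this review, so treat them as assumptions:

```yaml
services:
  node-exporter:
    volumes:
      - /proc:/host/proc:ro # matches --path.procfs
      - /sys:/host/sys:ro   # matches --path.sysfs
      - /:/rootfs:ro        # matches --path.rootfs
    deploy:
      mode: global # one exporter per node
```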
---
## ⚠️ High-Priority Warnings
### 4. **Missing Networks on Database Services**
**File**: [`full-stack-complete.yml`](file:///workspace/homelab/services/swarm/stacks/full-stack-complete.yml)
**Problem**: `paperless-db` (line 70) doesn't have a network defined, but Paperless tries to connect to it.
**Fix**:
```yaml
paperless-db:
  networks:
    - homelab-backend # Add this
```
### 5. **Resource Limits Too High for Pi 4**
**File**: [`full-stack-complete.yml`](file:///workspace/homelab/services/swarm/stacks/full-stack-complete.yml)
**Problem**: Services with `node.labels.leader == true` (Pi 4) have resource limits that may be too high:
- Paperless: 2GB memory (Pi 4 has 8GB total)
- Stirling-PDF: 2GB memory
- SearXNG: 2GB memory
- Combined: 6GB+ on one node
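**Fix**: Reduce limits or spread services across nodes. Swarm placement constraints only compare node attributes and labels with `==` or `!=`, so a memory check such as `node.memory.available > 2G` is not valid. Use `resources.reservations` instead, so the scheduler only places a task on a node with enough unreserved memory (values below are examples):
```yaml
deploy:
  placement:
    constraints:
      - node.labels.leader == true
  resources:
    limits:
      memory: 1G # Lower than the current 2G
    reservations:
      memory: 512M # Scheduler skips nodes without this much unreserved memory
```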
### 6. **Duplicate Portainer Definitions**
**Files**: [`portainer-stack.yml`](file:///workspace/homelab/services/swarm/stacks/portainer-stack.yml) vs [`tools-stack.yml`](file:///workspace/homelab/services/swarm/stacks/tools-stack.yml)
**Problem**: Portainer is defined in both files with different configurations:
- `portainer-stack.yml`: Uses agent mode with global agents
- `tools-stack.yml`: Uses socket mode (simpler but less scalable)
**Fix**: Pick one approach and remove the duplicate.
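If agent mode is the one kept, its general shape is sketched below, following the layout of Portainer's published agent stack. The network name `agent_network` and the image tags are illustrative assumptions, not values taken from `portainer-stack.yml`:

```yaml
services:
  agent:
    image: portainer/agent:2.33.4 # assumed to match the server tag
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      - agent_network
    deploy:
      mode: global # one agent per node
  portainer:
    image: portainer/portainer-ce:2.33.4
    # The server reaches every agent through the tasks.<service> DNS name
    command: -H tcp://tasks.agent:9001 --tlsskipverify
    networks:
      - agent_network
    deploy:
      placement:
        constraints:
          - node.role == manager
networks:
  agent_network:
    driver: overlay
```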
### 7. **Missing Traefik Network Declaration**
**File**: [`monitoring-stack.yml:38-44`](file:///workspace/homelab/services/swarm/stacks/monitoring-stack.yml#L38-L44)
**Problem**: Prometheus has Traefik labels but isn't on the `traefik-public` network.
**Fix**:
```yaml
prometheus:
  networks:
    - monitoring
    - traefik-public # Add this
```
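Attaching the service only works if the stack file also declares the network at the top level. A sketch, assuming `traefik-public` is an overlay network created outside this stack (for example by `networking-stack.yml`):

```yaml
networks:
  monitoring:
    driver: overlay
  traefik-public:
    external: true # created elsewhere; this stack only attaches to it
```

With Prometheus on two networks, Traefik may also need to be told which one to route over. In Traefik v2 that is the `traefik.docker.network=traefik-public` label; check the equivalent for the Traefik version in use.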
---
## 🟡 Medium-Priority Improvements
### 8. **Missing Restart Policies**
**Files Affected**: Most services
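**Problem**: Only Portainer sets an explicit restart policy. The other services fall back to Swarm's default (`condition: any` with unlimited attempts), so a crash-looping service restarts indefinitely with no bound or back-off.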
**Fix**: Add to all services:
```yaml
deploy:
  restart_policy:
    condition: on-failure
    delay: 5s
    max_attempts: 3
```
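Adding `window` tells Swarm how long to wait before deciding whether a restart succeeded, which matters for slow starters such as Paperless. A sketch with an illustrative value:

```yaml
deploy:
  restart_policy:
    condition: on-failure
    delay: 5s
    max_attempts: 3
    window: 120s # evaluation period for each restart attempt
```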
### 9. **Watchtower Interval Too Frequent**
**File**: [`full-stack-complete.yml:191`](file:///workspace/homelab/services/swarm/stacks/full-stack-complete.yml#L191)
**Problem**: `--interval 300` polls for new images every 5 minutes, far more often than updates are published, and it risks unplanned restarts at any time of day.
**Fix**: Change to hourly or daily:
```yaml
command: --cleanup --interval 86400 # Daily
```
### 10. **Missing Logging Configuration**
**Files Affected**: All
**Problem**: No log driver or limits configured. Logs can fill disk.
**Fix**: Set `logging` on each service. It is a service-level key, not a `deploy` option:
```yaml
services:
  paperless:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```
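Because the same limits apply to every service, a YAML anchor avoids repeating them. A sketch using a Compose extension field; the name `x-default-logging` is arbitrary, the `grafana` service name is assumed, and extension fields need file format 3.4 or later:

```yaml
x-default-logging: &default-logging
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"

services:
  paperless:
    logging: *default-logging # service-level key
  grafana:
    logging: *default-logging
```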
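### 11. **`version` Field Is Obsolete**
**Files Affected**: All
**Problem**: The top-level `version` field is obsolete in the Compose Specification, and `docker compose` ignores it. Version 3.9 is not deprecated relative to 3.8. However, `docker stack deploy` still parses the legacy v3 schema, and older CLI releases may reject a file that has no `version` line.
**Fix**: Keep `version: '3.8'` or `'3.9'` for files deployed with `docker stack deploy`. Remove the line only after confirming that the installed Docker CLI accepts files without it.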
---
## 🟢 Best Practice Recommendations
### 12. **Add Update Configs**
**Benefit**: Zero-downtime deployments
```yaml
deploy:
  update_config:
    parallelism: 1
    delay: 10s
    failure_action: rollback
    order: start-first
```
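`failure_action: rollback` relies on the service's `rollback_config`, which can be tuned the same way. A sketch with illustrative values; `monitor` is how long Swarm watches each updated task for failure:

```yaml
deploy:
  update_config:
    parallelism: 1
    delay: 10s
    monitor: 30s
    failure_action: rollback
    order: start-first
  rollback_config:
    parallelism: 1
    order: stop-first
```

`start-first` briefly runs two copies of a task, so it needs spare memory on the node and does not suit services whose volume cannot be opened by two instances at once. For `paperless-db`, `order: stop-first` is the safer choice.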
### 13. **Use Specific Image Tags**
**Files Affected**: Services using `:latest`
**Current**:
```yaml
image: portainer/portainer-ce:latest
image: searxng/searxng:latest
```
**Better**:
```yaml
image: portainer/portainer-ce:2.33.4
image: searxng/searxng:2024.11.20
```
**Good tags already used**: `full-stack-complete.yml` has several pinned versions ✓
### 14. **Add Labels for Documentation**
**Benefit**: Self-documenting infrastructure
```yaml
deploy:
  labels:
    - "com.homelab.description=Paperless document management"
    - "com.homelab.maintainer=@sj98"
    - "com.homelab.version=2.19.3"
```
### 15. **Separate Configs from Stacks**
**Problem**: Mixing config and stack definitions
**Current**: Prometheus config is external (good!)
**Recommendation**: Do the same for Traefik, Alertmanager configs
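A sketch of what this could look like for Alertmanager using Swarm configs. The service name, the config name `alertmanager_config`, and the file path are hypothetical:

```yaml
configs:
  alertmanager_config:
    file: ./alertmanager.yml # hypothetical path relative to the stack file
services:
  alertmanager:
    configs:
      - source: alertmanager_config
        target: /etc/alertmanager/alertmanager.yml
```

Swarm configs are immutable once created, so changing the file contents usually means giving the config a new name (for example a version suffix) and redeploying.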
### 16. **Add Dependency Ordering**
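**Current**: Some services use `depends_on`.
**Problem**: `docker stack deploy` ignores `depends_on`, so it does not order startup in Swarm. Dependent services must tolerate a missing database by retrying, backed by health checks and restart policies. Declaring it still documents intent and helps local `docker compose` runs:
```yaml
paperless:
  depends_on:
    - paperless-redis
    - paperless-db
```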
---
## 📋 Detailed File-by-File Analysis
### [`full-stack-complete.yml`](file:///workspace/homelab/services/swarm/stacks/full-stack-complete.yml)
**Good**:
- ✅ Proper network segmentation (traefik-public vs homelab-backend)
- ✅ Resource limits defined
- ✅ Node placement constraints
- ✅ Specific image tags for most services
**Issues**:
- 🔴 Hardcoded passwords (lines 96, 98)
- 🔴 No health checks
- ⚠️ paperless-db missing network
- ⚠️ Resource limits may be too high for Pi 4
**Score**: 6/10
---
### [`monitoring-stack.yml`](file:///workspace/homelab/services/swarm/stacks/monitoring-stack.yml)
**Good**:
- ✅ Proper monitoring network
- ✅ External configs for Prometheus
- ✅ Resource limits
**Issues**:
- 🔴 Hardcoded Grafana password (line 52)
- 🔴 node-exporter has wrong command (lines 111-114)
- ⚠️ Prometheus missing traefik-public network
- ⚠️ No health checks
**Score**: 5/10
---
### [`networking-stack.yml`](file:///workspace/homelab/services/swarm/stacks/networking-stack.yml)
**Good**:
- ✅ Uses secrets for DuckDNS token
- ✅ External volume for Let's Encrypt
- ✅ Proper network attachment
**Issues**:
- ⚠️ Traefik single replica (should be 2+ for HA)
- ⚠️ No health check
- ⚠️ whoami resource limits too strict
**Score**: 7/10
---
### [`portainer-stack.yml`](file:///workspace/homelab/services/swarm/stacks/portainer-stack.yml)
**Good**:
- ✅ Has restart policies!
- ✅ Supports both Windows and Linux agents
- ✅ Proper network setup
**Issues**:
- ⚠️ Duplicate of tools-stack.yml Portainer
- ⚠️ No health check
**Score**: 7/10
---
### [`tools-stack.yml`](file:///workspace/homelab/services/swarm/stacks/tools-stack.yml)
**Good**:
- ✅ All tools on manager node (correct)
- ✅ Resource limits defined
**Issues**:
- ⚠️ Duplicate Portainer definition
- ⚠️ lazydocker needs TTY, won't work in Swarm
- ⚠️ No restart policies
**Score**: 6/10
---
### [`node-exporter-stack.yml`](file:///workspace/homelab/services/swarm/stacks/node-exporter-stack.yml)
**Content** (created by us):
```yaml
version: '3.8'
services:
  node-exporter:
    image: prom/node-exporter:latest
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
    deploy:
      mode: global
```
**Good**:
- ✅ Global mode (runs on all nodes)
- ✅ Read-only host mount
**Issues**:
- ⚠️ Uses `:latest` tag
- ⚠️ No resource limits
- ⚠️ No health check
**Score**: 6/10
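A revised version addressing the three issues might look like the sketch below. The pinned tag, memory limit, and health check are illustrative assumptions: verify the current node-exporter release, and confirm that `wget` is available in the image before relying on the probe.

```yaml
version: '3.8'
services:
  node-exporter:
    image: prom/node-exporter:v1.8.2 # example pinned tag; check the current release
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
    healthcheck:
      # Assumes the image's busybox wget is present
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9100/metrics"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      mode: global
      resources:
        limits:
          memory: 128M # illustrative value
```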
---
## 🛠️ Recommended Action Plan
### Phase 1: Critical Security (Do Immediately)
1. ✅ Create Docker secrets for all passwords
2. ✅ Update stack files to use secrets
3. ✅ Fix node-exporter command
4. ✅ Add missing network to paperless-db
### Phase 2: Stability (Do This Week)
1. ⏭️ Add health checks to all services
2. ⏭️ Add restart policies
3. ⏭️ Fix Prometheus network
4. ⏭️ Remove duplicate Portainer
### Phase 3: Optimization (Do This Month)
1. ⏭️ Update all `:latest` tags to specific versions
2. ⏭️ Add update configs
3. ⏭️ Configure logging limits
4. ⏭️ Review resource limits
### Phase 4: Best Practices (Ongoing)
1. ⏭️ Add documentation labels
2. ⏭️ Separate configs from stacks
3. ⏭️ Set up monitoring alerts for service health
---
## 🎯 Summary Scores
| Stack File | Security | Stability | Best Practices | Overall |
|-----------|----------|-----------|----------------|---------|
| full-stack-complete.yml | 3/10 | 6/10 | 7/10 | **6/10** |
| monitoring-stack.yml | 4/10 | 5/10 | 6/10 | **5/10** |
| networking-stack.yml | 8/10 | 6/10 | 7/10 | **7/10** |
| portainer-stack.yml | 7/10 | 7/10 | 7/10 | **7/10** |
| tools-stack.yml | 7/10 | 5/10 | 6/10 | **6/10** |
| node-exporter-stack.yml | 7/10 | 5/10 | 6/10 | **6/10** |
| **Average** | **6.0/10** | **5.7/10** | **6.5/10** | **6.2/10** |
---
## 📝 Next Steps
Would you like me to:
1. **Create fixed versions** of the stack files with all critical issues resolved?
2. **Generate Docker secrets creation script** for all passwords?
3. **Add health checks** to all services?
4. **Consolidate duplicate configs** (e.g., remove duplicate Portainer)?
5. **Create a migration guide** for applying these changes safely?
Let me know which improvements you'd like me to implement!