Database Backup and Disaster Recovery
Postgres
Automated backups
# Daily backup via cron (add to crontab)
0 2 * * * pg_dump -Fc -h localhost -U dbadmin kohakku-controller > /backups/controller-$(date +\%Y\%m\%d).dump
0 2 * * * pg_dump -Fc -h localhost -U postgres kohakku-dispatcher > /backups/dispatcher-$(date +\%Y\%m\%d).dump
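The cron entries above write one new dump per day and never delete old ones, so `/backups` grows without bound. A minimal pruning sketch follows; the function name `prune_old_backups` and the 7-day cutoff in the usage note are assumptions, not part of the original setup.

```shell
#!/bin/sh
# Hypothetical helper: delete *.dump files in a directory that are older
# than the given number of days, printing each file it removes.
prune_old_backups() {
  dir="$1"
  days="$2"
  # -mtime +N matches files last modified more than N whole days ago.
  find "$dir" -type f -name '*.dump' -mtime +"$days" -print -delete
}
```

For example, `prune_old_backups /backups 7` could run from a third cron entry scheduled after the dumps finish. Only files ending in `.dump` are touched, so Redis snapshots stored in the same directory are left alone.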
Point-in-time recovery
For managed databases (RDS, Cloud SQL), enable automated backups with:
- Retention: 7-30 days
- Backup window: during low-traffic hours
- Enable WAL archiving for PITR
Manual backup and restore
# Backup
pg_dump -Fc -h $POSTGRES_HOST -U $POSTGRES_USER $POSTGRES_DB > backup.dump
# Restore
pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB --clean backup.dump
# Restore specific tables
pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -t tasks_task backup.dump
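Because `--clean` drops existing objects before recreating them, it can help to confirm that a dump is readable before restoring from it. The sketch below is one way to do that; `verify_dump` is a hypothetical helper name. It relies on `pg_restore --list`, which reads only the archive's table of contents and does not connect to a database.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the file looks like a readable pg_dump archive.
verify_dump() {
  file="$1"
  # An empty or missing file usually means pg_dump failed before writing anything.
  if [ ! -s "$file" ]; then
    echo "missing or empty dump: $file" >&2
    return 1
  fi
  # --list prints the archive's table of contents; no database is touched.
  pg_restore --list "$file" >/dev/null
}
```

A restore could then be guarded as `verify_dump backup.dump && pg_restore ... --clean backup.dump`. This catches truncated or empty archives, but it does not prove that the data inside is complete.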
Redis
Persistence configuration (docker-compose)
Redis is configured with AOF persistence:
Backup
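# Trigger an RDB snapshot (BGSAVE runs in the background and returns immediately)
redis-cli BGSAVE
# Copy the dump file once the snapshot has finished (check with: redis-cli LASTSAVE)
cp /data/dump.rdb /backups/redis-$(date +%Y%m%d).rdb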
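`BGSAVE` returns as soon as the background save has started, so a copy taken straight afterwards may pick up the previous snapshot. A minimal sketch of waiting for completion is shown below: it polls `LASTSAVE` until the timestamp changes. The function name `snapshot_and_wait` and the overridable `REDIS_CLI` variable are assumptions made for illustration.

```shell
#!/bin/sh
# REDIS_CLI can be overridden to point at a wrapper (assumption for this sketch).
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Hypothetical helper: trigger BGSAVE, then wait until LASTSAVE changes
# or give up after the given number of seconds (default 60).
snapshot_and_wait() {
  timeout="${1:-60}"
  before=$("$REDIS_CLI" LASTSAVE) || return 1
  "$REDIS_CLI" BGSAVE >/dev/null || return 1
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    now=$("$REDIS_CLI" LASTSAVE) || return 1
    # LASTSAVE is a Unix timestamp in seconds; a change means a new snapshot landed.
    if [ "$now" != "$before" ]; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

The copy step can then be chained as `snapshot_and_wait 120 && cp /data/dump.rdb /backups/...`. Since `LASTSAVE` has one-second resolution, a snapshot that completes in the same second as the previous one would not be detected and the function would time out.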
Recovery
Redis data is reconstructible from Postgres — task state, queue depth, and progress data can be rebuilt. Redis loss is an operational inconvenience, not a data loss event. Running tasks will need to be re-registered by the monitor on restart.
MinIO / Object Storage
Backup briefs and skills
# Mirror to S3 (or another MinIO instance)
mc mirror local/kohakku s3/kohakku-backup
# Mirror specific prefixes
mc mirror local/kohakku/briefs s3/kohakku-backup/briefs
mc mirror local/kohakku/skills s3/kohakku-backup/skills
Recovery
For production: use S3 with versioning and cross-region replication instead of MinIO.
Disaster Recovery Procedures
Total database loss
- Restore Postgres from latest backup
- Restart Controller and Dispatcher — they reconnect automatically
- Redis will be rebuilt by the monitor (re-registers active containers)
- Running agent containers continue — they check back on their own schedule
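The first step needs the most recent dump on disk. Assuming the `controller-YYYYMMDD.dump` naming from the cron entries, the newest file is also the last one in sorted order, so it can be picked without parsing `ls` output. `latest_backup` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the newest "<prefix>-YYYYMMDD.dump" in a directory.
latest_backup() {
  dir="$1"
  prefix="$2"
  latest=""
  # Globs expand in sorted order, so the last match carries the latest date.
  for f in "$dir"/"$prefix"-*.dump; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    latest="$f"
  done
  [ -n "$latest" ] || return 1
  printf '%s\n' "$latest"
}
```

For example, `pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB --clean "$(latest_backup /backups controller)"` would restore the controller database from its newest dump. The function returns non-zero when no matching file exists, which lets a script stop before calling `pg_restore` with an empty path.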
Redis loss
- Restart Redis — it recovers from AOF/RDB
- If AOF is corrupted: start with empty Redis
- The monitor will re-register active containers from Postgres
- Queue depth resets — pending tasks in Postgres can be re-dispatched
MinIO / S3 loss
- Restore from backup mirror
- Briefs for completed tasks are not needed (results already recorded)
- Skills can be re-uploaded from source
- Running agents that already downloaded their brief are unaffected
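Restoring from the backup mirror is the backup command with source and target swapped. The sketch below assumes the `local` and `s3` aliases and the `kohakku-backup` bucket from the backup section; `restore_prefixes` and the overridable `MC` variable are hypothetical names.

```shell
#!/bin/sh
# MC can be overridden to point at a different mc binary (assumption for this sketch).
MC="${MC:-mc}"

# Hypothetical helper: mirror each named prefix from the backup back to the target.
# Usage: restore_prefixes SOURCE TARGET PREFIX...
restore_prefixes() {
  src="$1"
  dst="$2"
  shift 2
  for prefix in "$@"; do
    # Stop at the first failure so a partial restore is noticed.
    "$MC" mirror "$src/$prefix" "$dst/$prefix" || return 1
  done
}
```

For example, `restore_prefixes s3/kohakku-backup local/kohakku briefs skills` copies both prefixes back. Restoring only `skills` may be enough when every outstanding task has already completed, since briefs for completed tasks are not needed.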
Full cluster recovery
- Restore Postgres first (source of truth)
- Start Redis (AOF recovery or fresh)
- Start MinIO (restore from backup)
- Start Controller, Dispatcher, Temporal
- Running agents will time out and be cleaned up by the monitor
- Re-dispatch any in-progress tasks that were lost
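The ordering above matters only if each dependency is actually accepting connections before the next service starts. A small retry helper makes that explicit; `wait_until` is a hypothetical name, and the readiness commands in the usage note are assumptions about what is installed on the host.

```shell
#!/bin/sh
# Hypothetical helper: retry a command once per second until it succeeds
# or the timeout (in seconds) is reached.
# Usage: wait_until TIMEOUT COMMAND [ARGS...]
wait_until() {
  timeout="$1"
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

For example, `wait_until 60 pg_isready -h "$POSTGRES_HOST"` can gate the Redis and MinIO steps, and `wait_until 30 redis-cli PING` can gate the start of Controller, Dispatcher, and Temporal. If either call returns non-zero, the recovery script should stop rather than start services against a missing dependency.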