# Architecture Decision Records

## ADR 001: Runtime Backend Interface
Status: Accepted
Context: Kohakku needs to spawn agent containers on multiple infrastructure backends (local Docker, Kubernetes, AWS ECS). Each has different APIs and lifecycle management.
Decision: Define a Go `Backend` interface with five methods: `Spawn`, `Status`, `Cancel`, `Logs`, and `ListRunning`. All backends implement this interface; the Dispatcher selects a backend via the `RUNTIME_BACKEND` env var.
Consequences:
- Adding a new backend requires only implementing the interface -- no changes to consumer, monitor, or API code
- Testing uses mock implementations (a `dockerClient` interface for local, a fake clientset for K8s, an `ecsClient` interface for ECS)
- Port exposure, resource limits, and CMD override are part of `SpawnRequest` -- backends handle them differently, but the consumer does not know which backend it is using
- `WorkerBackend` (subprocess) does not support all features (ports, logs) -- it stubs them gracefully
## ADR 002: Tenancy via FK Scoping
Status: Accepted
Context: Multi-tenancy needs data isolation between organizations. Two approaches: FK-based scoping (org FK on every model + middleware) or schema-per-tenant (Postgres schema isolation via django-tenants).
Decision: Implement FK-based scoping with a `TenantMiddleware` that resolves `request.organization` and `request.project` from headers, session, or default membership. Schema-per-tenant remains a future option behind a feature flag.
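The resolution order can be sketched as a pure helper; the real `TenantMiddleware` wraps logic like this around Django's request object, and the names below are illustrative:

```python
def resolve_org(headers, session, memberships):
    """Return the org to scope this request to, or None.

    Priority matches the middleware: X-Organization header,
    then session, then the user's first membership.
    """
    # 1. An explicit X-Organization header wins.
    if headers.get("X-Organization"):
        return headers["X-Organization"]
    # 2. Fall back to the org pinned in the session.
    if session.get("organization_id"):
        return session["organization_id"]
    # 3. Default to the user's first membership, if any.
    return memberships[0] if memberships else None
```

Keeping the priority in one function makes the fallback chain auditable, which matters given the data-isolation risk noted below.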
Consequences:
- Every query that should be tenant-scoped must include `.filter(organization=request.organization)` -- missing a filter leaks data across tenants
- `TenantMiddleware` resolves the org from the `X-Organization` header, session, or first membership (in that priority order)
- Hierarchy: Organization -> Team -> Workspace -> Project. Tasks and WorkflowInstances scope to Project; Skills and Configs scope to Organization.
- FK approach is simpler to implement, works with Django admin, and avoids migration-per-schema overhead
> **Data isolation risk**
> A missed filter is a data leak. All routes and queries must be audited for tenant scoping.
## ADR 003: Temporal for Workflow Orchestration
Status: Accepted
Context: Agent workflows (chained dispatch, fan-out, supervisor/worker, review loops) need durable execution with signal-based coordination, timeout handling, and retry policies.
Decision: Use Temporal.io as the workflow orchestration engine. Workflow definitions in Python using the Temporal SDK. Activities bridge to Django ORM for database operations and to the dispatch pipeline for task creation.
Consequences:
- Workflows are deterministic -- no ORM calls, HTTP requests, or random numbers directly in workflow code. All side effects go through activities.
- Signal-based coordination: task callbacks send Temporal signals, human gates wait for operator signals
- Activities use `run_in_executor` to bridge synchronous Django ORM operations into async Temporal activities
- Seven workflow patterns: TemplateExecution (generic), SingleDispatch, ChainedDispatch, FanOut, SupervisorWorker, ReviewCritiqueLoop, Advisor
- Worker process runs separately from Django via a `temporal_worker.py` entry point
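The ORM bridge in activities can be sketched like this; in real code the async function would carry Temporal's `@activity.defn` decorator, and `_load_task_sync` here is a stand-in for a blocking ORM call:

```python
import asyncio
from functools import partial

def _load_task_sync(task_id):
    # Stand-in for a blocking Django ORM call,
    # e.g. Task.objects.get(pk=task_id).
    return {"id": task_id, "state": "queued"}

async def load_task(task_id):
    # Run the blocking call on the default thread pool so the
    # activity's event loop is never blocked.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, partial(_load_task_sync, task_id))

print(asyncio.run(load_task("t1")))  # {'id': 't1', 'state': 'queued'}
```

Note that this pattern belongs only in activities: workflow code itself must stay deterministic, so the ORM call never appears there.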
## ADR 004: Pluggable Queue Source for Dispatch
Status: Accepted
Context: The Dispatcher consumes tasks from a Redis list (BRPOP). This limits horizontal scaling -- only one consumer can dequeue from a list at a time. Multi-Dispatcher deployments need fan-out.
Decision: Implement a pluggable `queue.Source` interface. Three implementations:

- `InternalSource` -- Redis list (default)
- `RedisStreamSource` -- Redis Streams with consumer groups
- `SQS` -- stub for future use
The Controller gains a `dispatch_mode` field (`http`/`bus`) on `DispatcherInstance` to choose between HTTP POST and direct queue publishing.
Consequences:
- `QUEUE_SOURCE=internal` (default) preserves existing single-instance behavior
- `QUEUE_SOURCE=redis-stream` enables multi-Dispatcher fan-out via XREADGROUP consumer groups
- Controller's `dispatch_mode=bus` publishes directly to Redis Stream or SQS, bypassing the Dispatcher's HTTP API for task submission
- Adding a new queue source requires implementing the `Source.Dequeue()` method
- Redis Streams provide at-least-once delivery with acknowledgment (XACK after processing)
## ADR 005: Namespace Isolation for K8s and ECS
Status: Accepted
Context: Agent containers must be isolated from the Dispatcher and Controller infrastructure.
Decision:

### Kubernetes

- Agents run in a dedicated `agents` namespace (configurable via `K8S_NAMESPACE`)
- The Dispatcher's ServiceAccount has RBAC permissions only in the agents namespace (Jobs, Pods, Services, ConfigMaps)
- Agents have no ServiceAccount token mounted (`automountServiceAccountToken: false`)
- NetworkPolicy restricts agent-to-agent communication (optional, configurable)

### ECS

- Agents run with a dedicated task role that has no AWS API permissions by default
- Security group allows outbound only (egress to the Dispatcher endpoint + internet)
- Task execution role has ECR pull + CloudWatch Logs only
- The Dispatcher's IAM role can only RunTask/StopTask on the designated cluster
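On the Kubernetes side, the pod-level settings above amount to a fragment like this (a sketch; names, image, and resource values are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  namespace: agents            # or the value of K8S_NAMESPACE
  name: agent-task-example
spec:
  template:
    spec:
      automountServiceAccountToken: false   # no K8s API credentials in the agent
      restartPolicy: Never
      containers:
        - name: agent
          image: example/agent:latest
          resources:
            limits:
              cpu: "1"
              memory: 512Mi
```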
Consequences:
- Agents cannot access K8s API, AWS APIs, or other infrastructure by default
- Operators must explicitly grant permissions for skills that need them
- The Dispatcher is the sole identity that can spawn and terminate agents
- Terraform modules in `terraform/ecs/` and `terraform/gke/` enforce this at the infrastructure level
## ADR 006: Encryption at Rest
Status: Accepted
Context: Briefs, skill packages, and secrets contain sensitive data (instructions, API keys, credentials). Data at rest must be encrypted in all storage layers.
Decision:
### Database (Postgres)
- Use Postgres TDE or managed database encryption (RDS, Cloud SQL)
- Application-level: callback tokens stored as SHA-256 hashes, never plaintext
- Secrets stored via the `secretstore` app using Fernet symmetric encryption (AES-128-CBC)
### Object Storage (MinIO/S3)
- S3: enable server-side encryption (SSE-S3 or SSE-KMS)
- MinIO: enable encryption with `mc encrypt set` or auto-encryption via environment variables
- Brief content is already content-addressed (SHA-256 hash in key path)
### Redis
- Redis does not encrypt data at rest by default
- For production: use Redis with TLS (in-transit) and encrypted EBS volumes (at-rest)
- Sensitive data in Redis is transient (task state, progress, queue) -- not secrets
### Secrets
- `secretstore.LocalEncryptedBackend` uses Fernet (cryptography library) for local dev/test
- Production: use `AWSSecretsManagerBackend` or `VaultBackend` -- secrets never stored in the application database
- `FERNET_SECRET_KEY` setting required for local backend
Consequences:
- No plaintext secrets in Postgres -- either hashed (tokens) or encrypted (secretstore)
- S3/MinIO encryption is infrastructure-level, not application-level
- Redis encryption requires TLS configuration -- not enabled in dev docker-compose
- Operators must configure encryption for their deployment target