Architecture Decision Records

ADR 001: Runtime Backend Interface

Status: Accepted

Context: Kohakku needs to spawn agent containers on multiple infrastructure backends (local Docker, Kubernetes, AWS ECS). Each has different APIs and lifecycle management.

Decision: Define a Go Backend interface with five methods: Spawn, Status, Cancel, Logs, and ListRunning. All backends implement this interface; the Dispatcher selects a backend via the RUNTIME_BACKEND environment variable.

Consequences:

  • Adding a new backend requires only implementing the interface -- no changes to consumer, monitor, or API code
  • Testing uses mock implementations (dockerClient interface for local, fake clientset for K8s, ecsClient interface for ECS)
  • Port exposure, resource limits, and CMD override are part of SpawnRequest -- backends handle them differently but the consumer does not know which backend it is using
  • WorkerBackend (subprocess) does not support all features (ports, logs) -- it stubs them gracefully

ADR 002: Tenancy via FK Scoping

Status: Accepted

Context: Multi-tenancy needs data isolation between organizations. Two approaches: FK-based scoping (org FK on every model + middleware) or schema-per-tenant (Postgres schema isolation via django-tenants).

Decision: Implement FK-based scoping with TenantMiddleware that resolves request.organization and request.project from headers, session, or default membership. Schema-per-tenant is a future option behind a feature flag.

Consequences:

  • Every query that should be tenant-scoped must include .filter(organization=request.organization) -- missing a filter leaks data across tenants
  • TenantMiddleware resolves org from X-Organization header, session, or first membership (in that priority order)
  • Hierarchy: Organization -> Team -> Workspace -> Project. Tasks and WorkflowInstances scope to Project. Skills and Configs scope to Organization.
  • FK approach is simpler to implement, works with Django admin, and avoids migration-per-schema overhead

Data isolation risk

A missed filter is a data leak. All routes and queries must be audited for tenant scoping.
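The header/session/membership resolution priority can be sketched without Django; resolve_organization and its arguments are illustrative stand-ins for the actual TenantMiddleware:

```python
def resolve_organization(headers: dict, session: dict, memberships: list):
    """Resolve the request's organization, in priority order:
    1. explicit X-Organization header,
    2. organization stored in the session,
    3. the user's first membership (default)."""
    org = headers.get("X-Organization")
    if org:
        return org
    org = session.get("organization")
    if org:
        return org
    return memberships[0] if memberships else None
```

The middleware would then attach the result as request.organization, and every tenant-scoped query filters on it.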


ADR 003: Temporal for Workflow Orchestration

Status: Accepted

Context: Agent workflows (chained dispatch, fan-out, supervisor/worker, review loops) need durable execution with signal-based coordination, timeout handling, and retry policies.

Decision: Use Temporal.io as the workflow orchestration engine. Workflow definitions in Python using the Temporal SDK. Activities bridge to Django ORM for database operations and to the dispatch pipeline for task creation.

Consequences:

  • Workflows are deterministic -- no ORM calls, HTTP requests, or random numbers directly in workflow code. All side effects go through activities.
  • Signal-based coordination: task callbacks send Temporal signals, human gates wait for operator signals
  • Activities use run_in_executor to bridge synchronous Django ORM operations into async Temporal activities
  • 7 workflow patterns: TemplateExecution (generic), SingleDispatch, ChainedDispatch, FanOut, SupervisorWorker, ReviewCritiqueLoop, Advisor
  • Worker process runs separately from Django via temporal_worker.py entry point
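The run_in_executor bridge amounts to a small helper, since workflow determinism forbids calling the ORM directly. This is a sketch under the assumption that activities are plain async functions; the Django call in the comment is illustrative:

```python
import asyncio
from functools import partial


async def run_sync(fn, *args, **kwargs):
    """Bridge a blocking call (e.g. a synchronous Django ORM query)
    into an async Temporal activity by running it on the default
    thread-pool executor, keeping the event loop unblocked."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, partial(fn, *args, **kwargs))

# Inside an activity, a blocking ORM lookup would be wrapped like:
#     task = await run_sync(Task.objects.get, pk=task_id)
```

Side effects stay in activities; the workflow itself only awaits activity results and signals, which keeps replay deterministic.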

ADR 004: Pluggable Queue Source for Dispatch

Status: Accepted

Context: The Dispatcher consumes tasks from a Redis list (BRPOP). This limits horizontal scaling -- a list delivers each task to a single consumer and provides no acknowledgment or redelivery, so multi-Dispatcher deployments need a source that supports consumer-group fan-out.

Decision: Implement a pluggable queue.Source interface. Three implementations:

  • InternalSource -- Redis list (default)
  • RedisStreamSource -- Redis Streams with consumer groups
  • SQS -- stub for future use

Controller gains dispatch_mode (http/bus) on DispatcherInstance to choose between HTTP POST and direct queue publishing.

Consequences:

  • QUEUE_SOURCE=internal (default) preserves existing single-instance behavior
  • QUEUE_SOURCE=redis-stream enables multi-Dispatcher fan-out via XREADGROUP consumer groups
  • Controller's dispatch_mode=bus publishes directly to Redis Stream or SQS, bypassing the Dispatcher's HTTP API for task submission
  • Adding a new queue source requires implementing the Source.Dequeue() method
  • Redis Streams provide at-least-once delivery with acknowledgment (XACK after processing)
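The at-least-once semantics in the last bullet can be sketched with an in-memory stand-in. The real Source interface is Go, and the names below (publish, redeliver_pending) are illustrative, not the actual API:

```python
import itertools
from collections import deque
from typing import Optional, Protocol


class Source(Protocol):
    """Sketch of the pluggable queue source from ADR 004."""
    def dequeue(self) -> Optional[tuple]: ...   # (message_id, payload)
    def ack(self, message_id: str) -> None: ...


class InMemoryStreamSource:
    """Stand-in for RedisStreamSource: a message stays pending after
    delivery until acknowledged (cf. XREADGROUP / XACK)."""
    def __init__(self) -> None:
        self._queue: deque = deque()
        self._pending: dict = {}          # delivered but unacknowledged
        self._ids = itertools.count(1)

    def publish(self, payload: str) -> str:
        mid = str(next(self._ids))
        self._queue.append((mid, payload))
        return mid

    def dequeue(self):
        if not self._queue:
            return None
        mid, payload = self._queue.popleft()
        self._pending[mid] = payload
        return mid, payload

    def ack(self, message_id: str) -> None:
        self._pending.pop(message_id, None)

    def redeliver_pending(self) -> None:
        """Requeue unacked messages -- what a consumer-group claim
        does for a crashed consumer."""
        for mid, payload in self._pending.items():
            self._queue.append((mid, payload))
        self._pending.clear()
```

A consumer that crashes between dequeue and ack therefore sees the same message again on redelivery, which is the at-least-once guarantee.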

ADR 005: Namespace Isolation for K8s and ECS

Status: Accepted

Context: Agent containers must be isolated from the Dispatcher and Controller infrastructure.

Decision:

Kubernetes

  • Agents run in a dedicated agents namespace (configurable via K8S_NAMESPACE)
  • The Dispatcher's ServiceAccount has RBAC permissions only in the agents namespace (Jobs, Pods, Services, ConfigMaps)
  • Agents have no ServiceAccount token mounted (automountServiceAccountToken: false)
  • NetworkPolicy restricts agent-to-agent communication (optional, configurable)

ECS

  • Agents run with a dedicated task role that has no AWS API permissions by default
  • The security group allows outbound traffic only (egress to the Dispatcher endpoint and the internet)
  • The task execution role has ECR pull and CloudWatch Logs permissions only
  • The Dispatcher's IAM role can only RunTask/StopTask on the designated cluster

Consequences:

  • Agents cannot access K8s API, AWS APIs, or other infrastructure by default
  • Operators must explicitly grant permissions for skills that need them
  • The Dispatcher is the sole identity that can spawn and terminate agents
  • Terraform modules in terraform/ecs/ and terraform/gke/ enforce this at the infrastructure level
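On the Kubernetes side, the no-token rule comes down to a single field on the agent pod template. A minimal, hypothetical Job fragment (names are illustrative):

```yaml
# Agent Job pod template: no ServiceAccount token is mounted,
# so the agent cannot reach the Kubernetes API.
apiVersion: batch/v1
kind: Job
metadata:
  name: agent-task
  namespace: agents            # K8S_NAMESPACE
spec:
  template:
    spec:
      automountServiceAccountToken: false
      restartPolicy: Never
      containers:
        - name: agent
          image: agent:latest  # illustrative image name
```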

ADR 006: Encryption at Rest

Status: Accepted

Context: Briefs, skill packages, and secrets contain sensitive data (instructions, API keys, credentials). Data at rest must be encrypted in all storage layers.

Decision:

Database (Postgres)

  • Use Postgres TDE or managed database encryption (RDS, Cloud SQL)
  • Application-level: callback tokens stored as SHA-256 hashes, never plaintext
  • Secrets stored via the secretstore app using Fernet symmetric encryption (AES-128-CBC with HMAC-SHA256 authentication)
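Token hashing per the second bullet needs only the standard library; hash_token and verify_token are illustrative names, not the actual helpers:

```python
import hashlib
import hmac


def hash_token(token: str) -> str:
    """Store only the SHA-256 digest of a callback token, never the plaintext."""
    return hashlib.sha256(token.encode()).hexdigest()


def verify_token(presented: str, stored_hash: str) -> bool:
    """Compare the presented token's digest in constant time,
    so a leaked database row never reveals the token itself."""
    return hmac.compare_digest(hash_token(presented), stored_hash)
```

A database dump therefore exposes only 64-character hex digests, which cannot be replayed as callback tokens.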

Object Storage (MinIO/S3)

  • S3: enable server-side encryption (SSE-S3 or SSE-KMS)
  • MinIO: enable encryption with mc encrypt set or auto-encryption via environment variables
  • Brief content is already content-addressed (SHA-256 hash in key path)

Redis

  • Redis does not encrypt data at rest by default
  • For production: use Redis with TLS (in-transit) and encrypted EBS volumes (at-rest)
  • Sensitive data in Redis is transient (task state, progress, queue) -- not secrets

Secrets

  • secretstore.LocalEncryptedBackend uses Fernet (cryptography library) for local dev/test
  • Production: use AWSSecretsManagerBackend or VaultBackend -- secrets never stored in the application database
  • FERNET_SECRET_KEY setting required for local backend

Consequences:

  • No plaintext secrets in Postgres -- either hashed (tokens) or encrypted (secretstore)
  • S3/MinIO encryption is infrastructure-level, not application-level
  • Redis encryption requires TLS configuration -- not enabled in dev docker-compose
  • Operators must configure encryption for their deployment target