
Monitoring

Health Checks

Both services expose health and readiness endpoints.

Controller

Endpoint   Method   What it checks
/health/   GET      Database connectivity
/ready/    GET      Database + Redis + MinIO

curl -s http://localhost:8000/health/ | python -m json.tool
curl -s http://localhost:8000/ready/  | python -m json.tool
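
The same checks can be scripted, e.g. as a post-deploy smoke test. A minimal sketch using only the standard library; it assumes only what the table states, that a healthy endpoint answers HTTP 200:

```python
from urllib.request import urlopen


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, timeout, etc.
        return False


if __name__ == "__main__":
    for url in ("http://localhost:8000/health/", "http://localhost:8000/ready/"):
        print(url, "OK" if probe(url) else "FAIL")
```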

Dispatcher

Endpoint   Method   What it checks
/health    GET      Service alive
/ready     GET      Database + Redis connectivity

curl -s http://localhost:8080/health | jq .
curl -s http://localhost:8080/ready  | jq .

Load Balancer Configuration

Use /health (Dispatcher) or /health/ (Controller) for liveness probes. Use /ready or /ready/ for readiness probes -- these check downstream dependencies and will fail if Postgres or Redis is unreachable.

# Kubernetes liveness/readiness probes
livenessProbe:
  httpGet:
    path: /health/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready/
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
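
Outside Kubernetes, the same endpoints work as Docker Compose healthchecks. A sketch, assuming curl is available in the image; the service name is illustrative:

```yaml
# docker-compose healthcheck (illustrative service name)
services:
  controller:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health/"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s
```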

Dashboards

Temporal UI

Access at http://localhost:8088 in local dev. Provides:

  • Workflow execution history and status
  • Signal delivery tracking
  • Activity task queue depth
  • Worker availability

MinIO Console

Access at http://localhost:9001 (dev credentials: minioadmin / minioadmin). Shows:

  • Brief and skill storage usage
  • Object lifecycle and versioning
  • Bucket access patterns

Controller Admin

Django admin at http://localhost:8000/admin provides:

  • Task list with status filtering
  • Dispatcher instance status and last heartbeat
  • Workflow instance status and stage progress
  • Skill and config version history

Dispatcher Operator UI

The Dispatcher serves an HTMX-based operator UI at http://localhost:8080. Features:

  • Agent gallery -- running agents as tiles with live status
  • noVNC links for visual agent tiers (terminal, browser, desktop)
  • Task telemetry -- duration, tokens used, exit codes

Key Metrics to Track

Controller

Metric                      Source                 Alert threshold
Task dispatch latency       Application logs       > 5s
Brief assembly time         Celery task duration   > 10s
Failed dispatches           Task status counts     > 5% failure rate
Pending workflow instances  Temporal UI            Growing unbounded
Database connection pool    PgBouncer stats        > 80% utilization
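
The 5% failure-rate threshold can be evaluated from task status counts. A hedged sketch; the status names and the counts dict are illustrative, not the Controller's actual schema:

```python
def failure_rate(status_counts: dict[str, int]) -> float:
    """Fraction of terminal tasks that failed. Status names are assumed."""
    failed = status_counts.get("failed", 0)
    succeeded = status_counts.get("succeeded", 0)
    total = failed + succeeded
    return failed / total if total else 0.0


def should_alert(status_counts: dict[str, int], threshold: float = 0.05) -> bool:
    """True when the failure rate exceeds the alert threshold."""
    return failure_rate(status_counts) > threshold
```

For example, `should_alert({"succeeded": 90, "failed": 10})` is True (10% > 5%), while 2 failures in 102 terminal tasks stays under the threshold.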

Dispatcher

Metric                    Source                     Alert threshold
Queue depth               Redis LLEN or stream XLEN  Growing unbounded
Active tasks              Redis task state           Approaching concurrency limit
Container spawn failures  Application logs           Any
Image pull latency        Application logs           > 60s (cold pull)
Monitor loop duration     Application logs           > configured interval
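
Queue depth can be polled with the same Redis commands the table names. A sketch assuming the redis-py client and an illustrative key name; XLEN applies when the queue is a stream, LLEN when it is a list:

```python
def queue_depth(r, key: str, is_stream: bool = False) -> int:
    """Depth of a Redis-backed queue: XLEN for streams, LLEN for lists."""
    return r.xlen(key) if is_stream else r.llen(key)


# Usage with redis-py (key name is illustrative):
#   import redis
#   r = redis.Redis(host="localhost", port=6379)
#   depth = queue_depth(r, "dispatch:queue", is_stream=True)
```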

Log Aggregation

Application Logs

Both services log to stdout in JSON format in production. Route stdout to your log aggregation system:

  • ECS/EKS -- container stdout is captured automatically by the log drivers.
  • Grafana Loki -- use Promtail to scrape container logs; label by service name.
  • Datadog -- use the Datadog agent with Docker or K8s autodiscovery.
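
For local development, or for a service not yet on a structured logger, a minimal stdlib JSON formatter looks like this. A sketch; the field names are illustrative, not Kohakku's actual log schema:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, suitable for log shippers."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(handlers=[handler], level=logging.INFO)
```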

Agent Logs

Agent logs follow two paths:

  1. Packaged session logs -- the agent writes a session log, packages it on completion, and ships it back as part of the result payload. The Dispatcher stores a reference to it.
  2. Cluster-level log forwarding -- container stdout is piped to the cluster's logging facility via runtime-level configuration. This is an infrastructure concern, not a Kohakku concern.

Alerting

Alert philosophy

Every alert must require a human response. If an alert doesn't need one, it's noise rather than signal -- turn it into a dashboard metric instead.

Critical Alerts

  • Controller /ready/ returns non-200 for > 2 minutes
  • Dispatcher /ready returns non-200 for > 2 minutes
  • Task failure rate exceeds threshold in a rolling window
  • Queue depth growing with no consumer progress
  • Temporal worker count drops to zero
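
The "non-200 for > 2 minutes" rule can be expressed as a sliding check over recent probe results. A sketch; the probe interval and window size are assumptions chosen to match the thresholds above:

```python
def sustained_failure(results: list[bool], probe_interval_s: int = 10,
                      window_s: int = 120) -> bool:
    """True if every probe in the trailing window failed (False = non-200)."""
    n = window_s // probe_interval_s  # number of probes covering the window
    if len(results) < n:
        return False                  # not enough history to alert yet
    return not any(results[-n:])      # alert only if all recent probes failed
```

Requiring the whole window to fail avoids paging on a single flapped probe.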

Warning Alerts

  • Brief assembly p95 latency exceeds 10 seconds
  • Dispatcher heartbeat missed (stale dispatcher in registry)
  • Object storage usage approaching quota
  • Database connection pool > 80% utilized