Production Readiness

Checklist and phased plan for deploying Recursiv to production.

Assumptions (defaults)

  • Deployment model: multi-tenant SaaS
  • Target scale (6-12 months): 50k DAU, 200 RPS sustained peak, bursts to 1k RPS
  • Availability target: 99.9% monthly
  • API latency targets: p95 < 300ms, p99 < 800ms

If these assumptions are wrong, update this doc and regenerate the plan.
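
To make the 99.9% monthly target concrete, the error budget it implies can be worked out directly. The sketch below is illustrative only: the function names are hypothetical, and a 30-day month is assumed.

```python
import math


def error_budget_minutes(availability_target: float, days: int = 30) -> float:
    """Minutes of full downtime allowed in the window at the given target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability_target)


def allowed_failed_requests(availability_target: float, total_requests: int) -> int:
    """Requests that may fail before a request-based availability SLO is breached."""
    # The small epsilon guards against floating-point error just below a whole number.
    return math.floor(total_requests * (1.0 - availability_target) + 1e-9)


if __name__ == "__main__":
    # 99.9% over a 30-day month leaves about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
```

If the availability target changes, the budget should be recomputed along with the rest of the plan.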

A+ Rubric

  1. Data & migrations
    • Fully repeatable migrations across environments (no manual drift fixes)
    • CI validates fresh DB migrate + rollback rehearsal
    • Backward compatible migrations by default
  2. Resilience & operations
    • SLOs/SLIs defined and alerting tied to them
    • Structured logs, metrics, traces with dashboards
    • Safe deployments (canary/blue-green) + rollback automation
  3. Security
    • Secrets management + rotation, least privilege
    • Continuous dependency scanning + patch automation
    • Explicit tests for auth/org scoping/admin surfaces
  4. Testing & coverage
    • Deterministic test DB; no shared state flakiness
    • CI runs unit + integration + smoke reliably
    • Coverage thresholds on critical paths
  5. Performance & scale
    • Load tests + budgets with regression gates
    • Query profiling and index review for hot paths
    • Backpressure/circuit breakers where needed
  6. Recovery & incident readiness
    • Backup/restore drills
    • Incident playbooks and on-call readiness
    • Chaos experiments for critical dependencies
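
Item 5 of the rubric mentions circuit breakers. As a rough illustration of the pattern (not a description of anything already in the repo), a minimal breaker might look like the sketch below. The class name, thresholds, and the injectable clock are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency calls are temporarily blocked")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the breaker.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each call to a critical dependency (DB, cache, queue) in breaker.call(...) and treat CircuitOpenError as a signal to shed load or serve a fallback.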

Current Baseline (confirmed in repo)

  • CI runs typecheck, a dependency security scan, db push, tests, and an optional E2E smoke test. See .github/workflows/ci.yml.
  • Ops docs exist (observability, runbooks, backup/restore). See docs/operations/.

Gaps (to verify and close)

  • SLOs/SLIs not defined in repo.
  • No explicit load/perf test suite in CI.
  • No explicit migration rollback verification in CI.
  • Alerting/dashboards not codified in codebase.
  • Coverage thresholds not enforced on critical paths.

Execution Plan (phased)

Phase 1: Baseline & targets

  • Finalize SLO/SLI targets and publish an SLO doc
  • Inventory critical user journeys and hot paths

Phase 2: Data & migrations

  • Add migration integrity checks to CI (fresh DB migrate, schema drift detection)
  • Add a rollback rehearsal (or backward-compat lint) for every migration
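
One way to picture the "fresh DB migrate + schema drift detection" check is sketched below. It uses an in-memory SQLite database purely as a stand-in, since this doc does not name the project's database or migration tool. The migration list format and all function names are assumptions.

```python
import sqlite3
from typing import Dict, Iterable, List

Schema = Dict[str, Dict[str, str]]  # table name -> column name -> declared type


def migrate_fresh(migrations: Iterable[str]) -> sqlite3.Connection:
    """Apply every migration, in order, to a brand-new in-memory database."""
    conn = sqlite3.connect(":memory:")
    for statement in migrations:
        conn.execute(statement)
    conn.commit()
    return conn


def snapshot_schema(conn: sqlite3.Connection) -> Schema:
    """Read the tables and column types that actually exist in the database."""
    schema: Schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (table,) in tables:
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        schema[table] = {col[1]: col[2].upper() for col in columns}
    return schema


def schema_drift(expected: Schema, actual: Schema) -> List[str]:
    """List every difference between the expected and the actual schema."""
    problems: List[str] = []
    for table in sorted(set(expected) | set(actual)):
        if table not in actual:
            problems.append(f"missing table: {table}")
            continue
        if table not in expected:
            problems.append(f"unexpected table: {table}")
            continue
        exp_cols, act_cols = expected[table], actual[table]
        for col in sorted(set(exp_cols) | set(act_cols)):
            if col not in act_cols:
                problems.append(f"missing column: {table}.{col}")
            elif col not in exp_cols:
                problems.append(f"unexpected column: {table}.{col}")
            elif exp_cols[col] != act_cols[col]:
                problems.append(f"type mismatch: {table}.{col}")
    return problems
```

In CI, the expected snapshot would come from a checked-in schema file, and a non-empty result from schema_drift would fail the build.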

Phase 3: Reliability & observability

  • Standardize structured logging fields
  • Add traces + metrics for core API paths
  • Define dashboards and alert thresholds
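
As an illustration of what "standardized structured logging fields" could mean in practice, the sketch below emits one JSON object per log line with a fixed set of keys. The field names (request_id, org_id, route) are hypothetical and not taken from the repo. The real set should come out of the Phase 1 inventory.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical standard fields; every log line carries them, even if null.
REQUIRED_FIELDS = ("request_id", "org_id", "route")


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with a stable set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in REQUIRED_FIELDS:
            # Values passed via `extra=` become attributes on the record.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("api")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("request done", extra={"request_id": "r-1", "org_id": "o-1", "route": "/health"})
```

Keeping the key set fixed makes it possible to build dashboards and alerts on fields such as org_id without per-service parsing rules.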

Phase 4: Security

  • Enforce secrets policy with automated checks
  • Add tests for auth/org/admin paths
  • Tighten dependency scanning/patch flow
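
The auth/org/admin tests could take roughly the shape below. This is a sketch against a made-up authorization function. Principal, Resource, can_access, and the role names are hypothetical stand-ins for whatever the codebase actually exposes, and the rule that admin rights never cross org boundaries is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    user_id: str
    org_id: str
    role: str = "member"  # hypothetical roles: "member" or "admin"


@dataclass(frozen=True)
class Resource:
    resource_id: str
    org_id: str


def can_access(principal: Principal, resource: Resource, admin_only: bool = False) -> bool:
    """Allow access only inside the caller's own org; admin surfaces also need the admin role."""
    if principal.org_id != resource.org_id:
        return False
    if admin_only and principal.role != "admin":
        return False
    return True


def test_member_can_read_own_org() -> None:
    member = Principal("u1", "org-a")
    assert can_access(member, Resource("r1", "org-a"))


def test_cross_org_access_denied_even_for_admin() -> None:
    admin = Principal("u2", "org-a", role="admin")
    assert not can_access(admin, Resource("r2", "org-b"))
    assert not can_access(admin, Resource("r2", "org-b"), admin_only=True)


def test_admin_surface_requires_admin_role() -> None:
    member = Principal("u1", "org-a")
    admin = Principal("u2", "org-a", role="admin")
    settings = Resource("org-settings", "org-a")
    assert not can_access(member, settings, admin_only=True)
    assert can_access(admin, settings, admin_only=True)
```

The useful property is that each denial case is stated explicitly, so a regression in scoping shows up as a named failing test rather than as a side effect elsewhere.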

Phase 5: Performance

  • Add load tests for top 5 endpoints
  • Add query profiling and index reviews
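
A minimal load harness with a regression gate might look like the following sketch. It drives any callable concurrently, computes nearest-rank percentiles, and compares them with the latency targets from the Assumptions section (p95 < 300ms, p99 < 800ms). The function names are hypothetical, and a real suite would call HTTP endpoints rather than a local function.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_load(target: Callable[[], object], requests: int, concurrency: int) -> List[float]:
    """Call `target` `requests` times across a thread pool; return latencies in ms."""
    def timed(_: int) -> float:
        start = time.perf_counter()
        target()
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(requests)))


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def budget_violations(latencies_ms: Sequence[float],
                      p95_budget_ms: float = 300.0,
                      p99_budget_ms: float = 800.0) -> List[str]:
    """Return one message per latency budget that the samples exceed."""
    violations: List[str] = []
    for pct, budget in ((95, p95_budget_ms), (99, p99_budget_ms)):
        observed = percentile(latencies_ms, pct)
        if observed >= budget:
            violations.append(f"p{pct} {observed:.1f}ms is not under {budget:.0f}ms")
    return violations
```

A CI job could fail whenever budget_violations returns a non-empty list, which turns the latency targets into a regression gate.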

Phase 6: Recovery

  • Document backup/restore targets (RTO: recovery time objective; RPO: recovery point objective) and run drills
  • Add chaos test playbooks for DB/cache/queue
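
To show how a restore drill could be scored against RTO/RPO, here is a small sketch. This doc does not set RTO or RPO values, so the targets are parameters. The DrillTimeline type and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class DrillTimeline:
    last_good_backup_at: datetime   # newest backup that restored cleanly
    failure_at: datetime            # simulated moment of data loss
    service_restored_at: datetime   # service healthy again after restore


def evaluate_drill(timeline: DrillTimeline, rpo: timedelta, rto: timedelta) -> List[str]:
    """Return a message for each objective the drill missed (empty list means pass)."""
    misses: List[str] = []
    # RPO bounds how much recent data may be lost: failure time minus last good backup.
    data_loss_window = timeline.failure_at - timeline.last_good_backup_at
    # RTO bounds how long the service may stay down: restored time minus failure time.
    downtime = timeline.service_restored_at - timeline.failure_at
    if data_loss_window > rpo:
        misses.append(f"RPO missed: lost {data_loss_window}, target {rpo}")
    if downtime > rto:
        misses.append(f"RTO missed: down {downtime}, target {rto}")
    return misses
```

Recording each drill in this form gives a history that shows whether recovery is getting faster or slower over time.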

Work Queue (first 5)

  1. Draft SLO doc with current target defaults
  2. Add CI step: fresh DB migrate + verify
  3. Add coverage thresholds for auth + permissions
  4. Add a minimal load test harness for top endpoints
  5. Add dashboard/alert definitions (placeholder JSON)
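
For item 3, a coverage gate on critical paths could be sketched as follows. The path prefixes, threshold values, and the (covered, total) input shape are assumptions. A real check would read whatever report the project's coverage tool produces.

```python
from typing import Dict, List, Tuple

# file path -> (covered lines, total lines); hypothetical input shape
FileCoverage = Dict[str, Tuple[int, int]]


def coverage_failures(files: FileCoverage, thresholds: Dict[str, float]) -> List[str]:
    """Check aggregate line coverage per path prefix against a minimum percentage."""
    failures: List[str] = []
    for prefix, minimum in sorted(thresholds.items()):
        covered = sum(c for path, (c, _) in files.items() if path.startswith(prefix))
        total = sum(t for path, (_, t) in files.items() if path.startswith(prefix))
        if total == 0:
            # A critical path with no measured files should fail loudly, not pass silently.
            failures.append(f"{prefix}: no measured lines")
            continue
        percent = 100.0 * covered / total
        if percent < minimum:
            failures.append(f"{prefix}: {percent:.1f}% is below {minimum:.1f}%")
    return failures
```

Gating on prefixes rather than on a single global number keeps the threshold strict for auth and permissions code without forcing the same bar on the whole repository.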