Production Readiness
Checklist for deploying Recursiv to production
Assumptions (defaults)
- Deployment model: multi-tenant SaaS
- Target scale (6-12 months): 50k DAU, 200 RPS sustained at peak, bursts up to 1k RPS
- Availability target: 99.9% monthly
- API latency targets: p95 < 300ms, p99 < 800ms
If these assumptions are wrong, update this doc and regenerate the plan.
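To make the availability target concrete, the arithmetic below converts it into a downtime budget. It is a minimal sketch that assumes a 30-day month; the function name `downtime_budget_minutes` is illustrative and not part of the repo.

```python
def downtime_budget_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for an availability target.

    Assumption: the "monthly" window is 30 days; adjust window_days if the
    SLO doc defines the window differently.
    """
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - availability)


# 99.9% over 30 days leaves roughly 43.2 minutes of downtime.
budget = downtime_budget_minutes(0.999)
print(f"monthly downtime budget: {budget:.1f} minutes")
```

Tightening the target to 99.99% would shrink the same budget to roughly 4.3 minutes, which is worth keeping in mind before revising the assumption.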
A+ Rubric
- Data & migrations
  - Fully repeatable migrations across environments (no manual drift fixes)
  - CI validates fresh DB migrate + rollback rehearsal
  - Backward compatible migrations by default
- Resilience & operations
  - SLOs/SLIs defined and alerting tied to them
  - Structured logs, metrics, traces with dashboards
  - Safe deployments (canary/blue-green) + rollback automation
- Security
  - Secrets management + rotation, least privilege
  - Continuous dependency scanning + patch automation
  - Explicit tests for auth/org scoping/admin surfaces
- Testing & coverage
  - Deterministic test DB; no shared state flakiness
  - CI runs unit + integration + smoke reliably
  - Coverage thresholds on critical paths
- Performance & scale
  - Load tests + budgets with regression gates
  - Query profiling and index review for hot paths
  - Backpressure/circuit breakers where needed
- Recovery & incident readiness
  - Backup/restore drills
  - Incident playbooks and on-call readiness
  - Chaos experiments for critical dependencies
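The rubric item on circuit breakers can be illustrated with a small sketch. This is not code from the repo: the `CircuitBreaker` class, its thresholds, and the injectable `clock` are assumptions chosen to show the general pattern (fail fast while a dependency is unhealthy, then allow a trial call after a cool-down).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,      # assumed default, tune per dependency
        reset_after_s: float = 30.0,     # assumed cool-down before a trial call
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency circuit is open")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound dependency call, for example `breaker.call(lambda: fetch_from_cache(key))`, where `fetch_from_cache` stands in for whatever client function the service actually uses.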
Current Baseline (confirmed in repo)
- CI runs typecheck, security dependency scan, db push, tests, and optional E2E smoke. See .github/workflows/ci.yml.
- Ops docs exist (observability, runbooks, backup/restore). See docs/operations/.
Gaps (to verify and close)
- SLOs/SLIs not defined in repo.
- No explicit load/perf test suite in CI.
- No explicit migration rollback verification in CI.
- Alerting/dashboards not codified in codebase.
- Coverage thresholds not enforced on critical paths.
Execution Plan (phased)
Phase 1: Baseline & targets
- Finalize SLO/SLI targets and publish an SLO doc
- Inventory critical user journeys and hot paths
Phase 2: Data & migrations
- Add migration integrity checks to CI (fresh DB migrate, schema drift detection)
- Add a rollback rehearsal (or backward-compat lint) for every migration
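One way a rollback rehearsal could work is sketched below: apply each migration to a fresh database, roll it back, and check that the schema returns to its prior state. The sketch uses an in-memory SQLite database as a stand-in, and the `MIGRATIONS` list with its table names is hypothetical; the project's real migration tooling and database engine are not assumed here.

```python
import sqlite3

# Hypothetical migrations for illustration: (name, up SQL, down SQL).
MIGRATIONS = [
    (
        "001_create_orgs",
        "CREATE TABLE orgs (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
        "DROP TABLE orgs",
    ),
    (
        "002_create_users",
        "CREATE TABLE users (id INTEGER PRIMARY KEY, "
        "org_id INTEGER NOT NULL REFERENCES orgs(id))",
        "DROP TABLE users",
    ),
]


def schema_snapshot(conn: sqlite3.Connection) -> list:
    """Return the current schema as a sorted list of (type, name, sql) rows."""
    return conn.execute(
        "SELECT type, name, sql FROM sqlite_master "
        "WHERE name NOT LIKE 'sqlite_%' ORDER BY type, name"
    ).fetchall()


def rehearse(migrations) -> list:
    """Apply, roll back, and re-apply each migration on a fresh database."""
    conn = sqlite3.connect(":memory:")
    try:
        for name, up_sql, down_sql in migrations:
            before = schema_snapshot(conn)
            conn.execute(up_sql)
            if schema_snapshot(conn) == before:
                raise AssertionError(f"{name}: up migration changed nothing")
            conn.execute(down_sql)
            if schema_snapshot(conn) != before:
                raise AssertionError(f"{name}: rollback did not restore the schema")
            conn.execute(up_sql)  # re-apply so the next migration builds on it
        return schema_snapshot(conn)
    finally:
        conn.close()


final_schema = rehearse(MIGRATIONS)
print([row[1] for row in final_schema])
```

The same snapshot comparison could double as a schema drift check: compare the snapshot of a freshly migrated database with the snapshot of the environment under test and fail the CI step on any difference.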
Phase 3: Reliability & observability
- Standardize structured logging fields
- Add traces + metrics for core API paths
- Define dashboards and alert thresholds
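As an illustration of standardized structured logging fields, the sketch below emits one JSON object per log line using Python's standard `logging` module. The field names in `STANDARD_FIELDS` (`request_id`, `org_id`, `route`) and the example route are assumptions, not the project's agreed schema.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical standard fields; the real list belongs in the logging standard.
STANDARD_FIELDS = ("request_id", "org_id", "route")


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Always emit every standard field so downstream queries can rely on them.
        for field in STANDARD_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Standard fields are passed through the stdlib `extra` mechanism.
logger.info(
    "request completed",
    extra={"request_id": "req-123", "org_id": "org-42", "route": "/v1/items"},
)
```

Emitting missing fields as `null` rather than omitting them keeps the log shape stable, which tends to make dashboards and alert queries simpler to write.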
Phase 4: Security
- Enforce secrets policy with automated checks
- Add tests for auth/org/admin paths
- Tighten dependency scanning/patch flow
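The kind of assertion that tests for auth/org/admin paths might make is sketched below. Everything here is hypothetical: the `User` and `Project` types, the `Forbidden` error, and the two handlers are stand-ins for the real API layer, included only to show the shape of cross-org and admin-only checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: str
    org_id: str
    is_admin: bool = False


@dataclass(frozen=True)
class Project:
    id: str
    org_id: str


class Forbidden(Exception):
    """Raised when a user may not access a resource."""


def get_project(user: User, project: Project) -> Project:
    # Org scoping: a user may only read resources in their own org.
    if project.org_id != user.org_id:
        raise Forbidden("cross-org access denied")
    return project


def delete_project(user: User, project: Project) -> bool:
    get_project(user, project)  # reuse the org check first
    if not user.is_admin:
        raise Forbidden("admin role required")
    return True


def expect_forbidden(action) -> None:
    try:
        action()
    except Forbidden:
        return
    raise AssertionError("expected Forbidden")


def test_cross_org_read_is_denied() -> None:
    outsider = User(id="u1", org_id="org-a")
    expect_forbidden(lambda: get_project(outsider, Project(id="p1", org_id="org-b")))


def test_admin_of_other_org_cannot_delete() -> None:
    other_admin = User(id="u2", org_id="org-a", is_admin=True)
    expect_forbidden(lambda: delete_project(other_admin, Project(id="p1", org_id="org-b")))


def test_member_cannot_delete_but_admin_can() -> None:
    project = Project(id="p1", org_id="org-a")
    expect_forbidden(lambda: delete_project(User(id="u3", org_id="org-a"), project))
    assert delete_project(User(id="u4", org_id="org-a", is_admin=True), project)
```

The second test is the one most often missed: admin rights in one org should never imply access to another org's resources.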
Phase 5: Performance
- Add load tests for top 5 endpoints
- Add query profiling and index reviews
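A load-test regression gate could compare measured latencies against the targets stated in the assumptions (p95 < 300ms, p99 < 800ms). The sketch below uses the nearest-rank percentile method on a list of latency samples; the function names and the idea of failing on any violation are assumptions about how such a gate might be wired up.

```python
# Latency budgets in milliseconds, taken from the assumptions in this doc.
BUDGETS_MS = {95: 300.0, 99: 800.0}


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def budget_violations(samples: list) -> list:
    """Return one message per percentile that is not strictly under its budget."""
    violations = []
    for pct, budget in BUDGETS_MS.items():
        observed = percentile(samples, pct)
        if observed >= budget:
            violations.append(
                f"p{pct}={observed:.0f}ms is not under the {budget:.0f}ms budget"
            )
    return violations


# Illustrative samples only: 90 fast requests and 10 slow ones.
samples_ms = [100.0] * 90 + [900.0] * 10
for message in budget_violations(samples_ms):
    print(message)
```

In CI, a non-empty result from `budget_violations` would fail the job, turning the latency targets into an enforced budget.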
Phase 6: Recovery
- Document backup/restore recovery time objective (RTO) and recovery point objective (RPO), and run drills
- Add chaos test playbooks for DB/cache/queue
Work Queue (first 5)
- Draft SLO doc with current target defaults
- Add CI step: fresh DB migrate + verify
- Add coverage thresholds for auth + permissions
- Add a minimal load test harness for top endpoints
- Add dashboard/alert definitions (placeholder JSON)
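For the placeholder alert definitions, one option is to generate the JSON from the targets in this doc so the two cannot drift apart. The sketch below is an assumption throughout: the schema (`name`, `metric`, `comparison`, `threshold`, `window`), the metric names, and the 5-minute evaluation window are invented placeholders and do not correspond to any specific monitoring tool.

```python
import json

# Targets copied from the assumptions section of this doc.
TARGETS = {"availability": 0.999, "p95_ms": 300, "p99_ms": 800}


def build_alert_definitions(targets: dict) -> dict:
    """Build a tool-agnostic placeholder document of alert rules."""
    return {
        "version": 1,
        "alerts": [
            {
                "name": "availability_below_slo",
                "metric": "availability_ratio",    # hypothetical metric name
                "comparison": "lt",
                "threshold": targets["availability"],
                "window": "30d",
            },
            {
                "name": "latency_p95_over_budget",
                "metric": "http_latency_p95_ms",   # hypothetical metric name
                "comparison": "gte",
                "threshold": targets["p95_ms"],
                "window": "5m",                    # assumed evaluation window
            },
            {
                "name": "latency_p99_over_budget",
                "metric": "http_latency_p99_ms",   # hypothetical metric name
                "comparison": "gte",
                "threshold": targets["p99_ms"],
                "window": "5m",                    # assumed evaluation window
            },
        ],
    }


print(json.dumps(build_alert_definitions(TARGETS), indent=2))
```

Once a monitoring stack is chosen, the same targets can feed that tool's native rule format instead of this placeholder.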