Integration tests · Unit tests · Coverage strategy

Test Plan Generator for AI-Generated Code

Convert your architecture into integration and unit test specifications — before you write a line of implementation. Test scenarios derived from your threat model, not invented from scratch.

✓ Threat-ID linked scenarios
✓ Multi-tenant isolation tests
✓ Billing webhook tests
✓ Auth flow tests
✓ Framework recommendations
✓ Tiered coverage targets

The problem with AI-generated tests

AI builders generate code. The tests are shallow when they exist at all.

When you ask an AI builder to add tests to a codebase, the result is usually test coverage theater: a suite that hits 70% coverage by testing that functions exist, that routes return 200, and that the happy path runs without throwing. The tests aren't wrong — they're just not testing the things that matter.

A test suite that doesn't verify tenant isolation is missing its most important test. A billing integration without a test for Stripe webhook replay attacks is a future incident waiting to happen. Session invalidation on password change is a four-line implementation detail that will never appear in an AI-generated test unless someone thought to ask for it specifically. These aren't corner cases — they're the failure modes that show up in production for real users.

The root problem is where AI-generated tests come from: the code. The agent reads the implementation and writes tests that confirm what the code already does. A test plan that comes from the threat model does something fundamentally different — it describes what the code needs to prevent, and the tests verify the prevention. That's the difference between "the login endpoint returns 200" and "a session from a deleted user is rejected on the next request."

For a SaaS application with real users and real money moving through it, the second kind of test is the one that matters. A Stripe webhook that delivers the same event twice must not double-upgrade a user. A tenant's D1 rows must be invisible to every other tenant — not by convention, but by test-verified query enforcement. A password change must immediately invalidate all existing sessions. These scenarios have to be specified before implementation, not discovered in production.

Stackbilder generates your test plan from the threat model, before the first line of implementation. It currently targets Cloudflare Workers scaffolds, where D1, KV, R2, and Durable Object bindings make for an unusually complex testing surface.

What AI-generated test suites typically miss

Stripe webhook replay

Same event_id delivered twice should not double-upgrade a user tier. Idempotency key check required.

Cross-tenant D1 access

Tenant A's session must not be able to query or modify tenant B's rows, even with direct record IDs.

Session invalidation

Password change or account ban must reject the next request from an active session — not at token expiry.

DO queue overflow

Burst requests beyond queue capacity must return 429, not time out silently and leave state partially modified.

Failed payment consistency

A Stripe payment failure mid-checkout must not leave the user's tier in a partially-upgraded state.

R2 path traversal

User-supplied file paths must not allow reading arbitrary bucket objects via ../../ traversal sequences.
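To make one of the scenarios above concrete, here is a minimal sketch of the DO queue overflow case: a bounded queue that rejects excess work with an explicit 429 and leaves existing state untouched. `BoundedQueue`, `EnqueueResult`, and `retryAfterSeconds` are hypothetical names chosen for illustration; they are not Cloudflare or Stackbilder APIs, and a real Durable Object would persist its queue rather than hold it in memory.

```typescript
// Illustrative sketch only: an in-memory bounded queue (assumed design, not a Cloudflare API).
type EnqueueResult = { status: 202 } | { status: 429; retryAfterSeconds: number };

class BoundedQueue<T> {
  private readonly items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Accept the item if there is room; otherwise reject without modifying state.
  tryEnqueue(item: T): EnqueueResult {
    if (this.items.length >= this.capacity) {
      return { status: 429, retryAfterSeconds: 1 };
    }
    this.items.push(item);
    return { status: 202 };
  }

  get size(): number {
    return this.items.length;
  }
}

// A burst of three jobs against a capacity of two.
const queue = new BoundedQueue<string>(2);
const results = ["a", "b", "c"].map((job) => queue.tryEnqueue(job).status);
// results is [202, 202, 429]; the queue still holds exactly two jobs.
```

The property a test would pin down is that the rejected request leaves `size` unchanged, so an overflow can never produce a half-applied write.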

What Stackbilder generates

Three test layers. Threat-linked scenarios. Coverage that means something.

Unit tests — Vitest + Miniflare

Worker logic in isolation. Miniflare simulates the Cloudflare runtime — D1, KV, R2, and Durable Objects — without deploying. Unit tests cover middleware behavior (auth rejection, role checking), KV key normalization, D1 query patterns, and DO request handling. The test plan specifies what each unit test is verifying and which threat item it addresses.

// vitest + miniflare · auth middleware
test("rejects request without session cookie", ...)
test("rejects expired session token", ...)
test("rejects session from deleted user", ...)
test("grants access with valid session", ...)
// T-001 coverage: session brute-force
test("rejects token below 128-bit entropy", ...)
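The first four cases above all reduce to one decision function. The sketch below shows roughly what such a check could look like, assuming hypothetical `Session` and `User` shapes and in-memory maps in place of KV or D1 lookups. It is not the generated stub, only an illustration of the behavior the unit tests verify.

```typescript
// Hypothetical shapes: field names are assumptions for illustration.
type Session = { token: string; userId: string; expiresAt: number };
type User = { id: string; deletedAt: number | null };
type AuthResult = { ok: true; userId: string } | { ok: false; status: 401 };

function authenticate(
  token: string | undefined,
  sessions: Map<string, Session>,
  users: Map<string, User>,
  now: number,
): AuthResult {
  const denied: AuthResult = { ok: false, status: 401 };
  if (!token) return denied; // no session cookie
  const session = sessions.get(token);
  if (!session || session.expiresAt <= now) return denied; // unknown or expired token
  const user = users.get(session.userId);
  if (!user || user.deletedAt !== null) return denied; // session belongs to a deleted user
  return { ok: true, userId: user.id };
}

// Sample data covering each case.
const now = 1_000;
const users = new Map<string, User>([
  ["u1", { id: "u1", deletedAt: null }],
  ["u2", { id: "u2", deletedAt: 900 }],
]);
const sessions = new Map<string, Session>([
  ["valid", { token: "valid", userId: "u1", expiresAt: 2_000 }],
  ["expired", { token: "expired", userId: "u1", expiresAt: 500 }],
  ["ghost", { token: "ghost", userId: "u2", expiresAt: 2_000 }],
]);

const outcomes = [undefined, "expired", "ghost", "valid"].map(
  (token) => authenticate(token, sessions, users, now).ok,
);
// outcomes is [false, false, false, true]
```

Passing `now` in as a parameter keeps the expiry check deterministic, which matters when the same test has to pass in CI and locally.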

Integration tests — D1 via wrangler

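Database behavior with real SQL. Integration tests run against an actual D1 instance via wrangler d1 execute in CI. They cover schema migration idempotency, row-level tenant isolation (the core multi-tenant guarantee, abbreviated RLS in the test names below), constraint validation, and the full Stripe billing state machine. D1 is SQLite-based and has no native row-level security, so this isolation has to come from tenant-scoped queries, which is exactly what the tests pin down. These tests exist specifically because D1 behavior in Miniflare and D1 behavior in production can diverge on edge cases.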

// D1 integration — multi-tenant isolation
test_rls_tenant_a_cannot_read_tenant_b()
test_rls_update_rejected_cross_tenant()
test_migration_idempotent_on_rerun()
// T-002 coverage: IDOR on D1 row updates
test_idor_update_rejected_by_tenant_id()
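The isolation tests above hinge on one pattern: every write carries the caller's tenant ID in its WHERE clause, and zero affected rows is reported as 404. The sketch below illustrates that under stated assumptions: the `title` column is hypothetical, and `applyUpdate` is an in-memory stand-in for the database so the example runs without D1.

```typescript
// The shape of a tenant-scoped statement (the `title` column is an assumption).
const TENANT_SCOPED_UPDATE =
  "UPDATE records SET title = ? WHERE id = ? AND tenant_id = ?";

type Row = { id: string; tenant_id: string; title: string };

// In-memory stand-in for running TENANT_SCOPED_UPDATE; returns the number of changed rows.
function applyUpdate(rows: Row[], tenantId: string, recordId: string, title: string): number {
  let changes = 0;
  for (const row of rows) {
    if (row.id === recordId && row.tenant_id === tenantId) {
      row.title = title;
      changes += 1;
    }
  }
  return changes;
}

// Zero changed rows maps to 404, so the response never confirms the record exists.
function statusForChanges(changes: number): 200 | 404 {
  return changes === 0 ? 404 : 200;
}

const tenantBRow: Row = { id: "r1", tenant_id: "tenant-b", title: "original" };
const table: Row[] = [tenantBRow];

// Tenant A targets Tenant B's record ID directly.
const crossTenantStatus = statusForChanges(applyUpdate(table, "tenant-a", "r1", "overwritten"));
// The owner performs the same update.
const ownerStatus = statusForChanges(applyUpdate(table, "tenant-b", "r1", "renamed"));
```

Against a real D1 instance the integration test would assert the same two things: the cross-tenant call reports no changes, and a follow-up read shows Tenant B's row intact.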

End-to-end — Playwright

Full user journeys with a real browser. E2E tests cover the auth flow (login → session cookie set → protected route accessible), billing (checkout redirect → Stripe → webhook → tier upgrade verified), file upload via R2 presigned URL, and session invalidation (password change → re-request rejected). Cookie handling in E2E tests requires a real browser — Miniflare's fetch doesn't set cookies in the same way.

// playwright · full auth + billing flow
test("login → cookie set → route auth", ...)
test("logout → session cookie cleared", ...)
test("checkout → webhook → tier upgraded", ...)
test("password change → session invalidated", ...)
// T-009 coverage: session invalidation
test("active session rejected post-ban", ...)
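The E2E cases for session invalidation assume the server can reject a session immediately rather than at token expiry. One common way to get that property is a per-account credential version. The sketch below shows the idea with hypothetical `Account` and `SessionToken` shapes; it is an assumed design for illustration, not a description of what any particular scaffold implements.

```typescript
// Hypothetical shapes for illustration.
type Account = { id: string; credentialVersion: number; banned: boolean };
type SessionToken = { accountId: string; credentialVersion: number };

// A session records the credential version that was current when it was issued.
function issueSession(account: Account): SessionToken {
  return { accountId: account.id, credentialVersion: account.credentialVersion };
}

// Bumping the version makes every previously issued session stale at once.
function changePassword(account: Account): void {
  account.credentialVersion += 1;
}

function isSessionActive(session: SessionToken, account: Account): boolean {
  return (
    !account.banned &&
    session.accountId === account.id &&
    session.credentialVersion === account.credentialVersion
  );
}

const account: Account = { id: "acct_1", credentialVersion: 1, banned: false };
const oldSession = issueSession(account);
const activeBeforeChange = isSessionActive(oldSession, account); // true

changePassword(account);
const activeAfterChange = isSessionActive(oldSession, account); // false

const newSession = issueSession(account);
account.banned = true;
const activeAfterBan = isSessionActive(newSession, account); // false
```

The Playwright test then only needs to observe the outcome: the very next request with the old cookie is rejected.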

Billing and idempotency tests

Stripe integration tests are a category of their own. They cover the Stripe webhook handler's signature verification, replay prevention (same event_id delivered twice must not cause a double-upgrade), tier downgrade at period end (subscription canceled → tier reverts at the right timestamp), and the consistency guarantee for failed checkouts (no partial tier mutations on payment failure). These scenarios don't appear in AI-generated test suites unless someone specifically asked — they appear in Stackbilder's test plan because T-005 and T-007 exist in your threat model.

// billing idempotency — T-005, T-007
test_webhook_replay_no_double_upgrade()
test_webhook_sig_required_reject_unsigned()
test_failed_checkout_no_tier_mutation()
test_cancel_tier_reverts_at_period_end()
// Coverage target: 90%+ on billing paths
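As a rough illustration of what test_webhook_replay_no_double_upgrade guards, the sketch below deduplicates events by ID before applying a tier change. `BillingState`, `StripeLikeEvent`, and the `userId` field are hypothetical; signature verification is assumed to have happened before `handle` is called, and the in-memory set stands in for durable storage.

```typescript
// Hypothetical event shape: real Stripe events carry `id` and `type`; `userId` is an assumption.
type StripeLikeEvent = { id: string; type: string; userId: string };
type Tier = "free" | "pro";

class BillingState {
  readonly processedEventIds = new Set<string>();
  readonly tiers = new Map<string, Tier>();
  upgradeCount = 0;

  // Apply an already-verified event at most once, keyed by event ID.
  handle(event: StripeLikeEvent): "applied" | "duplicate" | "ignored" {
    if (this.processedEventIds.has(event.id)) return "duplicate";
    this.processedEventIds.add(event.id);
    if (event.type !== "checkout.session.completed") return "ignored";
    this.tiers.set(event.userId, "pro");
    this.upgradeCount += 1;
    return "applied";
  }
}

const billing = new BillingState();
const checkoutEvent: StripeLikeEvent = {
  id: "evt_123",
  type: "checkout.session.completed",
  userId: "user_1",
};

const firstDelivery = billing.handle(checkoutEvent); // "applied"
const replayedDelivery = billing.handle(checkoutEvent); // "duplicate"
```

With persistent storage, recording the event ID and mutating the tier should commit together, so a crash between the two cannot lose an upgrade or apply it twice.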

Example output

A real test plan section — threat-linked, framework-specified.

This is the multi-tenant isolation section of a test plan generated for a Cloudflare Workers SaaS. Each scenario maps to a threat item and specifies exactly what's being verified and why.

.ai/test-plan.md — §3 Data Isolation

## §3 Multi-Tenant Data Isolation

Framework: Vitest + Miniflare (unit) · wrangler d1 execute (integration)

Threat refs: T-002 (IDOR), T-005 (cross-tenant read via KV)

Coverage target: 90%+ on all data access paths

### 3.1 D1 Row Isolation (Integration)

Scenario: Tenant A cannot read Tenant B's records via direct record ID.

Setup: Insert two tenants with separate rows in the records table.

Action: Authenticate as Tenant A; request Tenant B's record ID via GET /api/records/:id.

Expected: 404 (not 403 — don't confirm existence). Row must not appear in response.

Verify: Query D1 directly to confirm no data returned; check response body is empty.

### 3.2 D1 Row Mutation Isolation (Integration)

Scenario: Tenant A cannot update Tenant B's records via IDOR in PUT body.

Setup: Insert Tenant B's record. Authenticate as Tenant A.

Action: PUT /api/records/:id with Tenant B's record ID in the request body.

Expected: 404. Tenant B's record unchanged in D1 after the request.

Verify: Query D1 for Tenant B's record; assert original values intact.

### 3.3 KV Key Space Isolation (Unit)

Scenario: KV reads are prefixed by tenant_id; user input cannot escape tenant scope.

Action: Call the KV read helper with a key containing "../other-tenant" traversal.

Expected: Key is normalized; traversal prefix is stripped. KV get uses sanitized key.

Verify: Assert the KV get call received tenant_id + normalized_key, not raw input.

### 3.4 Session Does Not Carry Cross-Tenant State (Unit)

Scenario: Session object must not expose another tenant's data in any field.

Action: Create sessions for two tenants; load each session independently.

Expected: Each session object contains only that tenant's user_id and org_id.

Verify: Assert no cross-contamination across session objects in the same test run.
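For scenario 3.3, a key helper along the following lines would satisfy the stated expectation. This is a minimal sketch: the `tenantKey` name and the `tenant:<id>:` prefix format are assumptions, and it handles only `/`-separated traversal segments, not encoded or backslash variants.

```typescript
// Hypothetical helper: builds a tenant-scoped key and drops traversal segments.
function tenantKey(tenantId: string, rawKey: string): string {
  const safeSegments = rawKey
    .split("/")
    .filter((segment) => segment !== "" && segment !== "." && segment !== "..");
  if (safeSegments.length === 0) {
    throw new Error("empty key after normalization");
  }
  return `tenant:${tenantId}:${safeSegments.join("/")}`;
}

const plainKey = tenantKey("tenant-a", "profile");
// "tenant:tenant-a:profile"
const traversalKey = tenantKey("tenant-a", "../other-tenant/settings");
// "tenant:tenant-a:other-tenant/settings": still inside tenant-a's prefix
```

Because the tenant prefix is added by the helper and never taken from user input, the unit test can assert on the exact string passed to the KV get call.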

How it works

Threat model in. Three-layer test plan out.

Architecture + threat model

Stackbilder generates your threat model first. Each threat item — IDOR, session brute-force, webhook replay — becomes a test scenario requirement. The test plan is derived from the threats, not invented independently.

Test plan in 20ms

The TarotScript engine maps each threat and architectural constraint to test scenarios. Unit tests for middleware and query helpers. Integration tests for D1 isolation and billing state. E2E tests for auth flows and file handling. Framework is specified per layer. Deterministic — same architecture, same test plan.
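The traceability idea can be sketched as a pure function: given threat items and test scenarios, report which threats have no scenario. This is an illustration of the concept only, not the engine's implementation; the `Threat` and `Scenario` shapes and the `findUncoveredThreats` name are assumptions, while the threat IDs and test names are taken from the examples on this page.

```typescript
// Hypothetical shapes for illustration.
type Threat = { id: string; title: string };
type Scenario = { name: string; threatIds: string[] };

// Pure and order-independent: the same inputs always yield the same sorted output.
function findUncoveredThreats(threats: Threat[], scenarios: Scenario[]): string[] {
  const covered = new Set<string>();
  for (const scenario of scenarios) {
    for (const threatId of scenario.threatIds) covered.add(threatId);
  }
  return threats
    .filter((threat) => !covered.has(threat.id))
    .map((threat) => threat.id)
    .sort();
}

const threats: Threat[] = [
  { id: "T-009", title: "Session invalidation" },
  { id: "T-001", title: "Session brute-force" },
  { id: "T-002", title: "IDOR on D1 row updates" },
];
const scenarios: Scenario[] = [
  { name: "rejects token below 128-bit entropy", threatIds: ["T-001"] },
  { name: "test_idor_update_rejected_by_tenant_id", threatIds: ["T-002"] },
];

const uncovered = findUncoveredThreats(threats, scenarios);
// uncovered is ["T-009"]
```

A check like this could run in CI to fail the build when a threat item is added without a matching scenario.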

Implement against the spec

The test plan is a specification, not implementation. Use it to guide your test suite in Vitest, Jest, or Playwright — or let Pro generate stubs. The coverage targets tell you what matters and why, not just what percentage to aim for.

Common questions

Does Stackbilder generate test code or just a test plan?

Free tier generates a test plan — scenario specifications with expected inputs, outputs, and threat-item references. Pro adds LLM-generated test stubs: actual Vitest/Playwright code guided by the deterministic plan. The stubs are starting points, not complete implementations — you fill in the setup and assertion details for your specific data models.

How does the test plan relate to the threat model?

Every threat item in your model has a corresponding test scenario. T-002 (IDOR on D1 row updates) maps to test_idor_update_rejected_by_tenant_id. This means your test coverage actually verifies the mitigations you documented, not just that endpoints return 200. Traceability from threat to test is the core value — it's what makes coverage meaningful.

What testing frameworks does the test plan target?

For Cloudflare Workers scaffolds: Vitest with Miniflare for unit tests (worker logic, D1 query patterns, binding behavior), wrangler d1 execute against a real D1 instance for integration tests, and Playwright for E2E scenarios. The plan specifies why each framework applies to each test layer: D1 integration tests need a real D1 instance because Miniflare and production D1 can diverge on edge cases; cookie-based auth E2E tests need a real browser.

What coverage percentage do the generated test specs target?

Coverage targets are tiered by concern: auth and data access paths at 90%+, business logic (billing, role management) at 85%, error handling and edge cases at 70%. The plan explains the rationale — coverage theater (high percentage via trivial tests) is worse than deliberate targeting of high-risk paths.
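The tiering can be expressed as a small check that compares measured coverage per concern against its target. The sketch below uses the three percentages stated above; the `Concern` labels, the `CoverageReport` shape, and the measured numbers are made up for illustration.

```typescript
// Concern labels are assumptions; the thresholds are the tiers described above.
type Concern = "auth-data-access" | "business-logic" | "error-handling";

const coverageTargets: Record<Concern, number> = {
  "auth-data-access": 90,
  "business-logic": 85,
  "error-handling": 70,
};

type CoverageReport = { concern: Concern; percent: number };

// Return the concerns whose measured coverage falls below their tier's target.
function failingTiers(reports: CoverageReport[]): Concern[] {
  return reports
    .filter((report) => report.percent < coverageTargets[report.concern])
    .map((report) => report.concern);
}

// Made-up measurements for illustration.
const failing = failingTiers([
  { concern: "auth-data-access", percent: 92 },
  { concern: "business-logic", percent: 80 },
  { concern: "error-handling", percent: 70 },
]);
// failing is ["business-logic"]
```

A flat 80% gate would have passed this example on average while leaving business logic under target, which is the gap tiered targets are meant to expose.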

Can I use this test plan with Vitest, Jest, and Playwright?

Yes. The test plan is a specification, not framework-specific implementation. You can implement the scenarios in any test runner. For Cloudflare Workers, Vitest + Miniflare is the recommendation and the Pro tier generates stubs for it — but the plan itself is framework-agnostic.

Test plans that start from your threat model.

Free tier includes 3 scaffolds per month — test plan, threat model, and ADRs with every one. No credit card.