BLUF · Recommendation
LangGraph-shaped: structured 7-node spine, LLM inside nodes, interrupt-based tasks.
Pick a structured agent graph as the contract, give the LLM tool-use latitude inside nodes, and treat every human decision as a first-class interrupt that checkpoints the run. Not a pure LLM loop (it sacrifices the auditability Lamosa's credit-override flow needs). Not a heavyweight DAG engine (too slow for Week 1). Supabase carries the state; Python workers run the graph; Claude lives inside the nodes that need it.
Convergence · Four parallel research streams
What all four agents agreed on.
Four subagents ran in parallel: prior art (Humanloop, Braintrust, Dust, Vellum, LangSmith, Inngest, Gumloop, n8n), red-team of a 4-primitive spine, three architecture variants, and a data-model pressure test. They converged on five decisions.
| Dimension | Converged answer | Where it came from |
|---|---|---|
| Execution model | Durable graph with checkpointed state per step. Nodes typed, topology explicit, LLM calls live inside specific nodes. | Agent A (LangGraph checkpointer), Agent B (7-node spine), Agent C (loop + pre/post hooks) |
| Human-in-the-loop | Interrupt pattern. Any node can raise. Task row = resume token. One primitive covers approval, ambiguity, exception. | Agent A (LangGraph interrupt()), Agent B (Review as edge), Agent D (tasks.run_id) |
| Tenant customization | Shared core graph. Per-tenant overrides live in files (prompts, connectors), not forks. Versioned with the agent. | Agent A (tenants/{slug}/...), Agent D (agent_versions + grounding_refs[]) |
| Grounding | Temporal. Grounding rows versioned by valid_from / valid_to. Runs reference catalog state at their start time, not "now". | Agent D (grounding_rows) |
| LLM layer | Claude via Anthropic SDK. Prompt caching on the system prompt. Extractor and Decide nodes get LLM; Normalize, Compose, Dispatch stay deterministic. | Agent A (prompt caching), Agent B (node typing), Agent C (hooks around LLM) |
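A minimal sketch of the prompt-caching split the table describes. It builds the request payload for the Anthropic SDK's `messages.create` with the stable system prompt marked cacheable and the per-tenant variables left hot; the function name `build_request` and the prompt split are illustrative, not the real client code.

```python
# Sketch: mark the stable system prompt cacheable via the Anthropic SDK's
# cache_control block; tenant variables stay uncached. Illustrative only.

def build_request(stable_system: str, tenant_vars: str, user_msg: str,
                  model: str = "claude-sonnet-4-20250514") -> dict:
    """Return kwargs for anthropic.Anthropic().messages.create(**kwargs)."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            # cached block: large, rarely edited; repeat runs pay the
            # cached input rate as long as its content hash is unchanged
            {"type": "text", "text": stable_system,
             "cache_control": {"type": "ephemeral"}},
            # hot block: per-tenant variables, never cached
            {"type": "text", "text": tenant_vars},
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

Any edit to `stable_system` busts the cache, which is exactly the failure mode in red-team attack #3.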
Decision Matrix
Three variants scored against our constraints.
Scored 0 to 5 against our actual situation (2 wildly different first tenants, Python, Claude, Week 1 target, tasks-as-primitive, small team, 10+ tenants in 12 months). LangGraph-shaped wins because it takes the top score on the two criteria we cannot compromise on: Week 1 ship and audit of the human decision.
| Criterion | Pure LLM loop (Variant 2) | LangGraph-shaped (recommended) | Heavy DAG engine (Prefect / Temporal / Dagster) |
|---|---|---|---|
| Week 1 buildable | | | |
| Auditability of HITL path | | | |
| Client #3 onboarding cost | | | |
| Aronlight + Lamosa both fit | | | |
| Observability / debug | | | |
| Migration path if we outgrow it | | | |
| Team size fit (1-2 eng) | | | |
The Spine
Seven nodes. Review as an edge. Tasks orthogonal.
The 4-primitive sketch (Source → Extractor → Grounder → Reviewer) broke when walked through both tenants end to end. Classify, Normalize, Decide, Compose, and Dispatch were missing. Review cannot be a terminal node because exceptions fire mid-pipeline. Tasks cannot be a Reviewer output because any node emits them. Final shape:
Inbound
│
▼
Source · ingest raw artifact (email + attachments, PDF, fax image, Excel, webhook)
│
▼
Classify · is this an RFQ / PO / amendment / cancellation / status inquiry? (cheap gate)
│
▼
Normalize · OCR fax, parse Excel, strip HTML, unify units (pre-LLM work)
│
▼
Extract · LLM → structured payload against a typed schema (candidate values, unresolved refs)
│
▼
Ground · resolve refs against tenant catalog + customer master + pricing (temporal lookup)
│
▼
Decide · business rules: credit limits, stock policy, approval threshold (LLM only when rules are ambiguous)
│
▼
Compose · assemble output artifact (quote PDF / OMS JSON / quote email)
│
▼
Dispatch · deliver + confirm + log (non-skippable post-hook)
│
▼
Outbound
┌────────────────────────────────────────────────────┐
│ Review edge. emitted by ANY node. │
│ Flow pauses, state checkpointed, task row created. │
│ Resume on human decision via task.resume_token. │
└────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ Tasks are orthogonal. │
│ One queue, cross-agent, cross-tenant within org. │
│ Question + reasoning + pinned context + options. │
└────────────────────────────────────────────────────┘
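The spine above can be sketched as data: a typed node registry plus an explicit linear topology, with the LLM flag confined to Extract and Decide. This is a minimal illustration, not the real `agent_runtime/graph.py`; the `Node` type and placeholder node bodies are assumptions.

```python
# Sketch of the spine as a typed node list. Real nodes do actual work;
# these placeholders only show the shape and where the LLM is allowed.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Node:
    name: str
    fn: Callable[[dict], dict]   # state in, state out
    uses_llm: bool               # True only for Extract and Decide

def classify(state: dict) -> dict:
    state["doc_type"] = "rfq"    # placeholder for the cheap classification gate
    return state

SPINE = [
    Node("source",    lambda s: s, uses_llm=False),
    Node("classify",  classify,    uses_llm=False),
    Node("normalize", lambda s: s, uses_llm=False),
    Node("extract",   lambda s: s, uses_llm=True),
    Node("ground",    lambda s: s, uses_llm=False),
    Node("decide",    lambda s: s, uses_llm=True),   # LLM escape hatch only
    Node("compose",   lambda s: s, uses_llm=False),
    Node("dispatch",  lambda s: s, uses_llm=False),  # non-skippable post-hook
]
```

Keeping the topology as plain data is what makes "every agent is a configured instance of one graph" possible later.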
Ground vs Decide split.
Ground answers "what is this?" (SKU resolution, customer lookup). Decide answers "what do we do?" (credit, stock, pricing policy). Keeps LLM-adjacent work separate from business rules. Decide is mostly deterministic with an LLM escape hatch when rules are ambiguous.
Compose vs Dispatch split.
Compose builds the artifact (quote PDF, OMS payload, email body). Dispatch delivers it and confirms receipt. Different failure modes, different retry logic. Dispatch is a non-skippable post-hook: the LLM cannot route around it.
Review is an edge, not a node.
Any node can raise a review task with interrupt() semantics. Flow pauses, state is checkpointed, a task row is written. Human resolves, flow resumes at the node the task specifies. One mechanism covers every HITL pattern.
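The interrupt-as-edge mechanism can be sketched end to end: a node raises, the executor checkpoints state into a task row keyed by a resume token, and a later human decision resumes at the interrupted node. Everything here (`Review`, the in-memory `TASKS` dict standing in for the tasks table) is illustrative, not the real `interrupts.py`.

```python
# Sketch of interrupt() semantics: raise → checkpoint → task row →
# resume on human decision. In-memory stand-ins for DB tables.
import uuid

class Review(Exception):
    def __init__(self, question: str, options: list[str]):
        self.question, self.options = question, options

TASKS: dict[str, dict] = {}  # stand-in for the tasks table

def run_graph(nodes, state, start=0):
    for i, (name, fn) in enumerate(nodes[start:], start):
        try:
            state = fn(state)
        except Review as r:
            token = str(uuid.uuid4())
            TASKS[token] = {               # task row = resume token
                "question": r.question, "options": r.options,
                "resume_node": i, "checkpoint": dict(state),
            }
            return {"status": "paused", "resume_token": token}
    return {"status": "done", "state": state}

def resume(nodes, token, decision):
    task = TASKS.pop(token)
    state = task["checkpoint"]
    state["human_decision"] = decision     # the node reads this on re-entry
    return run_graph(nodes, state, start=task["resume_node"])

# Demo node: raises until a human has answered the SKU question.
def ground(state):
    if "sku" not in state and "human_decision" not in state:
        raise Review("Which SKU?", ["SKU-A", "SKU-B"])
    state["sku"] = state.get("sku") or state["human_decision"]
    return state
```

One mechanism, whatever the node: approval, ambiguity, and exception all reduce to raise-checkpoint-resume.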
Tasks are cross-cutting, not terminal.
Tasks are not Reviewer's output. They are the cross-agent human queue. Any node emits. Any human queue consumes. Reviewer becomes "the default task handler" for approval-class tasks, not a spine position.
System Architecture
How the pieces fit.
┌────────────────────────────────────────────────────────────┐
│ Factor Web UI │
│ Next.js + Tailwind + Shadcn. Tasks inbox, Agent Builder. │
│ Auth via Supabase. Real-time task updates via Realtime. │
└──────────┬──────────────────────────────────────┬──────────┘
│ REST / Realtime │ REST
▼ ▼
┌─────────────────────────────────────┐ ┌──────────────────────────────────────────┐
│ Agent Runtime (Python) │ │ Supabase │
│ │ │ │
│ FastAPI ingress (webhooks, API) │ │ Postgres: orgs, agents, agent_versions, │
│ Graph executor │◄──►│ runs, steps, tasks, reviews, extractions,│
│ Node registry (typed) │ │ sources, grounding_docs, grounding_rows, │
│ Checkpointer (per super-step) │ │ events (append-only audit) │
│ Interrupt → task writer │ │ │
│ Prompt-cache-enabled Claude client │ │ Storage: raw emails, PDFs, attachments │
│ │ │ Auth: org-scoped RLS on every table │
│ │ │ Realtime: tasks channel → UI │
└──────┬────────────┬──────────┬────────┘ └──────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌────────────────────────────────┐
│ Claude │ │ OCR │ │ Tenant connectors (per org) │
│ API │ │ (mistral │ │ tenants/aronlight/odoo.py │
│ + cache │ │ OCR / │ │ tenants/aronlight/prompts/ │
│ │ │ tesser) │ │ tenants/lamosa/oms.py │
└──────────┘ └──────────┘ │ tenants/lamosa/prompts/ │
└────────────────────────────────┘
One graph, one runtime.
The 7-node spine is defined once in agent_runtime/graph.py. Every agent is a configured instance of this graph with tenant-specific prompts and connectors injected at run start.
Files, not forks.
Tenant customization lives in tenants/{slug}/: prompts as markdown, connectors as Python. Loaded by slug at run start. No core code changes per new tenant.
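A minimal sketch of the by-slug bundle load. The repo uses `config.yaml`; this sketch reads JSON instead purely to stay stdlib-only, and the function name `load_tenant` is an assumption.

```python
# Sketch: load a tenant bundle from tenants/{slug}/ at run start.
# Prompts are markdown files keyed by stem; config is JSON here only
# because the sketch avoids a YAML dependency (the repo uses config.yaml).
import json
from pathlib import Path

def load_tenant(root: Path, slug: str) -> dict:
    base = root / slug
    prompts = {p.stem: p.read_text() for p in (base / "prompts").glob("*.md")}
    config = json.loads((base / "config.json").read_text())
    return {"slug": slug, "prompts": prompts, "config": config}
```

Onboarding client #3 is then a new folder that this loader picks up by slug, with no core code change.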
Every step is durable.
After each node completes, full graph state is written to steps.checkpoint_jsonb. Process crash = resume from last checkpoint. Review task = resume when human resolves.
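The checkpoint-per-step contract can be sketched with an in-memory stand-in for the `steps` table; the `Checkpointer` class here is illustrative, not the real `checkpointer.py`.

```python
# Sketch of per-step durability: serialize full state after each node,
# resume from the most recent checkpoint after a crash or interrupt.
import json

class Checkpointer:
    def __init__(self):
        self.steps: list[dict] = []   # stand-in for the steps table

    def save(self, run_id: str, node: str, state: dict) -> None:
        # mirrors an INSERT into steps with checkpoint_jsonb
        self.steps.append({"run_id": run_id, "node": node,
                           "checkpoint_jsonb": json.dumps(state)})

    def latest(self, run_id: str):
        """Return (last completed node, its state) or (None, {})."""
        rows = [s for s in self.steps if s["run_id"] == run_id]
        if not rows:
            return None, {}
        last = rows[-1]
        return last["node"], json.loads(last["checkpoint_jsonb"])
```

Serializing through JSON is deliberate: it forces state to stay JSONB-shaped, which is also what the 256KB checkpoint budget (red-team attack #2) polices.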
Walkthroughs · Both tenants through the spine
Aronlight (quote-gen) end to end.
Inbound RFQ email from Grupo Ribeiro. Four line items, partial SKU hints, one ambiguous spec. The spine produces a quote in ~2 minutes. One task emitted for the ambiguous SKU. Review lives in the Tasks inbox.
- Source: raw email to `sources.raw_uri`; parsed envelope + headers to `sources.parsed_jsonb`.
- Classify: `rfq` at confidence 0.94 (alternate candidates: follow_up 0.04, complaint 0.01). Proceeds.
- Extract: `rfq_line[]`, 4 lines extracted. Line 3 raw: "8" PVC 40 SW Caps x 200" · candidate SKUs captured with rationale, not resolved.
- Output: `quote_draft_id` created, CRM activity written. Run completes.

Lamosa (PO-processing) end to end.
Inbound PO from a retailer Lamosa hasn't processed before. New layout, fax image, customer on credit hold. Two tasks emitted, routed to different human queues. The spine holds.
- Source: fax image to `sources.raw_uri`.
- Classify: `po` at confidence 0.88. Proceeds.

Data Model
One schema. Both tenants. No fork.
Ten tables. Multi-tenant from row one via org_id with RLS on every table. Agent versioning via immutable agent_versions so config changes never rewrite history. Grounding versioned temporally. Tasks first-class with a resume token. Events append-only for audit.
tenant layer
┌─────────┐     ┌──────────┐     ┌───────────────────┐
│ orgs    │◄────┤ agents   │────►│ agent_versions    │  immutable, versioned
└─────────┘     └──────────┘     │  config_jsonb     │
                                 │  prompts_jsonb    │
                                 │  tools_jsonb      │
                                 │  grounding_refs[] │
                                 └─────────┬─────────┘
runtime layer                              │
┌──────────┐     ┌──────────┐     ┌────────▼─────────┐
│ sources  │────►│ runs     │────►│ steps            │  checkpoint_jsonb per super-step
└──────────┘     │  status  │     │  node_type       │
                 │  conf    │     │  input_jsonb     │
                 └─────┬────┘     │  output_jsonb    │
                       │          └──────────────────┘
                       ├───────►┌──────────────┐
                       │        │ extractions  │  schema_name + schema_hash + payload_jsonb
                       │        └──────────────┘
                       └───────►┌──────────────┐
                                │ tasks        │  first-class, cross-agent human queue
                                │  question    │
                                │  reasoning   │
                                │  pinned_ctx  │
                                │  options[]   │
                                │  unblocks[]  │
                                │  impact      │
                                └──────┬───────┘
                                       │
                                       └───►┌─────────┐
                                            │ reviews │  decision + teach_payload
                                            └─────────┘
reference layer
┌─────────────────┐     ┌──────────────────┐
│ grounding_docs  │◄────┤ grounding_rows   │  temporal: valid_from, valid_to
│  (catalog,      │     │  external_id     │
│  customers,     │     │  payload_jsonb   │
│  pricing)       │     │  embedding       │
└─────────────────┘     └──────────────────┘
audit layer
┌─────────────┐
│ events      │  append-only. REVOKE UPDATE/DELETE. aggregate_type + aggregate_id + event_type
└─────────────┘
Schema tensions we resolved
1. Extraction shape varies per tenant. Aronlight needs RFQ line items pointing at SKUs. Lamosa needs PO header + lines + ship-to. Resolved: `extractions.schema_name` + `schema_hash` + free `payload_jsonb`. The schema registry lives in code, not the DB, so tenants extend without DDL.
2. Grounding source of truth differs. Aronlight pulls Odoo via API. Lamosa uploads a SKU master + customer list as files. Unified via `grounding_docs.source_system`. Runs reference catalog state at their start time via `grounding_rows.valid_from`.
3. Task granularity differs. Aronlight tasks are per-line ("which SKU?"). Lamosa tasks are per-PO-line or per-header ("customer not in master"). The same `tasks` shape works because `pinned_context_jsonb` carries the relevant slice and `impact_score` is computed per-tenant.
4. Agent versioning vs run history. `runs.agent_version_id` points at immutable `agent_versions`, never at mutable `agents`. Config changes never rewrite what happened.
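The temporal grounding rule (tension #2) can be made concrete with a small sketch: a run resolves a ref against the catalog row that was valid at the run's own start time, not "now". A pure-Python stand-in for the SQL over `grounding_rows`; the function name `lookup` is an assumption.

```python
# Sketch of the temporal grounding lookup over valid_from / valid_to.
# A row with valid_to = None is the currently open version.
from datetime import datetime

def lookup(rows: list[dict], external_id: str, as_of: datetime):
    """Return the grounding row valid at `as_of`, or None."""
    for r in rows:
        if (r["external_id"] == external_id
                and r["valid_from"] <= as_of
                and (r["valid_to"] is None or as_of < r["valid_to"])):
            return r
    return None
```

Passing the run's start timestamp as `as_of` is what makes a 3am post-mortem reproducible even after a catalog update lands at 9am.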
Tenant Layout
Shared core, per-tenant files.
The repo has one core graph. Each tenant lives in a folder. Adding client #3 is a new folder plus a few config rows, not a new service.
agent-platform/
agent_runtime/
graph.py # the 7-node spine, one definition for everyone
nodes/
source.py
classify.py
normalize.py
extract.py
ground.py
decide.py
compose.py
dispatch.py
checkpointer.py
interrupts.py # writes tasks table, reads resume token
tools/ # shared tool implementations
api/
main.py # FastAPI: webhooks, REST, task resolve
tenants/
aronlight/
prompts/
classify.md
extract_rfq.md
compose_quote.md
connectors/
odoo.py # catalog + customer pulls, quote push
smtp.py
config.yaml # thresholds, feature flags
lamosa/
prompts/
classify.md
extract_po.md
compose_oms.md
connectors/
oms.py # OMS push, customer master upload
ocr.py # fax OCR pipeline
config.yaml
tests/
migrations/ # Supabase schema migrations
Kill Criteria
What makes us rewrite this.
Every architecture has a breaking point. Naming ours now so we know when to act.
Concurrency scale.
~100 tenants with multi-hour concurrent runs. Our hand-rolled checkpointer becomes the bottleneck. Action: swap checkpointer for Temporal. Schema stays, runtime changes.
Visual authoring demand.
Tenants demand visual workflow authoring (Zapier-style builder). Action: wrap the graph with a builder UI; do not replace the engine. Vellum's path.
LLM latency per node.
Claude tool-use inside nodes exceeds per-step SLA at volume. Action: drop LLM-in-node for deterministic code at those nodes; retain LLM-as-judge for eval.
Kill criteria we are not worried about
1. Tenant wants a weird ERP. Handled by `tenants/{slug}/connectors/`. No core change.
2. New node type needed. Add to `agent_runtime/nodes/`, wire into the graph, bump the agent version.
3. Claude model upgrade. One SDK call. Prompts tested against the eval set before cutover.
Red Team · Attacks not covered by the 3 kill criteria
Seven sharp attacks. Five we harden against in Week 1.
After picking the architecture we spawned a dedicated adversarial agent with one job: find failure modes the named kill criteria don't already cover. Seven landed, ranked by damage if unmitigated. The top one forces a rewrite if retrofitted late, so it moves into Week 1 scope.
| # | Attack | Trigger scenario | Hits by | Mitigation |
|---|---|---|---|---|
| 1 | No replay determinism. Cannot reconstruct why a run produced wrong output. | Lamosa PO produced a wrong quote at 3am. Resume from checkpoint, but Claude is non-deterministic, Odoo state changed, temporal grounding rolled. Post-mortem is speculative. Client demands an explanation. | Month 3 | Week 1 scope. Record every LLM call, connector call, and grounding read into `events`. Replay mode uses recorded responses. Cheap now, a rewrite if retrofitted. |
| 2 | Checkpoint JSONB bloat. 40-page PDF in state = 8MB checkpoints. | First multi-page PDF tenant hits volume. 7 super-steps × 500 runs/day × 90-day retention = Postgres rows at hundreds of GB with TOAST pressure. | Month 2-3 | Never blob in state. Store artifacts in Supabase Storage, only URIs + hashes in the checkpoint. CI gate: `checkpoint_jsonb` < 256KB. Partition `steps` by month. |
| 3 | Prompt cache invalidation. One edit, 3x token bill, no error. | Engineer edits `tenants/aronlight/prompts/extract.md` to fix a bug. Cache hash changes. Next 48h: every Aronlight run pays full input token cost. Nobody notices because runs still succeed. Month-end bill is 3x forecast. | Week 3-4 | Split `stable_system.md` (cached) from `tenant_variables.md` (hot). Prompt-diff CI gate flags cache-busting edits. Dashboard: `cache_hit_rate_by_tenant`, alert < 70%. |
| 4 | Tenant folder merge conflicts. Filesystem config has no version field. | Second engineer joins. A edits `connectors/oms.py` Monday. B edits `prompts/classify.md` Tuesday. They merge. The prompt expects a connector field B didn't add. Wednesday prod fails with a KeyError that doesn't repro in staging. | Month 1-2 | Immutable hash-pinned tenant bundle. `agent_versions.tenant_bundle_hash` locks prompts + connectors + schema as one unit. No hot-reload of `tenants/*` in prod. |
| 5 | Zombie runs from abandoned interrupts. No SLA on the task queue. | Reviewer raises an interrupt on Decide. Task created. Reviewer quits, nobody reassigns. Run paused indefinitely. Three weeks later the customer asks "where's my quote?" | Week 4-6 | Every task has `sla_deadline`, `escalation_policy`, and a sweeper cron that reassigns or auto-decides-with-default after N hours. Task-aging dashboard per tenant. |
| 6 | Extraction schema drift. "Shared spine" is a lie when tenants have different docs. | Lamosa PO arrives with `lote`, `pallet_id`, `m2_per_box`, `tono` (ceramic shade batch). None fit Aronlight's extraction schema. | Month 2 | Per-tenant canonical schema versioned with `agent_version`. Store `raw_jsonb` + `canonical_jsonb` separately. Decide consumes only canonical. |
| 7 | RLS + admin queries collide. Cross-tenant reads bypass RLS or leak one row. | Ops needs "all failed runs in the last hour across tenants". RLS is org-scoped. Either use the service role and bypass RLS everywhere, or write a parallel non-RLS admin schema. A subtle policy bug leaks one Lamosa row into Aronlight during a JOIN. | Month 2 | Two access planes from day one: tenant plane (RLS via Supabase auth), ops plane (service role, audited). pgTAP test suite tries cross-tenant reads from every table on every migration. |
Five changes land in the scaffold before the first Aronlight run.
1. Event sourcing captures replay inputs. Every LLM call (prompt + response + model + params), every connector call (request + response), every grounding read (doc_id + version). `events.payload_jsonb` carries the record. Replay mode switches the client to read-from-events.
2. Checkpoint size budget. Artifacts live in Supabase Storage. `checkpoint_jsonb` carries URIs + SHA-256 hashes only. Runtime asserts < 256KB; CI fails the migration otherwise.
3. Immutable tenant bundles. `agent_versions.tenant_bundle_hash` is the SHA of the full `tenants/{slug}/` tree at version-bump time. Prod loads bundles by hash, never hot-reloads files.
4. Task SLA fields. `tasks.sla_deadline`, `tasks.escalation_policy_jsonb`. A sweeper cron runs every 15 min to reassign or auto-resolve stalled tasks per policy.
5. Two access planes. The tenant plane uses the Supabase anon key + RLS. The ops plane uses the service role with an audit event on every query. The pgTAP suite runs in CI.
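The event-sourcing change (item 1) can be sketched as a thin client wrapper: in live mode it records request/response pairs into an append-only list standing in for `events`; in replay mode it serves responses back from the record instead of re-calling. `RecordingClient` and its key scheme are illustrative assumptions.

```python
# Sketch of record/replay around an LLM or connector call. Requests are
# keyed by a hash of their canonical JSON; replay looks responses up by key.
import hashlib
import json

class RecordingClient:
    def __init__(self, call_fn, events: list, replay: bool = False):
        self.call_fn, self.events, self.replay = call_fn, events, replay

    def _key(self, request: dict) -> str:
        canon = json.dumps(request, sort_keys=True).encode()
        return hashlib.sha256(canon).hexdigest()

    def call(self, request: dict):
        key = self._key(request)
        if self.replay:
            # deterministic replay: serve the recorded response
            for e in self.events:
                if e["key"] == key:
                    return e["response"]
            raise KeyError(f"no recorded response for {key[:8]}")
        response = self.call_fn(request)   # live call
        self.events.append({"key": key, "request": request,
                            "response": response})
        return response
```

With every LLM, connector, and grounding call routed through something like this, a 3am post-mortem replays the exact run instead of speculating.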
Alternatives kept on the shelf
Two variants we can pivot to if the winner breaks.
Pure LLM-orchestrated loop.
Temporal + Dust-shape.
V2 Hazards
What the schema cannot express cleanly.
Flagged now, deferred to v2. Not blocking MVP.
1. Workspaces inside an org. A single tenant may want regional teams (Lamosa MX vs Lamosa PE). Needs a `workspaces` layer between `orgs` and everything else.
2. Events table will balloon. Once we log every tool call, `events` grows fast. Partition by month or split out a `run_steps` table.
3. Cross-tenant shared learnings. `teach_agent` is per-org today. No path yet for "all lighting mfrs benefit from this correction."
4. Embedding model versioning. `grounding_rows.embedding` has no model-version column. Changing embedding models silently breaks similarity search.
5. Task DAG queries. `unblocks_task_ids[]` works for shallow chains, not graph queries. v2: a `task_edges` table.
6. Soft-delete / GDPR. No tombstone pattern on `sources` / `grounding_rows`. Needed before EU tenants.
Research Sources
Prior art the decision rests on.
Public architecture posts, docs, and postmortems from platforms that already made these calls.