BLUF · Recommendation
LangGraph-shaped: structured 7-node spine, LLM inside nodes, interrupt-based tasks.
Pick a structured agent graph as the contract, give the LLM tool-use latitude inside nodes, and treat every human decision as a first-class interrupt that checkpoints the run. Not a pure LLM loop (it sacrifices the auditability Lamosa's credit-override flow needs). Not a heavyweight DAG engine (too slow for Week 1). Supabase carries the state; Python workers run the graph; Claude lives inside the nodes that need it.
Convergence · Four parallel research streams
What all four agents agreed on.
Four subagents ran in parallel: prior art (Humanloop, Braintrust, Dust, Vellum, LangSmith, Inngest, Gumloop, n8n), red-team of a 4-primitive spine, three architecture variants, and a data-model pressure test. They converged on five decisions.
| Dimension | Converged answer | Where it came from |
|---|---|---|
| Execution model | Durable graph with checkpointed state per step. Nodes typed, topology explicit, LLM calls live inside specific nodes. | Agent A (LangGraph checkpointer), Agent B (7-node spine), Agent C (loop + pre/post hooks) |
| Human-in-the-loop | Interrupt pattern. Any node can raise. Task row = resume token. One primitive covers approval, ambiguity, exception. | Agent A (LangGraph interrupt()), Agent B (Review as edge), Agent D (tasks.run_id) |
| Tenant customization | Shared core graph. Per-tenant overrides live in files (prompts, connectors), not forks. Versioned with the agent. | Agent A (tenants/{slug}/...), Agent D (agent_versions + grounding_refs[]) |
| Grounding | Temporal. Grounding rows versioned by valid_from / valid_to. Runs reference catalog state at their start time, not "now". | Agent D (grounding_rows) |
| LLM layer | Claude via Anthropic SDK. Prompt caching on the system prompt. Extractor and Decide nodes get LLM; Normalize, Compose, Dispatch stay deterministic. | Agent A (prompt caching), Agent B (node typing), Agent C (hooks around LLM) |
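A minimal sketch of the prompt-caching split the table describes. It builds the request payload for the Anthropic SDK's `messages.create` with the stable system prompt marked cacheable and the per-tenant variables left hot; the function name `build_request` and the prompt split are illustrative, not the real client code.

```python
# Sketch: mark the stable system prompt cacheable via the Anthropic SDK's
# cache_control block; tenant variables stay uncached. Illustrative only.

def build_request(stable_system: str, tenant_vars: str, user_msg: str,
                  model: str = "claude-sonnet-4-20250514") -> dict:
    """Return kwargs for anthropic.Anthropic().messages.create(**kwargs)."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            # cached block: large, rarely edited; repeat runs pay the
            # cached input rate as long as its content hash is unchanged
            {"type": "text", "text": stable_system,
             "cache_control": {"type": "ephemeral"}},
            # hot block: per-tenant variables, never cached
            {"type": "text", "text": tenant_vars},
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

Any edit to `stable_system` busts the cache, which is exactly the failure mode in red-team attack #3.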
Decision Matrix
Three variants scored against our constraints.
Scored 0 to 5 against our actual situation (2 wildly different first tenants, Python, Claude, Week 1 target, tasks-as-primitive, small team, 10+ tenants in 12 months). LangGraph-shaped wins because it takes the top score on the two criteria we cannot compromise on: Week 1 ship and audit of the human decision.
| Criterion | Pure LLM loop (Variant 2) | LangGraph-shaped (recommended) | Heavy DAG engine (Prefect / Temporal / Dagster) |
|---|---|---|---|
| Week 1 buildable | | | |
| Auditability of HITL path | | | |
| Client #3 onboarding cost | | | |
| Aronlight + Lamosa both fit | | | |
| Observability / debug | | | |
| Migration path if we outgrow it | | | |
| Team size fit (1-2 eng) | | | |
The Spine
Seven nodes. Review as an edge. Tasks orthogonal.
The 4-primitive sketch (Source → Extractor → Grounder → Reviewer) broke when walked through both tenants end to end. Classify, Normalize, Decide, Compose, and Dispatch were missing. Review cannot be a terminal node because exceptions fire mid-pipeline. Tasks cannot be a Reviewer output because any node emits them. Final shape:
Inbound
│
▼
Source · ingest raw artifact (email + attachments, PDF, fax image, Excel, webhook)
│
▼
Classify · is this an RFQ / PO / amendment / cancellation / status inquiry? (cheap gate)
│
▼
Normalize · OCR fax, parse Excel, strip HTML, unify units (pre-LLM work)
│
▼
Extract · LLM → structured payload against a typed schema (candidate values, unresolved refs)
│
▼
Ground · resolve refs against tenant catalog + customer master + pricing (temporal lookup)
│
▼
Decide · business rules: credit limits, stock policy, approval threshold (LLM only when rules are ambiguous)
│
▼
Compose · assemble output artifact (quote PDF / OMS JSON / quote email)
│
▼
Dispatch · deliver + confirm + log (non-skippable post-hook)
│
▼
Outbound
┌────────────────────────────────────────────────────┐
│ Review edge. emitted by ANY node. │
│ Flow pauses, state checkpointed, task row created. │
│ Resume on human decision via task.resume_token. │
└────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ Tasks are orthogonal. │
│ One queue, cross-agent, cross-tenant within org. │
│ Question + reasoning + pinned context + options. │
└────────────────────────────────────────────────────┘
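The spine above can be sketched as data: a typed node registry plus an explicit linear topology, with the LLM flag confined to Extract and Decide. This is a minimal illustration, not the real `agent_runtime/graph.py`; the `Node` type and placeholder node bodies are assumptions.

```python
# Sketch of the spine as a typed node list. Real nodes do actual work;
# these placeholders only show the shape and where the LLM is allowed.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Node:
    name: str
    fn: Callable[[dict], dict]   # state in, state out
    uses_llm: bool               # True only for Extract and Decide

def classify(state: dict) -> dict:
    state["doc_type"] = "rfq"    # placeholder for the cheap classification gate
    return state

SPINE = [
    Node("source",    lambda s: s, uses_llm=False),
    Node("classify",  classify,    uses_llm=False),
    Node("normalize", lambda s: s, uses_llm=False),
    Node("extract",   lambda s: s, uses_llm=True),
    Node("ground",    lambda s: s, uses_llm=False),
    Node("decide",    lambda s: s, uses_llm=True),   # LLM escape hatch only
    Node("compose",   lambda s: s, uses_llm=False),
    Node("dispatch",  lambda s: s, uses_llm=False),  # non-skippable post-hook
]
```

Keeping the topology as plain data is what makes "every agent is a configured instance of one graph" possible later.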
Ground vs Decide split.
Ground answers "what is this?" (SKU resolution, customer lookup). Decide answers "what do we do?" (credit, stock, pricing policy). Keeps LLM-adjacent work separate from business rules. Decide is mostly deterministic with an LLM escape hatch when rules are ambiguous.
Compose vs Dispatch split.
Compose builds the artifact (quote PDF, OMS payload, email body). Dispatch delivers it and confirms receipt. Different failure modes, different retry logic. Dispatch is a non-skippable post-hook: the LLM cannot route around it.
Review is an edge, not a node.
Any node can raise a review task with interrupt() semantics. Flow pauses, state is checkpointed, a task row is written. Human resolves, flow resumes at the node the task specifies. One mechanism covers every HITL pattern.
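The interrupt-as-edge mechanism can be sketched end to end: a node raises, the executor checkpoints state into a task row keyed by a resume token, and a later human decision resumes at the interrupted node. Everything here (`Review`, the in-memory `TASKS` dict standing in for the tasks table) is illustrative, not the real `interrupts.py`.

```python
# Sketch of interrupt() semantics: raise → checkpoint → task row →
# resume on human decision. In-memory stand-ins for DB tables.
import uuid

class Review(Exception):
    def __init__(self, question: str, options: list[str]):
        self.question, self.options = question, options

TASKS: dict[str, dict] = {}  # stand-in for the tasks table

def run_graph(nodes, state, start=0):
    for i, (name, fn) in enumerate(nodes[start:], start):
        try:
            state = fn(state)
        except Review as r:
            token = str(uuid.uuid4())
            TASKS[token] = {               # task row = resume token
                "question": r.question, "options": r.options,
                "resume_node": i, "checkpoint": dict(state),
            }
            return {"status": "paused", "resume_token": token}
    return {"status": "done", "state": state}

def resume(nodes, token, decision):
    task = TASKS.pop(token)
    state = task["checkpoint"]
    state["human_decision"] = decision     # the node reads this on re-entry
    return run_graph(nodes, state, start=task["resume_node"])

# Demo node: raises until a human has answered the SKU question.
def ground(state):
    if "sku" not in state and "human_decision" not in state:
        raise Review("Which SKU?", ["SKU-A", "SKU-B"])
    state["sku"] = state.get("sku") or state["human_decision"]
    return state
```

One mechanism, whatever the node: approval, ambiguity, and exception all reduce to raise-checkpoint-resume.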
Tasks are cross-cutting, not terminal.
Tasks are not Reviewer's output. They are the cross-agent human queue. Any node emits. Any human queue consumes. Reviewer becomes "the default task handler" for approval-class tasks, not a spine position.
System Architecture
How the pieces fit.
┌────────────────────────────────────────────────────────────┐
│ Factor Web UI │
│ Next.js + Tailwind + Shadcn. Tasks inbox, Agent Builder. │
│ Auth via Supabase. Real-time task updates via Realtime. │
└──────────┬──────────────────────────────────────┬──────────┘
│ REST / Realtime │ REST
▼ ▼
┌─────────────────────────────────────┐ ┌──────────────────────────────────────────┐
│ Agent Runtime (Python) │ │ Supabase │
│ │ │ │
│ FastAPI ingress (webhooks, API) │ │ Postgres: orgs, agents, agent_versions, │
│ Graph executor │◄──►│ runs, steps, tasks, reviews, extractions,│
│ Node registry (typed) │ │ sources, grounding_docs, grounding_rows, │
│ Checkpointer (per super-step) │ │ events (append-only audit) │
│ Interrupt → task writer │ │ │
│ Prompt-cache-enabled Claude client │ │ Storage: raw emails, PDFs, attachments │
│ │ │ Auth: org-scoped RLS on every table │
│ │ │ Realtime: tasks channel → UI │
└──────┬────────────┬──────────┬────────┘ └──────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌────────────────────────────────┐
│ Claude │ │ OCR │ │ Tenant connectors (per org) │
│ API │ │ (mistral │ │ tenants/aronlight/odoo.py │
│ + cache │ │ OCR / │ │ tenants/aronlight/prompts/ │
│ │ │ tesser) │ │ tenants/lamosa/oms.py │
└──────────┘ └──────────┘ │ tenants/lamosa/prompts/ │
└────────────────────────────────┘
One graph, one runtime.
The 7-node spine is defined once in agent_runtime/graph.py. Every agent is a configured instance of this graph with tenant-specific prompts and connectors injected at run start.
Files, not forks.
Tenant customization lives in tenants/{slug}/: prompts as markdown, connectors as Python. Loaded by slug at run start. No core code changes per new tenant.
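A minimal sketch of the by-slug bundle load. The repo uses `config.yaml`; this sketch reads JSON instead purely to stay stdlib-only, and the function name `load_tenant` is an assumption.

```python
# Sketch: load a tenant bundle from tenants/{slug}/ at run start.
# Prompts are markdown files keyed by stem; config is JSON here only
# because the sketch avoids a YAML dependency (the repo uses config.yaml).
import json
from pathlib import Path

def load_tenant(root: Path, slug: str) -> dict:
    base = root / slug
    prompts = {p.stem: p.read_text() for p in (base / "prompts").glob("*.md")}
    config = json.loads((base / "config.json").read_text())
    return {"slug": slug, "prompts": prompts, "config": config}
```

Onboarding client #3 is then a new folder that this loader picks up by slug, with no core code change.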
Every step is durable.
After each node completes, full graph state is written to steps.checkpoint_jsonb. Process crash = resume from last checkpoint. Review task = resume when human resolves.
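The checkpoint-per-step contract can be sketched with an in-memory stand-in for the `steps` table; the `Checkpointer` class here is illustrative, not the real `checkpointer.py`.

```python
# Sketch of per-step durability: serialize full state after each node,
# resume from the most recent checkpoint after a crash or interrupt.
import json

class Checkpointer:
    def __init__(self):
        self.steps: list[dict] = []   # stand-in for the steps table

    def save(self, run_id: str, node: str, state: dict) -> None:
        # mirrors an INSERT into steps with checkpoint_jsonb
        self.steps.append({"run_id": run_id, "node": node,
                           "checkpoint_jsonb": json.dumps(state)})

    def latest(self, run_id: str):
        """Return (last completed node, its state) or (None, {})."""
        rows = [s for s in self.steps if s["run_id"] == run_id]
        if not rows:
            return None, {}
        last = rows[-1]
        return last["node"], json.loads(last["checkpoint_jsonb"])
```

Serializing through JSON is deliberate: it forces state to stay JSONB-shaped, which is also what the 256KB checkpoint budget (red-team attack #2) polices.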
Walkthroughs · Both tenants through the spine
Aronlight (quote-gen) end to end.
Inbound RFQ email from Grupo Ribeiro. Four line items, partial SKU hints, one ambiguous spec. The spine produces a quote in ~2 minutes. One task emitted for the ambiguous SKU. Review lives in the Tasks inbox.
- Source: raw email to `sources.raw_uri`; parsed envelope + headers to `sources.parsed_jsonb`.
- Classify: `rfq` at confidence 0.94 (alternate candidates: follow_up 0.04, complaint 0.01). Proceeds.
- Extract: `rfq_line[]`, 4 lines extracted. Line 3 raw: "8" PVC 40 SW Caps x 200" · candidate SKUs captured with rationale, not resolved.
- Output: `quote_draft_id` created, CRM activity written. Run completes.

Lamosa (PO-processing) end to end.
Inbound PO from a retailer Lamosa hasn't processed before. New layout, fax image, customer on credit hold. Two tasks emitted, routed to different human queues. The spine holds.
- Source: fax image to `sources.raw_uri`.
- Classify: `po` at confidence 0.88. Proceeds.

Data Model
One schema. Both tenants. No fork.
Ten tables. Multi-tenant from row one via org_id with RLS on every table. Agent versioning via immutable agent_versions so config changes never rewrite history. Grounding versioned temporally. Tasks first-class with a resume token. Events append-only for audit.
tenant layer
┌─────────┐     ┌──────────┐     ┌───────────────────┐
│ orgs    │◄────┤ agents   │────►│ agent_versions    │  immutable, versioned
└─────────┘     └──────────┘     │  config_jsonb     │
                                 │  prompts_jsonb    │
                                 │  tools_jsonb      │
                                 │  grounding_refs[] │
                                 └─────────┬─────────┘
runtime layer                              │
┌──────────┐     ┌──────────┐     ┌────────▼─────────┐
│ sources  │────►│ runs     │────►│ steps            │  checkpoint_jsonb per super-step
└──────────┘     │  status  │     │  node_type       │
                 │  conf    │     │  input_jsonb     │
                 └─────┬────┘     │  output_jsonb    │
                       │          └──────────────────┘
                       ├───────►┌──────────────┐
                       │        │ extractions  │  schema_name + schema_hash + payload_jsonb
                       │        └──────────────┘
                       └───────►┌──────────────┐
                                │ tasks        │  first-class, cross-agent human queue
                                │  question    │
                                │  reasoning   │
                                │  pinned_ctx  │
                                │  options[]   │
                                │  unblocks[]  │
                                │  impact      │
                                └──────┬───────┘
                                       │
                                       └───►┌─────────┐
                                            │ reviews │  decision + teach_payload
                                            └─────────┘
reference layer
┌─────────────────┐     ┌──────────────────┐
│ grounding_docs  │◄────┤ grounding_rows   │  temporal: valid_from, valid_to
│  (catalog,      │     │  external_id     │
│  customers,     │     │  payload_jsonb   │
│  pricing)       │     │  embedding       │
└─────────────────┘     └──────────────────┘
audit layer
┌─────────────┐
│ events      │  append-only. REVOKE UPDATE/DELETE. aggregate_type + aggregate_id + event_type
└─────────────┘
Schema tensions we resolved
1. Extraction shape varies per tenant. Aronlight needs RFQ line items pointing at SKUs. Lamosa needs PO header + lines + ship-to. Resolved: `extractions.schema_name` + `schema_hash` + free `payload_jsonb`. The schema registry lives in code, not the DB, so tenants extend without DDL.
2. Grounding source of truth differs. Aronlight pulls Odoo via API. Lamosa uploads a SKU master + customer list as files. Unified via `grounding_docs.source_system`. Runs reference catalog state at their start time via `grounding_rows.valid_from`.
3. Task granularity differs. Aronlight tasks are per-line ("which SKU?"). Lamosa tasks are per-PO-line or per-header ("customer not in master"). The same `tasks` shape works because `pinned_context_jsonb` carries the relevant slice and `impact_score` is computed per-tenant.
4. Agent versioning vs run history. `runs.agent_version_id` points at immutable `agent_versions`, never at mutable `agents`. Config changes never rewrite what happened.
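The temporal grounding rule (tension #2) can be made concrete with a small sketch: a run resolves a ref against the catalog row that was valid at the run's own start time, not "now". A pure-Python stand-in for the SQL over `grounding_rows`; the function name `lookup` is an assumption.

```python
# Sketch of the temporal grounding lookup over valid_from / valid_to.
# A row with valid_to = None is the currently open version.
from datetime import datetime

def lookup(rows: list[dict], external_id: str, as_of: datetime):
    """Return the grounding row valid at `as_of`, or None."""
    for r in rows:
        if (r["external_id"] == external_id
                and r["valid_from"] <= as_of
                and (r["valid_to"] is None or as_of < r["valid_to"])):
            return r
    return None
```

Passing the run's start timestamp as `as_of` is what makes a 3am post-mortem reproducible even after a catalog update lands at 9am.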
Tenant Layout
Shared core, per-tenant files.
The repo has one core graph. Each tenant lives in a folder. Adding client #3 is a new folder plus a few config rows, not a new service.
agent-platform/
agent_runtime/
graph.py # the 7-node spine, one definition for everyone
nodes/
source.py
classify.py
normalize.py
extract.py
ground.py
decide.py
compose.py
dispatch.py
checkpointer.py
interrupts.py # writes tasks table, reads resume token
tools/ # shared tool implementations
api/
main.py # FastAPI: webhooks, REST, task resolve
tenants/
aronlight/
prompts/
classify.md
extract_rfq.md
compose_quote.md
connectors/
odoo.py # catalog + customer pulls, quote push
smtp.py
config.yaml # thresholds, feature flags
lamosa/
prompts/
classify.md
extract_po.md
compose_oms.md
connectors/
oms.py # OMS push, customer master upload
ocr.py # fax OCR pipeline
config.yaml
tests/
migrations/ # Supabase schema migrations
Kill Criteria
What makes us rewrite this.
Every architecture has a breaking point. Naming ours now so we know when to act.
Concurrency scale.
~100 tenants with multi-hour concurrent runs. Our hand-rolled checkpointer becomes the bottleneck. Action: swap checkpointer for Temporal. Schema stays, runtime changes.
Visual authoring demand.
Tenants demand visual workflow authoring (Zapier-style builder). Action: wrap the graph with a builder UI; do not replace the engine. Vellum's path.
LLM latency per node.
Claude tool-use inside nodes exceeds per-step SLA at volume. Action: drop LLM-in-node for deterministic code at those nodes; retain LLM-as-judge for eval.
Kill criteria we are not worried about
1. Tenant wants a weird ERP. Handled by `tenants/{slug}/connectors/`. No core change.
2. New node type needed. Add to `agent_runtime/nodes/`, wire into the graph, bump the agent version.
3. Claude model upgrade. One SDK call. Prompts tested against the eval set before cutover.
Red Team · Attacks not covered by the 3 kill criteria
Seven sharp attacks. Five we harden against in Week 1.
After picking the architecture we spawned a dedicated adversarial agent with one job: find failure modes the named kill criteria don't already cover. Seven landed, ranked by damage if unmitigated. The top one forces a rewrite if retrofitted late, so it moves into Week 1 scope.
| # | Attack | Trigger scenario | Hits by | Mitigation |
|---|---|---|---|---|
| 1 | No replay determinism. Cannot reconstruct why a run produced wrong output. | Lamosa PO produced a wrong quote at 3am. Resume from checkpoint, but Claude is non-deterministic, Odoo state changed, temporal grounding rolled. Post-mortem is speculative. Client demands an explanation. | Month 3 | Week 1 scope. Record every LLM call, connector call, and grounding read into `events`. Replay mode uses recorded responses. Cheap now, a rewrite if retrofitted. |
| 2 | Checkpoint JSONB bloat. 40-page PDF in state = 8MB checkpoints. | First multi-page PDF tenant hits volume. 7 super-steps × 500 runs/day × 90-day retention = Postgres rows at hundreds of GB with TOAST pressure. | Month 2-3 | Never blob in state. Store artifacts in Supabase Storage, only URIs + hashes in the checkpoint. CI gate: `checkpoint_jsonb` < 256KB. Partition `steps` by month. |
| 3 | Prompt cache invalidation. One edit, 3x token bill, no error. | Engineer edits `tenants/aronlight/prompts/extract.md` to fix a bug. Cache hash changes. Next 48h: every Aronlight run pays full input token cost. Nobody notices because runs still succeed. Month-end bill is 3x forecast. | Week 3-4 | Split `stable_system.md` (cached) from `tenant_variables.md` (hot). Prompt-diff CI gate flags cache-busting edits. Dashboard: `cache_hit_rate_by_tenant`, alert < 70%. |
| 4 | Tenant folder merge conflicts. Filesystem config has no version field. | Second engineer joins. A edits `connectors/oms.py` Monday. B edits `prompts/classify.md` Tuesday. They merge. The prompt expects a connector field B didn't add. Wednesday prod fails with a KeyError that doesn't repro in staging. | Month 1-2 | Immutable hash-pinned tenant bundle. `agent_versions.tenant_bundle_hash` locks prompts + connectors + schema as one unit. No hot-reload of `tenants/*` in prod. |
| 5 | Zombie runs from abandoned interrupts. No SLA on the task queue. | Reviewer raises an interrupt on Decide. Task created. Reviewer quits, nobody reassigns. Run paused indefinitely. Three weeks later the customer asks "where's my quote?" | Week 4-6 | Every task has `sla_deadline`, `escalation_policy`, and a sweeper cron that reassigns or auto-decides-with-default after N hours. Task-aging dashboard per tenant. |
| 6 | Extraction schema drift. "Shared spine" is a lie when tenants have different docs. | Lamosa PO arrives with `lote`, `pallet_id`, `m2_per_box`, `tono` (ceramic shade batch). None fit Aronlight's extraction schema. | Month 2 | Per-tenant canonical schema versioned with `agent_version`. Store `raw_jsonb` + `canonical_jsonb` separately. Decide consumes only canonical. |
| 7 | RLS + admin queries collide. Cross-tenant reads bypass RLS or leak one row. | Ops needs "all failed runs in the last hour across tenants". RLS is org-scoped. Either use the service role and bypass RLS everywhere, or write a parallel non-RLS admin schema. A subtle policy bug leaks one Lamosa row into Aronlight during a JOIN. | Month 2 | Two access planes from day one: tenant plane (RLS via Supabase auth), ops plane (service role, audited). pgTAP test suite tries cross-tenant reads from every table on every migration. |
Five changes land in the scaffold before the first Aronlight run.
1. Event sourcing captures replay inputs. Every LLM call (prompt + response + model + params), every connector call (request + response), every grounding read (doc_id + version). `events.payload_jsonb` carries the record. Replay mode switches the client to read-from-events.
2. Checkpoint size budget. Artifacts live in Supabase Storage. `checkpoint_jsonb` carries URIs + SHA-256 hashes only. Runtime asserts < 256KB; CI fails the migration otherwise.
3. Immutable tenant bundles. `agent_versions.tenant_bundle_hash` is the SHA of the full `tenants/{slug}/` tree at version-bump time. Prod loads bundles by hash, never hot-reloads files.
4. Task SLA fields. `tasks.sla_deadline`, `tasks.escalation_policy_jsonb`. A sweeper cron runs every 15 min to reassign or auto-resolve stalled tasks per policy.
5. Two access planes. The tenant plane uses the Supabase anon key + RLS. The ops plane uses the service role with an audit event on every query. The pgTAP suite runs in CI.
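The event-sourcing change (item 1) can be sketched as a thin client wrapper: in live mode it records request/response pairs into an append-only list standing in for `events`; in replay mode it serves responses back from the record instead of re-calling. `RecordingClient` and its key scheme are illustrative assumptions.

```python
# Sketch of record/replay around an LLM or connector call. Requests are
# keyed by a hash of their canonical JSON; replay looks responses up by key.
import hashlib
import json

class RecordingClient:
    def __init__(self, call_fn, events: list, replay: bool = False):
        self.call_fn, self.events, self.replay = call_fn, events, replay

    def _key(self, request: dict) -> str:
        canon = json.dumps(request, sort_keys=True).encode()
        return hashlib.sha256(canon).hexdigest()

    def call(self, request: dict):
        key = self._key(request)
        if self.replay:
            # deterministic replay: serve the recorded response
            for e in self.events:
                if e["key"] == key:
                    return e["response"]
            raise KeyError(f"no recorded response for {key[:8]}")
        response = self.call_fn(request)   # live call
        self.events.append({"key": key, "request": request,
                            "response": response})
        return response
```

With every LLM, connector, and grounding call routed through something like this, a 3am post-mortem replays the exact run instead of speculating.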
Alternatives kept on the shelf
Two variants we can pivot to if the winner breaks.
Pure LLM-orchestrated loop.
Temporal + Dust-shape.
V2 Hazards
What the schema cannot express cleanly.
Flagged now, deferred to v2. Not blocking MVP.
1. Workspaces inside an org. A single tenant may want regional teams (Lamosa MX vs Lamosa PE). Needs a `workspaces` layer between `orgs` and everything else.
2. Events table will balloon. Once we log every tool call, `events` grows fast. Partition by month or split out a `run_steps` table.
3. Cross-tenant shared learnings. `teach_agent` is per-org today. No path yet for "all lighting mfrs benefit from this correction."
4. Embedding model versioning. `grounding_rows.embedding` has no model-version column. Changing embedding models silently breaks similarity search.
5. Task DAG queries. `unblocks_task_ids[]` works for shallow chains, not graph queries. v2: a `task_edges` table.
6. Soft-delete / GDPR. No tombstone pattern on `sources` / `grounding_rows`. Needed before EU tenants.
Research Sources
Prior art the decision rests on.
Public architecture posts, docs, and postmortems from platforms that already made these calls.