Factor
AdapttoAI Workspace · Architecture v0.1
Decision Doc · 2026-04-21

BLUF · Recommendation

LangGraph-shaped: structured 7-node spine, LLM inside nodes, interrupt-based tasks.

Pick a structured agent graph as the contract, give the LLM tool-use latitude inside nodes, and treat every human decision as a first-class interrupt that checkpoints the run. Not a pure LLM loop (it sacrifices the auditability Lamosa's credit-override flow needs). Not a heavyweight DAG engine (too slow for a Week 1 ship). Supabase carries the state; Python workers run the graph; Claude lives inside the nodes that need it.

2 tenants · Week 1 ship · 10+ tenants, no fork · Tasks first-class · Python + FastAPI · Supabase + Claude

Convergence · Four parallel research streams

What all four agents agreed on.

Four subagents ran in parallel: prior art (Humanloop, Braintrust, Dust, Vellum, LangSmith, Inngest, Gumloop, n8n), red-team of a 4-primitive spine, three architecture variants, and a data-model pressure test. They converged on five decisions.

  • Execution model · Durable graph with checkpointed state per step. Nodes typed, topology explicit, LLM calls live inside specific nodes. Source: Agent A (LangGraph checkpointer), Agent B (7-node spine), Agent C (loop + pre/post hooks).
  • Human-in-the-loop · Interrupt pattern. Any node can raise. Task row = resume token. One primitive covers approval, ambiguity, exception. Source: Agent A (LangGraph interrupt()), Agent B (Review as edge), Agent D (tasks.run_id).
  • Tenant customization · Shared core graph. Per-tenant overrides live in files (prompts, connectors), not forks. Versioned with the agent. Source: Agent A (tenants/{slug}/...), Agent D (agent_versions + grounding_refs[]).
  • Grounding · Temporal. Grounding rows versioned by valid_from / valid_to. Runs reference catalog state at their start time, not "now". Source: Agent D (grounding_rows).
  • LLM layer · Claude via Anthropic SDK. Prompt caching on the system prompt. Extract and Decide nodes get the LLM; Normalize, Compose, Dispatch stay deterministic. Source: Agent A (prompt caching), Agent B (node typing), Agent C (hooks around LLM).
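The prompt-caching decision translates into the shape of each node's Claude request: the large, stable system prompt is sent as a content block carrying a `cache_control` marker so repeated runs reuse the cached prefix. A minimal sketch of that request shape; the prompt text, model name, and helper name are illustrative, not the platform's actual code.

```python
# Sketch: marking the stable system prompt cacheable in an Anthropic
# Messages API request. STABLE_SYSTEM and the model name are placeholders.
STABLE_SYSTEM = "You extract RFQ line items into a typed schema..."

def build_request(user_content: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 2048,
        # System prompt as a content block with cache_control, so the big
        # stable prefix is cached across runs and only the hot per-run
        # content is billed at full input-token rates.
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_content}],
    }

req = build_request("RFQ email body here")
assert req["system"][0]["cache_control"] == {"type": "ephemeral"}
```

This is also why attack 3 below matters: any edit to `STABLE_SYSTEM` changes the cached prefix and silently drops the cache hit rate.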

Decision Matrix

Three variants scored against our constraints.

Scored 0 to 5 against our actual situation (2 wildly different first tenants, Python, Claude, Week 1 target, tasks-as-primitive, small team, 10+ tenants in 12 months). LangGraph-shaped wins because it takes the top score on the two criteria we cannot compromise on: Week 1 ship and audit of the human decision.

Variants: Pure LLM loop (Variant 2) · LangGraph-shaped (recommended) · Heavy DAG engine (Prefect / Temporal / Dagster)

Criteria scored:
  • Week 1 buildable
  • Auditability of HITL path
  • Client #3 onboarding cost
  • Aronlight + Lamosa both fit
  • Observability / debug
  • Migration path if we outgrow it
  • Team size fit (1-2 eng)

The Spine

Seven nodes. Review as an edge. Tasks orthogonal.

The 4-primitive sketch (Source → Extractor → Grounder → Reviewer) broke when walked through both tenants end to end. Classify, Normalize, Decide, Compose, and Dispatch were missing. Review cannot be a terminal node because exceptions fire mid-pipeline. Tasks cannot be a Reviewer output because any node emits them. Final shape:

  Inbound
    │
    ▼
  Source      ingest raw artifact (email + attachments, PDF, fax image, Excel, webhook)
    │
    ▼
  Classify    is this RFQ / PO / amendment / cancellation / status-inquiry?  cheap gate
    │
    ▼
  Normalize   OCR fax, parse Excel, strip HTML, unify units.  pre-LLM work
    │
    ▼
  Extract     LLM → structured payload against typed schema.  candidate values, unresolved refs
    │
    ▼
  Ground      resolve refs against tenant catalog + customer master + pricing.  temporal lookup
    │
    ▼
  Decide      business rules.  credit limits, stock policy, approval threshold.  LLM only when rules ambiguous
    │
    ▼
  Compose     assemble output artifact (quote PDF / OMS JSON / quote email)
    │
    ▼
  Dispatch    deliver + confirm + log.  non-skippable post-hook
    │
    ▼
  Outbound

        ┌─────────────────────────────────────────────────────┐
        │  Review edge.  Emitted by ANY node.                 │
        │  Flow pauses, state checkpointed, task row created. │
        │  Resume on human decision via task.resume_token.    │
        └─────────────────────────────────────────────────────┘

        ┌─────────────────────────────────────────────────────┐
        │  Tasks are orthogonal.                              │
        │  One queue, cross-agent, cross-tenant within org.   │
        │  Question + reasoning + pinned context + options.   │
        └─────────────────────────────────────────────────────┘

Design Move 1

Ground vs Decide split.

Ground answers "what is this?" (SKU resolution, customer lookup). Decide answers "what do we do?" (credit, stock, pricing policy). Keeps LLM-adjacent work separate from business rules. Decide is mostly deterministic with an LLM escape hatch when rules are ambiguous.
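The split can be sketched as two small functions, with all names and thresholds illustrative rather than the platform's actual API: `ground` only resolves references and reports confidence, `decide` only consumes resolved facts and applies rules.

```python
# Sketch of the Ground/Decide split. Function names, fields, and the
# credit threshold are illustrative, not the real runtime's API.

def ground(line: dict, catalog: dict) -> dict:
    """'What is this?' — resolve a raw ref against the tenant catalog."""
    sku = catalog.get(line["raw_ref"])
    return {**line, "sku": sku, "resolved": sku is not None}

def decide(order: dict, credit_limit: float) -> str:
    """'What do we do?' — deterministic rules; the LLM escape hatch for
    ambiguous rules is not shown here."""
    if not all(l["resolved"] for l in order["lines"]):
        return "needs_review"      # unresolved ref -> interrupt upstream
    if order["value"] > credit_limit:
        return "needs_approval"    # over threshold -> approval task
    return "auto_proceed"

catalog = {"8in-pvc-cap": "CAP-80035"}
order = {"lines": [ground({"raw_ref": "8in-pvc-cap"}, catalog)],
         "value": 5240.0}
assert decide(order, credit_limit=10_000.0) == "auto_proceed"
```

Keeping the two pure and separate is what lets Decide stay auditable while Ground absorbs the fuzzy matching.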

Design Move 2

Compose vs Dispatch split.

Compose builds the artifact (quote PDF, OMS payload, email body). Dispatch delivers it and confirms receipt. Different failure modes, different retry logic. Dispatch is a non-skippable post-hook: the LLM cannot route around it.

Design Move 3

Review is an edge, not a node.

Any node can raise a review task with interrupt() semantics. Flow pauses, state is checkpointed, a task row is written. Human resolves, flow resumes at the node the task specifies. One mechanism covers every HITL pattern.
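The mechanics can be sketched in a few lines of pure Python, under the assumption that the real runtime writes a `tasks` row in Supabase where this sketch uses an in-memory dict; `ReviewInterrupt` and `TASKS` are illustrative names.

```python
# Sketch of interrupt-as-edge: any node raises, state is checkpointed
# alongside a resume token, and the human decision re-enters the same node.
import uuid

class ReviewInterrupt(Exception):
    def __init__(self, question: str, options: list[str]):
        self.question, self.options = question, options

TASKS: dict[str, dict] = {}  # resume_token -> checkpointed state + question

def run_node(node, state: dict) -> dict:
    try:
        return node(state)
    except ReviewInterrupt as intr:
        token = str(uuid.uuid4())
        # Flow pauses: checkpoint state + write a task row with the token.
        TASKS[token] = {"state": dict(state), "question": intr.question,
                        "options": intr.options}
        return {**state, "paused": True, "resume_token": token}

def resume(token: str, decision: str, node) -> dict:
    task = TASKS.pop(token)
    # Re-run the raising node with the human decision applied to state.
    return node({**task["state"], "human_decision": decision})

def ground(state):
    if "human_decision" not in state and state["confidence"] < 0.75:
        raise ReviewInterrupt("Which SKU?", ["CAP-80035", "CAP-8040G"])
    return {**state, "sku": state.get("human_decision", "auto"),
            "paused": False}

s = run_node(ground, {"confidence": 0.67})
assert s["paused"]
assert resume(s["resume_token"], "CAP-80035", ground)["sku"] == "CAP-80035"
```

One mechanism, three uses: approval, ambiguity, exception all reduce to raise-checkpoint-resume.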

Design Move 4

Tasks are cross-cutting, not terminal.

Tasks are not Reviewer's output. They are the cross-agent human queue. Any node emits. Any human queue consumes. Reviewer becomes "the default task handler" for approval-class tasks, not a spine position.

System Architecture

How the pieces fit.

                        ┌──────────────────────────────────────────────────────────────┐
                        │                        Factor Web UI                         │
                        │   Next.js + Tailwind + Shadcn.  Tasks inbox, Agent Builder.  │
                        │   Auth via Supabase.  Real-time task updates via Realtime.   │
                        └──────────┬────────────────────────────────────────┬──────────┘
                                   │ REST / Realtime                        │ REST
                                   ▼                                        ▼
        ┌───────────────────────────────────────┐    ┌────────────────────────────────────────────┐
        │        Agent Runtime (Python)         │    │                  Supabase                  │
        │                                       │    │                                            │
        │   FastAPI ingress (webhooks, API)     │    │   Postgres: orgs, agents, agent_versions,  │
        │   Graph executor                      │◄──►│   runs, steps, tasks, reviews, extractions,│
        │   Node registry (typed)               │    │   sources, grounding_docs, grounding_rows, │
        │   Checkpointer (per super-step)       │    │   events (append-only audit)               │
        │   Interrupt → task writer             │    │                                            │
        │   Prompt-cache-enabled Claude client  │    │   Storage: raw emails, PDFs, attachments   │
        │                                       │    │   Auth: org-scoped RLS on every table      │
        │                                       │    │   Realtime: tasks channel → UI             │
        └──────┬────────────┬──────────┬────────┘    └────────────────────────────────────────────┘
               │            │          │
               ▼            ▼          ▼
        ┌──────────┐  ┌──────────┐  ┌────────────────────────────────┐
        │  Claude  │  │   OCR    │  │  Tenant connectors (per org)   │
        │   API    │  │ (Mistral │  │  tenants/aronlight/odoo.py     │
        │ + cache  │  │  OCR /   │  │  tenants/aronlight/prompts/    │
        │          │  │  tesser) │  │  tenants/lamosa/oms.py         │
        └──────────┘  └──────────┘  │  tenants/lamosa/prompts/       │
                                    └────────────────────────────────┘

Shared Core

One graph, one runtime.

The 7-node spine is defined once in agent_runtime/graph.py. Every agent is a configured instance of this graph with tenant-specific prompts and connectors injected at run start.

Per-Tenant Extensions

Files, not forks.

Tenant customization lives in tenants/{slug}/: prompts as markdown, connectors as Python. Loaded by slug at run start. No core code changes per new tenant.
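The load-by-slug step can be sketched as below. The loader name `load_tenant` is illustrative, and the sketch reads `config.json` purely to stay stdlib-only, where the repo layout above uses `config.yaml`.

```python
# Sketch of loading per-tenant overrides by slug. load_tenant is an
# illustrative name; JSON stands in for the repo's config.yaml so the
# sketch needs no third-party parser.
import json
import tempfile
from pathlib import Path

def load_tenant(root: Path, slug: str) -> dict:
    """Read prompts + config for one tenant. A new tenant is a new
    folder; this loader picks it up by slug, no core code changes."""
    tdir = root / "tenants" / slug
    prompts = {p.stem: p.read_text() for p in (tdir / "prompts").glob("*.md")}
    config = json.loads((tdir / "config.json").read_text())
    return {"slug": slug, "prompts": prompts, "config": config}

# Demo against a throwaway tree.
root = Path(tempfile.mkdtemp())
pdir = root / "tenants" / "aronlight" / "prompts"
pdir.mkdir(parents=True)
(pdir / "classify.md").write_text("Is this an RFQ?")
(root / "tenants" / "aronlight" / "config.json").write_text('{"threshold": 10000}')

bundle = load_tenant(root, "aronlight")
assert bundle["prompts"]["classify"] == "Is this an RFQ?"
assert bundle["config"]["threshold"] == 10000
```

In prod the hardening decisions below replace this direct file read with a hash-pinned bundle load, but the shape of the lookup is the same.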

Checkpointing

Every step is durable.

After each node completes, full graph state is written to steps.checkpoint_jsonb. Process crash = resume from last checkpoint. Review task = resume when human resolves.
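The checkpoint-then-resume loop can be sketched as follows, with an in-memory `STEPS` list standing in for the `steps` table and all function names illustrative.

```python
# Sketch of the per-step checkpointer: after each node, full state is
# persisted; a crashed run resumes from the last completed step.
import json

STEPS: list[dict] = []  # stand-in for steps rows: node name + checkpoint

def run_graph(nodes: list, state: dict, completed: int = 0) -> dict:
    """Execute nodes[completed:] and checkpoint after each one."""
    for node in nodes[completed:]:
        state = node(state)
        STEPS.append({"node": node.__name__,
                      "checkpoint": json.dumps(state)})  # durable write
    return state

def resume_from_crash(nodes: list) -> dict:
    """Crash recovery: reload the last checkpoint, skip completed nodes."""
    last = json.loads(STEPS[-1]["checkpoint"])
    return run_graph(nodes, last, completed=len(STEPS))

def classify(s): return {**s, "doc_type": "rfq"}
def extract(s):  return {**s, "lines": 4}

state = run_graph([classify, extract], {"run_id": "r1"})
assert state == {"run_id": "r1", "doc_type": "rfq", "lines": 4}
# Simulate a crash after classify: one step row survives, resume runs
# only extract.
STEPS[:] = STEPS[:1]
assert resume_from_crash([classify, extract])["lines"] == 4
```

A review task resumes through the same path: the task's checkpoint is the `last` state, and the human decision is merged in before re-entering the node.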

Walkthroughs · Both tenants through the spine

Aronlight (quote-gen) end to end.

Inbound RFQ email from Grupo Ribeiro. Four line items, partial SKU hints, one ambiguous spec. The spine produces a quote in ~2 minutes. One task emitted for the ambiguous SKU. Review lives in the Tasks inbox.

SOURCE
IMAP fetch. Email + 1 attachment (PDF with customer purchase history). Stored raw in sources.raw_uri. Parsed envelope + headers to sources.parsed_jsonb.
Auto
CLASSIFY
LLM call with tiny prompt. Decision: rfq. Alternate candidates: follow_up (0.04), complaint (0.01). Confidence 0.94. Proceeds.
Auto
NORMALIZE
PDF attachment has no OCR need (text extractable). Body stripped to plain text. Currency signals: EUR. Units: mixed (inches, cm). Normalizer flags mixed units for Extractor prompt context.
Auto
EXTRACT
Claude Sonnet with cached system prompt. Schema: rfq_line[]. 4 lines extracted. Line 3 raw: "8" PVC 40 SW Caps x 200". Candidate SKUs with rationale, not resolved.
Auto
GROUND
3 lines match Odoo catalog cleanly (0.92, 0.88, 0.95). Line 3 returns three candidates at 0.67, 0.41, 0.31. Grounder raises an interrupt.
Task raised
— pause —
Task Q-2026-0417-L3: "Which SKU matches '8" PVC 40 SW Caps'?" Ranked options CAP-80035 / CAP-8040G / CAP-80CPV. Pinned source snippet. Teach-the-agent toggle. Impact score: €1,340 (line value).
In queue
GROUND (resume)
Human picks CAP-80035 with teach-agent on. Ground re-runs with the correction applied. Line 3 resolved. Teach lesson written to grounding memory.
Auto
DECIDE
Customer: Ribeiro, existing, tier B. Value €5,240. Under €10k threshold → no approval required. Payment: net-30 default. No exceptions. Auto-proceed.
Rules check
COMPOSE
Quote template (Spanish, Aronlight brand). Line items + subtotal + terms + lead times pulled per-SKU. Rendered to PDF. Email body composed with Spanish greeting template.
Auto
DISPATCH
Reply email sent via tenant SMTP. Quote logged to Odoo as draft quote_draft_id. CRM activity written. Run completes.
Auto

Lamosa (PO-processing) end to end.

Inbound PO from a retailer Lamosa hasn't processed before. New layout, fax image, customer on credit hold. Two tasks emitted, routed to different human queues. The spine holds.

SOURCE
Email with fax image attachment (TIFF). Sender: new retailer "Casa Azul SA". Stored raw in sources.raw_uri.
Auto
CLASSIFY
Decision: po. Confidence 0.88. Proceeds.
Auto
NORMALIZE
OCR via Mistral OCR (better than Tesseract on fax). Low OCR confidence on shipping address block. Layout classifier: unseen. Normalizer raises interrupt to train the layout.
Task raised
— pause —
Task PO-2026-0044-LAYOUT: "First time seeing this PO layout from 'Casa Azul'. Train the agent?" Preview of layout. Options: train (auto-detect fields) / skip (manual-extract once) / reject. Routes to ops queue.
In queue · ops
EXTRACT
Resumes with layout template stored. Claude extracts: customer, PO#, 12 line items, requested ship date. Confidence 0.91.
Auto
GROUND
SKU matching against Lamosa catalog: 11 clean, 1 discontinued. Customer "Casa Azul" not in master → creates placeholder, flags for Decide.
Auto
DECIDE
Rules check: new customer without credit profile. Rule: new customers must have credit check before PO load. Raises interrupt routed to AR team queue.
Task raised
— pause —
Task PO-2026-0044-CREDIT: "New customer 'Casa Azul SA' needs credit check before PO load. Value $18,420." Ranked: approve tier A / approve tier B / partial ship pending credit / reject. Routes to AR queue. Impact score reflects PO value.
In queue · AR
DECIDE (resume)
AR resolves with "approve tier B". Decide re-runs. Rule passes with customer now tier B. Discontinued SKU: substitution rule applies. Auto-proceed.
Rules check
COMPOSE
OMS JSON payload built. Lamosa OMS schema. Line items mapped with substituted SKU noted. Validated against OMS schema before dispatch.
Auto
DISPATCH
POST to Lamosa OMS. 200 OK. OMS order ID returned. Confirmation email sent to Casa Azul contact. Run completes.
Auto

Data Model

One schema. Both tenants. No fork.

Twelve tables. Multi-tenant from row one via org_id with RLS on every table. Agent versioning via immutable agent_versions so config changes never rewrite history. Grounding versioned temporally. Tasks first-class with a resume token. Events append-only for audit.

  tenant layer
  ┌─────────┐     ┌──────────┐     ┌───────────────────┐
  │  orgs   │◄────┤  agents  │────►│  agent_versions   │  immutable, versioned
  └─────────┘     └──────────┘     │  config_jsonb     │
                                   │  prompts_jsonb    │
                                   │  tools_jsonb      │
                                   │  grounding_refs[] │
                                   └─────────┬─────────┘
                                             │
  runtime layer                              │
  ┌──────────┐     ┌──────────┐     ┌────────▼─────────┐
  │ sources  │────►│   runs   │────►│   steps          │  checkpoint_jsonb per super-step
  └──────────┘     │  status  │     │   node_type      │
                   │  conf    │     │   input_jsonb    │
                   └─────┬────┘     │   output_jsonb   │
                         │          └──────────────────┘
                         │
                         ├───────►┌──────────────┐
                         │        │ extractions  │  schema_name + schema_hash + payload_jsonb
                         │        └──────────────┘
                         │
                         └───────►┌──────────────┐
                                  │    tasks     │  first-class, cross-agent human queue
                                  │  question    │
                                  │  reasoning   │
                                  │  pinned_ctx  │
                                  │  options[]   │
                                  │  unblocks[]  │
                                  │  impact      │
                                  └──────┬───────┘
                                         │
                                         └───►┌─────────┐
                                              │ reviews │  decision + teach_payload
                                              └─────────┘

  reference layer
  ┌─────────────────┐     ┌──────────────────┐
  │ grounding_docs  │◄────┤ grounding_rows   │  temporal:  valid_from, valid_to
  │  (catalog,      │     │  external_id     │
  │   customers,    │     │  payload_jsonb   │
  │   pricing)      │     │  embedding       │
  └─────────────────┘     └──────────────────┘

  audit layer
  ┌─────────────┐
  │   events    │  append-only.  REVOKE UPDATE/DELETE.  aggregate_type + aggregate_id + event_type
  └─────────────┘
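Temporal grounding deserves one worked example: a run started before a price change keeps resolving against the old catalog row even if it resumes after the change. A minimal sketch with rows mimicking grounding_rows (`valid_to` of None meaning still current); the lookup function name is illustrative.

```python
# Sketch of temporal grounding: resolve refs as of the run's start time,
# not "now". Rows mimic grounding_rows with valid_from / valid_to.
from datetime import datetime

rows = [
    {"external_id": "CAP-80035", "payload": {"price": 6.70},
     "valid_from": datetime(2026, 1, 1), "valid_to": datetime(2026, 4, 1)},
    {"external_id": "CAP-80035", "payload": {"price": 7.10},
     "valid_from": datetime(2026, 4, 1), "valid_to": None},
]

def as_of(rows, external_id: str, at: datetime) -> dict:
    """Return the row version valid at time `at` (half-open interval)."""
    for r in rows:
        if (r["external_id"] == external_id
                and r["valid_from"] <= at
                and (r["valid_to"] is None or at < r["valid_to"])):
            return r["payload"]
    raise LookupError(f"{external_id} not in catalog at {at}")

# A run started March 30 keeps seeing the March price, even if it resumes
# after the April 1 price change.
assert as_of(rows, "CAP-80035", datetime(2026, 3, 30))["price"] == 6.70
assert as_of(rows, "CAP-80035", datetime(2026, 4, 2))["price"] == 7.10
```

The half-open interval (`valid_from <= at < valid_to`) is the detail that keeps adjacent versions from overlapping at the boundary.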

Schema tensions we resolved

Tenant Layout

Shared core, per-tenant files.

The repo has one core graph. Each tenant lives in a folder. Adding client #3 is a new folder plus a few config rows, not a new service.

  agent-platform/
    agent_runtime/
      graph.py            # the 7-node spine, one definition for everyone
      nodes/
        source.py
        classify.py
        normalize.py
        extract.py
        ground.py
        decide.py
        compose.py
        dispatch.py
      checkpointer.py
      interrupts.py       # writes tasks table, reads resume token
      tools/              # shared tool implementations
    api/
      main.py             # FastAPI: webhooks, REST, task resolve
    tenants/
      aronlight/
        prompts/
          classify.md
          extract_rfq.md
          compose_quote.md
        connectors/
          odoo.py         # catalog + customer pulls, quote push
          smtp.py
        config.yaml       # thresholds, feature flags
      lamosa/
        prompts/
          classify.md
          extract_po.md
          compose_oms.md
        connectors/
          oms.py          # OMS push, customer master upload
          ocr.py          # fax OCR pipeline
        config.yaml
    tests/
    migrations/           # Supabase schema migrations

Kill Criteria

What makes us rewrite this.

Every architecture has a breaking point. Naming ours now so we know when to act.

Kill 1

Concurrency scale.

~100 tenants with multi-hour concurrent runs. Our hand-rolled checkpointer becomes the bottleneck. Action: swap checkpointer for Temporal. Schema stays, runtime changes.

Kill 2

Visual authoring demand.

Tenants demand visual workflow authoring (Zapier-style builder). Action: wrap the graph with a builder UI; do not replace the engine. Vellum's path.

Kill 3

LLM latency per node.

Claude tool-use inside nodes exceeds per-step SLA at volume. Action: drop LLM-in-node for deterministic code at those nodes; retain LLM-as-judge for eval.

Kill criteria we are not worried about

Red Team · Attacks not covered by the 3 kill criteria

Seven sharp attacks. Five we harden against in Week 1.

After picking the architecture we spawned a dedicated adversarial agent with one job: find failure modes the named kill criteria don't already cover. Seven landed, ranked by damage if unmitigated. The top one forces a rewrite if retrofitted late, so it moves into Week 1 scope.

1. No replay determinism · Cannot reconstruct why a run produced wrong output.
   Trigger: Lamosa PO produced a wrong quote at 3am. Resume from checkpoint, but Claude is non-deterministic, Odoo state changed, temporal grounding rolled. The post-mortem is speculative. Client demands an explanation.
   Hits by: Month 3.
   Mitigation: Week 1 scope. Record every LLM call, connector call, and grounding read into events. Replay mode uses recorded responses. Cheap now, a rewrite if retrofitted.
2. Checkpoint JSONB bloat · A 40-page PDF in state = 8MB checkpoints.
   Trigger: The first multi-page-PDF tenant hits volume. 7 super-steps × 500 runs/day × 90-day retention = Postgres rows at hundreds of GB with TOAST pressure.
   Hits by: Month 2-3.
   Mitigation: Never blob in state. Store artifacts in Supabase Storage, only URIs + hashes in the checkpoint. CI gate: checkpoint_jsonb < 256KB. Partition steps by month.
3. Prompt cache invalidation · One edit, 3x token bill, no error.
   Trigger: An engineer edits tenants/aronlight/prompts/extract.md to fix a bug. The cache hash changes. Next 48h: every Aronlight run pays full input token cost. Nobody notices because runs still succeed. The month-end bill is 3x forecast.
   Hits by: Week 3-4.
   Mitigation: Split stable_system.md (cached) from tenant_variables.md (hot). A prompt-diff CI gate flags cache-busting edits. Dashboard: cache_hit_rate_by_tenant, alert < 70%.
4. Tenant folder merge conflicts · Filesystem config has no version field.
   Trigger: A second engineer joins. A edits connectors/oms.py Monday. B edits prompts/classify.md Tuesday. They merge. The prompt expects a connector field B didn't add. Wednesday prod fails with a KeyError that doesn't repro in staging.
   Hits by: Month 1-2.
   Mitigation: Immutable hash-pinned tenant bundle. agent_versions.tenant_bundle_hash locks prompts + connectors + schema as one unit. No hot-reload of tenants/* in prod.
5. Zombie runs from abandoned interrupts · No SLA on the task queue.
   Trigger: Reviewer raises an interrupt on Decide. Task created. The reviewer quits, nobody reassigns. The run pauses indefinitely. Three weeks later the customer asks "where's my quote?"
   Hits by: Week 4-6.
   Mitigation: Every task gets sla_deadline, escalation_policy, and a sweeper cron that reassigns or auto-decides-with-default after N hours. Task-aging dashboard per tenant.
6. Extraction schema drift · "Shared spine" is a lie when tenants have different docs.
   Trigger: A Lamosa PO arrives with lote, pallet_id, m2_per_box, tono (ceramic shade batch). None fit Aronlight's extraction schema.
   Hits by: Month 2.
   Mitigation: Per-tenant canonical schema versioned with agent_version. Store raw_jsonb + canonical_jsonb separately. Decide consumes only canonical.
7. RLS + admin queries collide · Cross-tenant reads bypass RLS or leak one row.
   Trigger: Ops needs "all failed runs in the last hour across tenants". RLS is org-scoped. Either use the service role and bypass RLS everywhere, or write a parallel non-RLS admin schema. A subtle policy bug leaks one Lamosa row into Aronlight during a JOIN.
   Hits by: Month 2.
   Mitigation: Two access planes from day one: tenant plane (RLS via Supabase auth), ops plane (service role, audited). A pgTAP suite tries cross-tenant reads from every table on every migration.
Week 1 hardening decisions

Five changes land in the scaffold before the first Aronlight run.

  • 01 · Event sourcing captures replay inputs. Every LLM call (prompt + response + model + params), every connector call (request + response), every grounding read (doc_id + version). events.payload_jsonb carries the record. Replay mode switches the client to read-from-events.
  • 02 · Checkpoint size budget. Artifacts live in Supabase Storage. checkpoint_jsonb carries URIs + SHA-256 hashes only. Runtime asserts < 256KB; CI fails the migration otherwise.
  • 03 · Immutable tenant bundles. agent_versions.tenant_bundle_hash is the SHA of the full tenants/{slug}/ tree at version-bump time. Prod loads bundles by hash, never hot-reloads files.
  • 04 · Task SLA fields. tasks.sla_deadline, tasks.escalation_policy_jsonb. A sweeper cron runs every 15 min to reassign or auto-resolve stalled tasks per policy.
  • 05 · Two access planes. Tenant plane uses the Supabase anon key + RLS. Ops plane uses the service role with an audit event on every query. pgTAP suite runs in CI.
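Hardening decision 02 can be sketched as a pair of guards: artifacts are swapped for a URI plus SHA-256 before they ever enter graph state, and the checkpointer asserts the serialized state stays under budget. `put_artifact` is an illustrative stand-in for a Supabase Storage upload, not the real client call.

```python
# Sketch of the checkpoint size budget: blobs become URI + hash, and the
# serialized checkpoint must stay under 256KB. put_artifact stands in for
# a Supabase Storage upload.
import hashlib
import json

CHECKPOINT_BUDGET = 256 * 1024  # bytes

def put_artifact(blob: bytes) -> dict:
    """Upload stand-in: return a URI + content hash instead of the bytes."""
    digest = hashlib.sha256(blob).hexdigest()
    return {"uri": f"storage://artifacts/{digest}", "sha256": digest}

def checkpoint(state: dict) -> str:
    payload = json.dumps(state)
    # Runtime/CI gate: a blob smuggled into state trips this assert long
    # before Postgres TOAST pressure does.
    assert len(payload.encode()) < CHECKPOINT_BUDGET, "checkpoint over budget"
    return payload

pdf_bytes = b"%PDF-1.7 ..." * 100_000  # ~1.2MB fake attachment
state = {"run_id": "r1", "attachment": put_artifact(pdf_bytes)}
assert len(checkpoint(state)) < 1024   # URI + hash only; the blob stayed out
```

The hash doubles as the replay anchor: event-sourced runs can verify they are re-reading the exact artifact the original run saw.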

Alternatives kept on the shelf

Two variants we can pivot to if the winner breaks.

Alt 1

Pure LLM-orchestrated loop.

Ship if
Week 1 is existential and auditability risk is deferred.
Trade
Loses deterministic path for regulated flows. "Claude skipped credit check" is not defensible.
Migration cost
Low. Same schema. Unwrap the graph, keep the tools.
Alt 2

Temporal + Dust-shape.

Ship if
We hit 100+ tenants or multi-hour concurrent agents.
Trade
Heavyweight for today. Worth it when concurrency bites.
Migration cost
Medium. Activities map 1:1 to our nodes. Checkpointer replaced, schema reused.

V2 Hazards

What the schema cannot express cleanly.

Flagged now, deferred to v2. Not blocking MVP.

Research Sources

Prior art the decision rests on.

Public architecture posts, docs, and postmortems from platforms that already made these calls.

Next

Scaffold the real repo at ../agent-platform/. Stand up Supabase schema from Agent D's SQL with the five Week 1 hardening changes folded in (event sourcing for replay, checkpoint size budget, immutable tenant bundles, task SLA, two access planes). Implement the 7-node graph with one pass of Aronlight RFQ end to end before touching Lamosa. Target: Aronlight hello-world inside 5 working days with replay determinism from day one.