Reports Pipeline — Peter Niu

Agent pipeline for analytical reports — and a worked example of why constrained agents beat autonomous ones for long-running workflows.

When I need to produce an analytical deliverable — an impact report, a research brief, anything where the numbers matter — I run it through octo. It’s a pipeline of seven specialized Claude agents, each with strict boundaries on what it can see and do, connected by locked artifacts and mechanical validators. Every number in the final report traces to a specific cell in a locked dataset through a validated citation chain.

Most of what you read about AI agents this year is swarm-shaped: autonomous agents negotiating with each other, planning their own work, deciding when they’re done. Octo is the opposite pattern. A fixed pipeline — linear, stage-gated, boring on purpose. This walkthrough shows why that constraint is what makes long-running AI workflows actually deliver.

When you’re evaluating multi-agent orchestration frameworks — CrewAI, LangGraph, AutoGen — the question isn’t which one has more features. It’s whether the framework’s architecture matches the kind of workflow you’re running. Building a pipeline from scratch teaches you which constraints are load-bearing and which are just packaging.

TL;DR

Long AI sessions accumulate “context rot” — models start reconstructing numbers from memory instead of from data. A pipeline breaks the session into fresh, bounded stages.
Each stage produces locked artifacts the next stage reads. Nothing gets recomputed; nothing drifts.
Python validators check structural contracts between stages — not whether the analysis is good, but whether citations resolve and numbers match.
The result: every number in the final report traces back to a specific computation through a chain you can audit in minutes.

The problem: numbers you can’t trace

Ask Claude to produce an impact report from a dataset. It’ll do it — analysis, charts, prose, executive summary, the works. Two hours of back-and-forth in a long session. The result will look polished. And somewhere in the middle, there’ll be a number you can’t trace.

Maybe the model computed a metric in the analysis and cited a slightly different version in the summary. Maybe it rounded inconsistently. Maybe it derived a number from what it remembered instead of from what it computed. You can’t tell from reading the report — the prose is fluent, the charts are clean, the number looks plausible.

Plausible is not traceable.

This gets worse the longer the session runs. Earlier computations drift out of the context window. The model starts reconstructing numbers from memory instead of from data. Context rot is the technical term. “Made-up numbers in a professional report” is the practical one.

The pattern: a pipeline, not a swarm

Swarms suit exploration and creative work. When the deliverable is a report someone will make decisions from, exploration is the wrong target. You want traceability — every number should have a provenance chain: this finding came from this analysis script, which read this locked dataset, which was built from these raw sources.

Octo enforces that with a pipeline: a fixed sequence of stages, each producing artifacts the next stage reads. No stage recomputes what a previous stage locked. No agent sees the conversation history from earlier stages — it reads files instead. State lives on disk, not in a context window that degrades over time.

Three properties make the pattern work:

Fresh context per stage. Each agent starts a clean session. No context rot across stages — the only state that carries forward is the locked files on disk.

Artifact locking. Once the dataset is locked, it never gets recomputed. Once findings are extracted, they become the single source of truth for every number downstream. Lock once, read many.

Mechanical validation. Python scripts check contracts between stages — not whether the analysis is good (that’s human review’s job), but whether structural invariants hold: do citations resolve, do numbers match, are required fields present?

Fresh context is a feature, not a workaround. The instinct is to pass everything forward — give each agent the full history so it “understands” what came before. The problem is that long contexts degrade: models start reconstructing earlier outputs from memory rather than re-reading them. Breaking the session at each stage, and making the agent read files instead of inheriting conversation history, is what keeps each stage’s reasoning clean.

How a number gets from raw data to the final report

Here’s the path one metric takes through the pipeline — from a column in a CSV to a sentence in the deliverable.

[Figure: pipeline diagram. Steps with a blue stripe are AI-powered (every stage runs a Claude agent). Dashed boxes are inputs and outputs. Validator checkpoints (left side, in monospace) run between stages.]

Scope and ground truth

Stage 0 sets the contract. The brief defines the deliverable’s audience, format, voice, analytical conventions, and a one-sentence takeaway that every downstream stage serves. This is the only stage with direct human interaction — after the brief is approved, the orchestrator dispatches everything else.

Stage 2 builds the analytical dataset. An R or Python script reads the raw sources, reshapes them, computes derived columns, and writes analytical.parquet. Then lock_dataset.py runs: it hashes the file, captures the schema, records which raw sources and scripts produced it, and seals everything into MANIFEST.yaml. From this point, that parquet file is ground truth. No downstream stage can modify it — if anyone does, the hash won’t match.
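To make the lock step concrete, here is a minimal Python sketch of hashing a dataset and checking it later. It is not the actual lock_dataset.py: the function names and the raw_sources and scripts fields are assumptions for illustration, the schema is passed in rather than read from the parquet file, and writing the result out as MANIFEST.yaml is left out.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(
    dataset: Path, schema: list[dict], sources: list[str], scripts: list[str]
) -> dict:
    """Collect what is needed to prove later that the dataset is unchanged."""
    return {
        "dataset": dataset.name,
        "sha256": sha256_of(dataset),
        "schema": schema,
        "raw_sources": sources,   # hypothetical field name
        "scripts": scripts,       # hypothetical field name
        "locked_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


def verify_lock(dataset: Path, manifest: dict) -> bool:
    """Return True only if the file on disk still matches the recorded hash."""
    return sha256_of(dataset) == manifest["sha256"]
```

A downstream stage could call verify_lock before reading the dataset and refuse to proceed if it returns False, so any edit to the file after locking shows up as a hash mismatch.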

Analysis and findings

Stage 3 reads the locked dataset and produces the analysis: charts, tables, summary statistics. Every script reads from analytical.parquet and writes to analysis/outputs/. The agent can’t modify the dataset. If it finds a data problem, it goes back to Stage 2 and rebuilds — no patching on the fly.

Stage 4 is the unusual one. Instead of letting the narrative writer pull numbers straight from analysis outputs, this stage extracts every quantitative claim into a structured registry: findings.yaml. Each finding gets an ID (F001, F002…), the actual value, a confidence level, and a source locator pointing to exactly where the number came from.

A findings registry converts scattered outputs into a source of truth. The instinct is to let the narrative writer read analysis outputs directly — they’re right there. The problem is that “right there” means different things across charts, tables, and summary files. The findings stage forces a single canonical claim for each number before any prose is written. Everything downstream cites by ID, not by value. The finding is the number.
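One way to picture the registry is as a set of immutable records indexed by ID. The sketch below is an illustration, not the real findings.yaml schema: the Finding class, its field names, and build_registry are assumptions modeled on the fields described above (ID, value, confidence, source locator, must_cite).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    id: str               # e.g. "F012"
    value: float          # the canonical number downstream stages cite
    description: str
    source_dataset: str   # locator: which locked dataset was read
    source_script: str    # locator: which analysis script computed the value
    confidence: str       # e.g. "high"
    must_cite: bool = False


def build_registry(findings: list[Finding]) -> dict[str, Finding]:
    """Index findings by ID, refusing duplicates so each ID has one canonical value."""
    registry: dict[str, Finding] = {}
    for finding in findings:
        if finding.id in registry:
            raise ValueError(f"duplicate finding id: {finding.id}")
        registry[finding.id] = finding
    return registry


# Usage: downstream code looks up a number by ID instead of copying the value.
registry = build_registry([
    Finding(
        id="F012",
        value=42.7,
        description="Chronic absenteeism fell 42.7 pp in treatment cohort",
        source_dataset="data/analytical.parquet",
        source_script="analysis/absenteeism_cohort.R",
        confidence="high",
    )
])
```

Freezing the dataclass and rejecting duplicate IDs are design choices in this sketch: together they mean an ID can only ever refer to one value once the registry is built.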

Narrative and trust

Stage 5 writes the report prose. Every quantitative claim must include an inline citation — [F001] — pointing back to the finding. The agent can’t introduce numbers from memory or derive new ones from existing findings. If a number isn’t in findings.yaml, it doesn’t go in the narrative.

Stage 6 is the adversarial audit. A fresh agent — different session, no shared context with the writer — reads the narrative and hunts for problems: fabricated numbers, citation errors, value drift (the narrative says “43%” but finding F012 says “42.7%”), overclaiming beyond what the data supports. It produces a claims_audit.md table before making any fixes, so the problems are documented before they’re corrected.

validate_narrative.py runs after both stages and checks three things mechanically:

Every [F###] citation resolves to a finding that actually exists
Every finding marked must_cite: true appears somewhere in the narrative
Every quantitative claim — percentages, sample sizes, dollar amounts — has a citation within 80 characters

That last check is the anti-fabrication net. An agent that invents a number will almost always forget to cite it, because there’s nothing to cite. The validator catches the gap.
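The three checks can be sketched in a few lines of Python. This is an illustrative version, not the actual validate_narrative.py: the regex for what counts as a quantitative claim, the dict shape of the findings, and the function signature are all assumptions.

```python
import re

CITATION = re.compile(r"\[(F\d{3})\]")
# Assumed definition of a "quantitative claim": dollar amounts, percentages, "n = 123".
CLAIM = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?|\d+(?:\.\d+)?\s?%|\bn\s?=\s?\d[\d,]*",
    re.IGNORECASE,
)


def validate_narrative(text: str, findings: dict[str, dict], window: int = 80) -> list[str]:
    """Return a list of contract violations; an empty list means the narrative passes."""
    errors: list[str] = []
    citations = [(m.group(1), m.start()) for m in CITATION.finditer(text)]
    cited_ids = {fid for fid, _ in citations}

    # Check 1: every [F###] citation resolves to an existing finding.
    for fid in sorted(cited_ids - set(findings)):
        errors.append(f"unresolved citation [{fid}]")

    # Check 2: every must_cite finding appears somewhere in the narrative.
    for fid, finding in findings.items():
        if finding.get("must_cite") and fid not in cited_ids:
            errors.append(f"must_cite finding {fid} is never cited")

    # Check 3: every quantitative claim has a citation within `window` characters.
    for claim in CLAIM.finditer(text):
        lo, hi = claim.start() - window, claim.end() + window
        if not any(lo <= pos <= hi for _, pos in citations):
            errors.append(f"uncited claim {claim.group(0)!r} at offset {claim.start()}")

    return errors
```

Returning a list of violations rather than raising on the first one lets a single run report every broken contract, which is more useful when a human or a re-run stage has to fix them.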

Mechanical validation handles what AI judgment cannot. You could ask the audit agent to check whether all citations resolve correctly — but you’d be trusting one LLM to reliably catch the errors of another. A Python script that checks [F###] against findings.yaml is deterministic, fast, and doesn’t hallucinate. Use AI for judgment; use code for contracts.

What constrains each agent

The pipeline’s reliability comes from what each agent can’t do, not what it can.

Each stage agent lives in a markdown file (.claude/agents/octo-NN-*.md) with explicit boundaries: what it may read, what it must produce, what it must refuse. Most refusals start with “do NOT recompute” or “do NOT modify.” Each constraint exists because of a specific failure mode I hit during development.

Stage 3 can read the locked dataset but can’t modify it. Stage 5 can read findings but can’t add new ones: if it needs a number that isn’t in findings.yaml, it stops and says so. Stage 6 must produce an audit before making corrections. These aren’t loose suggestions; they’re explicit constraints in each agent’s instructions, and the validators mechanically check the artifacts each stage produces.

The model assignment matters for both cost and reliability:

Opus handles judgment-heavy stages: analysis, narrative, adversarial QA
Sonnet handles mechanical work: data lock, findings extraction, assembly

You don’t need Opus to extract structured findings from analysis outputs. You don’t want Sonnet writing publication-quality prose. Right-sizing the model per stage is part of what makes the pipeline economical to run.

The orchestrator dispatches stages in order, runs validators between them, and pauses at human QA gates. If a validator fails, the pipeline stops. If human review catches something, the orchestrator archives current artifacts, re-runs the affected stage, and cascades to downstream stages. Prior versions are archived, never deleted — nothing is lost when you revise.
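The control flow described here is small enough to sketch. The following Python is an assumption-laden illustration rather than octo’s orchestrator: Stage, PipelineHalted, and run_pipeline are hypothetical names, agent dispatch is stubbed as a plain callable, and archiving and human QA gates are left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    run: Callable[[], None]             # dispatches the stage's agent (stubbed here)
    validate: Callable[[], list[str]]   # returns contract violations, empty if none


class PipelineHalted(Exception):
    """Raised when a validator reports violations; no later stage runs."""


def run_pipeline(stages: list[Stage], start_at: int = 0) -> list[str]:
    """Run stages in order from `start_at`, stopping at the first failed validator.

    Passing the index of a revised stage as `start_at` re-runs it and then
    cascades to every downstream stage.
    """
    completed: list[str] = []
    for stage in stages[start_at:]:
        stage.run()
        errors = stage.validate()
        if errors:
            raise PipelineHalted(f"{stage.name}: {'; '.join(errors)}")
        completed.append(stage.name)
    return completed
```

Because the sequence is a plain list, "done" has a deterministic meaning in this sketch: run_pipeline returned without raising, so every validator passed.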

What the citation chain looks like

Here’s traceability at each stage — one number, through the whole pipeline:

# Stage 2 — data/MANIFEST.yaml (locked)
dataset: analytical.parquet
sha256: "3a7f..."
schema:
  - { name: absenteeism_pct_pre, dtype: float }
locked_at: 2026-04-11T09:55:00Z

# Stage 4 — findings/findings.yaml
- id: F012
  value: 42.7
  description: "Chronic absenteeism fell 42.7 pp in treatment cohort"
  source:
    dataset: data/analytical.parquet
    script: analysis/absenteeism_cohort.R
  confidence: high

# Stage 5 — narrative/narrative.md
"...chronic absenteeism fell by 42.7 percentage points [F012]
in treatment districts..."

# validate_narrative.py
# ✓ [F012] resolves to findings.yaml
# ✓ "42.7" matches F012.value
# ✓ No uncited quantitative claims nearby

The dataset hash proves the data hasn’t changed. The finding locator points to the exact computation. The citation links prose to finding. The validator checks the chain mechanically. You can audit the whole report by reading findings.yaml and claims_audit.md — the pipeline built the audit trail for you.
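The value-match line in the walkthrough above can also be checked mechanically. Here is a minimal sketch, assuming a convention the source does not spell out: the number being cited is the last numeric token in the 80 characters before the [F###] marker. The function name and the tolerance parameter are hypothetical.

```python
import re

CITATION = re.compile(r"\[(F\d{3})\]")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def find_value_drift(
    text: str, values: dict[str, float], window: int = 80, tol: float = 1e-9
) -> list[tuple[str, float, float]]:
    """Return (finding_id, stated_value, locked_value) for each mismatch."""
    drift = []
    for m in CITATION.finditer(text):
        fid = m.group(1)
        if fid not in values:
            continue  # unresolved citations are a separate check
        # Drop earlier citation markers so their digits are not read as values,
        # then look only at the text just before this citation.
        before = CITATION.sub("", text[: m.start()])[-window:]
        numbers = NUMBER.findall(before)
        if not numbers:
            continue  # nothing numeric nearby to compare
        stated = float(numbers[-1])
        if abs(stated - values[fid]) > tol:
            drift.append((fid, stated, values[fid]))
    return drift
```

With this convention, a narrative that rounds 42.7 to "43%" next to [F012] is flagged, while one that repeats 42.7 passes.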

The stack, annotated

Layer	Component	Role
Orchestration	Claude Code subagents	Each stage = one agent with explicit scope boundaries
Models	Opus / Sonnet	Opus for judgment (analysis, narrative, QA); Sonnet for mechanics (data lock, findings, assembly)
Validation	Python scripts	Mechanical checks between stages — hashes, citations, value drift
State	Files on disk	Parquet, YAML, markdown — inspectable at every stage
Revisions	Archive + cascade	Prior versions kept; downstream stages re-run when upstream changes

No cloud services. No databases. No external APIs beyond the Claude API that Claude Code already uses. The whole thing runs locally, in your project directory, with files you can read at every step.

Why build this yourself?

If you’ve used AI for analytical work — reports, summaries, research briefs — you’ve hit the trust problem. The output looks right. You can’t easily verify it is right, because the numbers live in a conversation you’d have to re-read line by line to audit.

The pipeline pattern makes provenance structural instead of conversational. Every number has a file, a finding ID, a citation, and a validator checking the chain. You don’t audit by re-reading the session — you read findings.yaml and claims_audit.md, and the tracing is already done.

Pipelines vs. swarms: choose based on what “done” means. Autonomous swarms optimize for exploration — agents negotiate, plan, and decide when they’re finished. That’s powerful for open-ended tasks. But for a deliverable with a fixed definition of correct (every number cited, every citation valid), you need a deterministic exit condition, not emergent consensus. The pipeline’s linear structure isn’t a limitation; it’s what makes “done” mean something.

That’s the pattern worth taking from this — not the specific stages or the YAML schema, which are implementation details for my use case. The pattern is: constrain your agents, lock your artifacts, validate your contracts, and never let a number exist without a source. Pipelines are boring. For work where the numbers have to be right, boring is the feature.