News — Peter Niu

A daily AI/EdTech news digest — and a worked example of how to test whether an AI system matches your judgment.

Every morning at 7am Eastern, niu-news hands me at most five news items worth my time. The AI/EdTech firehose is fifty stories a day — niu-news reads all of them and cuts forty-five. It’s a small piece of personal infrastructure, and a worked example of the question that matters most for any AI system: how do you know it’s actually matching your judgment?

The point isn’t that everyone should build their own news pipeline. It’s that “AI-powered curation” is a claim you’ll hear from a dozen SaaS tools this year — and once you’ve built the scoring rubric, the eval framework, and the regression gate yourself, you know exactly which questions to ask before buying one.

TL;DR

How a three-stage pipeline (fetch → score → digest) filters signal from noise
The four-criterion scoring rubric and why “any one passes” beats “weighted average”
How to build an eval framework: golden set, replay, judge, regression gate
Why the eval is the one part the system can’t lose without quietly breaking

The problem: too much signal, no time

I want to stay current on three things: frontier model capability, emerging best practices that practitioners are stress-testing, and structural shifts in K12 policy. The sources — Hacker News, RSS feeds, GitHub Trending, Google News — surface those signals, but also a lot of noise. Launches that don’t matter. Hot takes. Posts restating yesterday’s news.

Reading them all takes an hour I don’t have. Skipping them means missing things I’d want to know. niu-news sits in the middle — reads everything, scores each item against a strict rubric, and only escalates what clears the bar.

The building blocks

Everything here has a free tier or pay-as-you-go pricing. Total cost: about a dollar a month.

Component	Role	Why this one
AWS Lambda	Serverless runtime	Billed by the millisecond; no servers to keep running
EventBridge	Scheduler	Wakes each Lambda on its own cron — 6h / 6h+15m / daily 7am ET
DynamoDB	Item + verdict + digest store	Fast, cheap, free up to 25 GB
S3	Source registry	One `sources.json` file; edit it, next run picks up the change
Claude Haiku 4.5	Scorer (hot path)	Small and cheap enough to run on every news item
Claude Sonnet 4.6	Eval judge (audit path)	Larger model grades the scorer’s reasoning quality
MCP + REST	Interface	AI tools connect via MCP; everything else via REST

Cost architecture. Small model on the hot path, big model on the audit path. That’s the entire cost story — and it generalizes to any system where you need cheap decisions at volume plus expensive quality checks on the side.

How a news item moves through the system

The pipeline is three Lambda functions on independent schedules. They don’t call each other — they read from and write to DynamoDB, and EventBridge wakes each one up on its own clock.
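The three independent clocks could be written as EventBridge schedule expressions. This is a minimal sketch: the rule names and the exact start hours are assumptions, and only the 6h / 6h+15m / daily cadence comes from the component table. EventBridge rule cron expressions are evaluated in UTC, so the daily run is shown as 11:00 UTC, which is 7am Eastern during daylight time.

```python
# Hypothetical schedule map; the names and start hours are illustrative only.
# EventBridge cron fields: minutes hours day-of-month month day-of-week year
SCHEDULES = {
    "fetcher": "cron(0 0/6 * * ? *)",   # every six hours, on the hour
    "scorer": "cron(15 0/6 * * ? *)",   # fifteen minutes after each fetch
    "digest": "cron(0 11 * * ? *)",     # once a day at 11:00 UTC
}


def minute_and_hours(expression: str) -> tuple[str, str]:
    """Return the minutes and hours fields of a cron(...) expression."""
    fields = expression.removeprefix("cron(").removesuffix(")").split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 cron fields, got {len(fields)}")
    return fields[0], fields[1]
```

Because the functions never call each other, the 15-minute offset is the only coordination between fetcher and scorer.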

In the pipeline diagram, steps marked with a blue stripe are AI-powered. The intelligence is concentrated in one place, the scorer, and everything else around it is plumbing.

The fetcher

Every six hours, the fetcher wakes up, reads sources.json from S3, and polls each source:

RSS feeds — parsed directly
Hacker News — Algolia API, filtered to 150+ points
GitHub Trending — scraped with a 50-star floor
Google News — a handful of narrow queries

For each item, it computes url_hash = SHA256(url) and writes to DynamoDB with a conditional put: if that hash already exists, DynamoDB rejects the write and the fetcher ignores the rejection. That’s the entire dedup mechanism. No lookup, no in-memory set; the database refuses the duplicate. Each item carries a 30-day DynamoDB TTL, then expires.
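A minimal sketch of that conditional put, assuming a boto3-style `Table` object. The attribute names other than `url_hash` and `scored` (including the TTL attribute `expires_at`) and both function names are hypothetical. The error handling relies on botocore’s `ClientError` exposing the service error code under `.response`.

```python
import hashlib
import time
from typing import Optional

THIRTY_DAYS = 30 * 24 * 60 * 60  # TTL window in seconds


def build_put_request(url: str, title: str, now: Optional[int] = None) -> dict:
    """Build keyword arguments for a DynamoDB Table.put_item call."""
    now = int(time.time()) if now is None else now
    return {
        "Item": {
            "url_hash": hashlib.sha256(url.encode("utf-8")).hexdigest(),
            "url": url,
            "title": title,
            "scored": "false",
            "expires_at": now + THIRTY_DAYS,  # hypothetical TTL attribute name
        },
        # The write only succeeds when no item with this key exists yet.
        "ConditionExpression": "attribute_not_exists(url_hash)",
    }


def put_if_new(table, request: dict) -> bool:
    """Return True if the item was written, False if it was a duplicate."""
    try:
        table.put_item(**request)
        return True
    except Exception as exc:
        # botocore's ClientError carries the service error code in .response
        response = getattr(exc, "response", None) or {}
        code = response.get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False  # duplicate URL: swallow the rejection
        raise  # anything else is a real failure
```

With a real table this would be called as `put_if_new(table, build_put_request(url, title))`; every error other than the duplicate rejection still propagates.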

A typical run: 30–80 new items, each tagged scored="false".

The scorer

Fifteen minutes after each fetcher run, the scorer wakes up. It pulls items where scored="false" (oldest first, capped at 80) and calls Haiku with a locked prompt — same prompt every time, version-controlled, never edited on the fly.

The prompt asks Haiku to evaluate the item against four criteria: capability, community, frontier, and K12 policy. The item clears the bar if any one of them is met.

Haiku returns strict JSON: {clears_bar, criterion, confidence, reason}. If it parses, that’s the verdict. If it doesn’t — malformed output, missing field, anything weird — the result collapses to a safe “no.”
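The parse-or-collapse behavior can be sketched as below. The four field names come from the verdict shape above; the function name, the exact type checks, and the contents of the fallback verdict (for example `confidence: "low"`) are assumptions for illustration.

```python
import json

REQUIRED_FIELDS = ("clears_bar", "criterion", "confidence", "reason")

# Hypothetical fallback verdict: anything unparseable becomes a safe "no".
SAFE_NO = {
    "clears_bar": False,
    "criterion": None,
    "confidence": "low",
    "reason": "unparseable scorer output",
}


def parse_verdict(raw) -> dict:
    """Return the scorer's verdict, or SAFE_NO if the output is malformed."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return dict(SAFE_NO)
    if not isinstance(data, dict):
        return dict(SAFE_NO)
    if any(field not in data for field in REQUIRED_FIELDS):
        return dict(SAFE_NO)
    if not isinstance(data["clears_bar"], bool):
        return dict(SAFE_NO)
    if not isinstance(data["confidence"], str) or not isinstance(data["reason"], str):
        return dict(SAFE_NO)
    # Keep only the expected fields; extra keys are dropped.
    return {field: data[field] for field in REQUIRED_FIELDS}
```

Every failure path returns the same value, so there is exactly one fallback to reason about.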

Err toward silence. The system has exactly one fallback behavior, and it favors empty over wrong. An empty digest tomorrow is fine. A digest full of wrong items is not. This principle generalizes: when an AI system fails, what’s its default? Make sure you chose it deliberately.

The digest builder

Once a day at 7am Eastern, the digest builder queries for items where clears_bar=true, confidence="high", and delivered=false. It ranks by criterion priority (capability and community first, frontier next, K12 policy last), caps at five, and writes a digest row keyed by date.

If nothing cleared the bar, the row says has_items=false. Empty days are normal — the system surfaces them honestly rather than scraping the barrel for filler.
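The selection logic might look like the sketch below. The filter conditions, the priority order, the cap of five, and `has_items` come from the description above; the criterion key strings (such as `k12_policy`), the function name, and the row layout are assumptions.

```python
# Hypothetical criterion keys; lower number means higher priority.
CRITERION_PRIORITY = {"capability": 0, "community": 0, "frontier": 1, "k12_policy": 2}
MAX_ITEMS = 5


def build_digest(items: list, date: str) -> dict:
    """Pick at most five undelivered, high-confidence items for one day."""
    eligible = [
        item for item in items
        if item.get("clears_bar") is True
        and item.get("confidence") == "high"
        and not item.get("delivered", False)
    ]
    # sorted() is stable, so items with equal priority keep their input order.
    ranked = sorted(
        eligible,
        key=lambda item: CRITERION_PRIORITY.get(
            item.get("criterion"), len(CRITERION_PRIORITY)
        ),
    )
    chosen = ranked[:MAX_ITEMS]
    return {"date": date, "has_items": bool(chosen), "items": chosen}
```

An empty input produces a row with `has_items` set to `False` rather than no row at all, which keeps empty days visible.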

MCP and REST read directly from that digest row. No separate delivery step.

How do you know the scorer is any good?

You can read the verdicts and squint. You can spot-check a few items and decide they look reasonable. That’s how most AI features ship — and how you end up with a system that quietly drifts from what you want.

The honest version: build an eval framework. A labeled set of test cases, a way to replay your system against them, and metrics that tell you whether the system’s judgment matches yours.

For niu-news, that meant hand-labeling 67 real items from the fetcher’s history — clears the bar or doesn’t, and which criterion. That’s the golden set: ground truth the scorer is tested against.

The eval loop has four phases:

Phase 1 — golden set

Load 67 hand-labeled items. Each is a real news item paired with my judgment: clears the bar or doesn’t, and which criterion it best fits.

Phase 2 — measurement

Replay the scorer against every item using the production prompt and model. Compare each verdict to mine, then compute:

Precision — of items the scorer passed, how many actually should? Low precision = cluttered digest.
Recall — of items that should pass, how many did the scorer catch? Low recall = missed signals.
F1 — harmonic mean of both. Punishes you for being bad at either.
Criterion agreement — when we both say “clears,” do we agree why? Catches right-verdict-wrong-reason brittleness.
Confidence calibration — when the scorer says “high confidence,” is it actually right more often? Calibration failures are quiet and dangerous.
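The five measurements above can be computed from (label, verdict) pairs in a few lines. This is a sketch under stated assumptions: the function name and the dictionary layout are hypothetical, and calibration is represented simply as verdict accuracy per confidence level.

```python
from collections import defaultdict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(pairs) -> dict:
    """pairs: iterable of (gold, predicted) dicts.

    Both dicts carry 'clears_bar' and 'criterion'; predicted also
    carries 'confidence'.
    """
    tp = fp = fn = 0
    same_criterion = 0
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gold, pred in pairs:
        g, p = gold["clears_bar"], pred["clears_bar"]
        if g and p:
            tp += 1
            same_criterion += gold["criterion"] == pred["criterion"]
        elif p:
            fp += 1  # scorer passed an item that should not pass
        elif g:
            fn += 1  # scorer missed an item that should pass
        totals[pred["confidence"]] += 1
        hits[pred["confidence"]] += g == p
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "criterion_agreement": _ratio(same_criterion, tp),
        "accuracy_by_confidence": {c: hits[c] / totals[c] for c in totals},
    }
```

A well-calibrated scorer should show higher accuracy in the `high` bucket than in the others; if the buckets look the same, the confidence field carries no information.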

Phase 3 — judge

Verdict comparison tells you whether the scorer got it right. It doesn’t tell you whether the reasoning was defensible.

A second AI — Sonnet 4.6 — reads the scorer’s stated reason on each item and grades it on two axes. Is the reason specific (cites something from the source) or generic (would apply to anything)? Is it faithful (accurately describes the source) or hallucinated?

The judge itself needs calibration. It’s tested against a subset of hand-graded reasons — current agreement: 92.3%.
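Once the judge has graded each reason, the summary numbers are simple ratios. The sketch below assumes each grade is a dict with a `specificity` and a `faithfulness` label and that the judge may answer `unknown` on either axis. Those field names, the function names, and the choice to divide by all graded items are assumptions, not the project’s actual code.

```python
def summarize_judge(grades: list) -> dict:
    """Share of reasons that are specific and faithful, plus the unknown rate."""
    total = len(grades)
    if total == 0:
        return {"reason_quality": 0.0, "unknown_rate": 0.0}
    good = sum(
        g["specificity"] == "specific" and g["faithfulness"] == "faithful"
        for g in grades
    )
    unknown = sum(
        "unknown" in (g["specificity"], g["faithfulness"]) for g in grades
    )
    return {"reason_quality": good / total, "unknown_rate": unknown / total}


def judge_agreement(judge_grades: list, hand_grades: list) -> float:
    """Fraction of items where the judge's grade equals the hand grade."""
    pairs = list(zip(judge_grades, hand_grades))
    if not pairs:
        return 0.0
    return sum(j == h for j, h in pairs) / len(pairs)
```

`judge_agreement` is the calibration check on the judge itself: it compares the judge’s grades to a hand-graded subset, item by item.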

Phase 4 — regression gate

Everything wraps in a pytest suite that blocks deployment. The thresholds:

Metric	Minimum	Why this number
F1	≥ 0.75	Below this, too many wrong verdicts either direction
Reason quality	≥ 80%	Specific and faithful, per the judge
Run-to-run consistency	≥ 85%	Scorer is non-deterministic, so the threshold tolerates some run-to-run disagreement
Judge “unknown” rate	≤ 10%	If the judge can’t tell, the eval isn’t measuring

If any threshold trips, the build fails. The change doesn’t ship.
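A gate like this reduces to one comparison per metric. The sketch below uses the four thresholds from the table; the metric key names and the helper name are hypothetical, and computing the metrics (replaying the scorer, running the judge) is left out.

```python
# Thresholds from the table above; the key names are hypothetical.
THRESHOLDS = {
    "f1": ("min", 0.75),
    "reason_quality": ("min", 0.80),
    "consistency": ("min", 0.85),
    "judge_unknown_rate": ("max", 0.10),
}


def failed_thresholds(metrics: dict) -> list:
    """Return one message per violated threshold; an empty list means pass."""
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics[name]  # a missing metric raises KeyError on purpose
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(f"{name}={value:.3f} violates {kind} {limit}")
    return failures
```

Inside a pytest suite, a single test could then assert `failed_thresholds(metrics) == []`, so any violated threshold fails the run and the failure message names the metric.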

Why this part matters

You could replace every component of niu-news (different model, different sources, different storage) and it would still work. The eval framework is the exception: remove it and the system starts to drift, even though you wouldn’t notice for weeks.

Without evals, every change is a guess. Tweak the prompt to fix one bad verdict, create three new ones. Upgrade the model, lose calibration. Add a criterion, watch precision quietly collapse. No way to tell — there’s no measurement. With evals, every change is a hypothesis you can test.

The golden set encodes my editorial judgment; nobody else’s labels would produce the right answers for me. Generic news-relevance scoring is a solved problem for advertisers, and that solution is useless to me. The 67 labels bridge that gap, and the eval suite is the part of niu-news that’s most specifically, irreducibly mine.

The stack, annotated

Layer	Service	What it does	Free tier?
Runtime	AWS Lambda + API Gateway	Runs code on demand — no servers to manage	Yes (1M requests/mo)
Schedule	EventBridge	Wakes Lambdas on cron (every 6h; daily at 11:00 UTC, which is 7am Eastern during daylight time)	Yes
Item store	DynamoDB	Stores news items, scoring verdicts, daily digests	Yes (25 GB)
Source registry	S3	One `sources.json` file listing feeds and queries	Yes (5 GB)
Scorer	Claude Haiku 4.5	Reads each item, applies 4-criterion rubric	Pay-per-use (~$1/mo total)
Eval judge	Claude Sonnet 4.6	Grades the scorer’s reasoning quality	Pay-per-use
Interface	MCP + REST	Two doors in: AI tools use MCP, anything else uses REST	Open protocol
Auth	Lambda Authorizer	Resolves identity from JWT or x-api-key	Included with Lambda
Reliability	SQS DLQ	Catches failed Lambda invocations for inspection	Yes (1M requests/mo)

Six Lambdas, two tables, one bucket, one queue. Nearly all of it fits inside the AWS free tier; the only line item that costs real money is Haiku, at about a dollar a month.

Why build this yourself?

You could subscribe to a newsletter. There are good ones. But a newsletter encodes someone else’s editorial judgment, and the gap between their bar and mine is where the wasted hour lives.

niu-news works because the rubric is mine, the sources are mine, and the eval framework is something only I can build. A generic model would surface every Claude release. My rubric cares about which ones change a design decision.

The interesting part isn’t the Lambda or the DynamoDB schema. It’s the small, specific judgment encoded in code, and the measurement loop that keeps it honest.