Peter Niu

News

A daily AI/EdTech news digest — and a worked example of how to test whether an AI system matches your judgment.

active
Updated May 20, 2026

Every morning at 7am Eastern, niu-news hands me at most five news items worth my time. The AI/EdTech firehose is fifty stories a day — niu-news reads all of them and cuts forty-five. It’s a small piece of personal infrastructure, and a worked example of the question that matters most for any AI system: how do you know it’s actually matching your judgment?

The point isn’t that everyone should build their own news pipeline. It’s that “AI-powered curation” is a claim you’ll hear from a dozen SaaS tools this year — and once you’ve built the scoring rubric, the eval framework, and the regression gate yourself, you know exactly which questions to ask before buying one.

TL;DR

  • How a three-stage pipeline (fetch → score → digest) filters signal from noise
  • The four-criterion scoring rubric and why “any one passes” beats “weighted average”
  • How to build an eval framework: golden set, replay, judge, regression gate
  • Why the eval is the part whose removal would break the system first

The problem: too much signal, no time

I want to stay current on three things: frontier model capability, emerging best practices that practitioners are stress-testing, and structural shifts in K12 policy. The sources — Hacker News, RSS feeds, GitHub Trending, Google News — surface those signals, but also a lot of noise. Launches that don’t matter. Hot takes. Posts restating yesterday’s news.

Reading them all takes an hour I don’t have. Skipping them means missing things I’d want to know. niu-news sits in the middle — reads everything, scores each item against a strict rubric, and only escalates what clears the bar.

The building blocks

Everything here has a free tier or pay-as-you-go pricing. Total cost: about a dollar a month.

| Component | Role | Why this one |
|---|---|---|
| AWS Lambda | Serverless runtime | Billed by the millisecond; no servers to keep running |
| EventBridge | Scheduler | Wakes each Lambda on its own cron: every 6h, every 6h + 15m, daily 7am ET |
| DynamoDB | Item + verdict + digest store | Fast, cheap, free up to 25 GB |
| S3 | Source registry | One sources.json file; edit it, next run picks up the change |
| Claude Haiku 4.5 | Scorer (hot path) | Small and cheap enough to run on every news item |
| Claude Sonnet 4.6 | Eval judge (audit path) | Larger model grades the scorer’s reasoning quality |
| MCP + REST | Interface | AI tools connect via MCP; everything else via REST |

Cost architecture. Small model on the hot path, big model on the audit path. That’s the entire cost story — and it generalizes to any system where you need cheap decisions at volume plus expensive quality checks on the side.

How a news item moves through the system

The pipeline is three Lambda functions on independent schedules. They don’t call each other — they read from and write to DynamoDB, and EventBridge wakes each one up on its own clock.
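As a rough sketch of what "its own clock" could look like, the three schedules might be written as EventBridge cron expressions. The rule names and the exact minute offsets below are assumptions, not taken from the project; only the cadence (every 6h, 15 minutes later, daily at 11 UTC) comes from the text. EventBridge rule cron expressions have six fields and are evaluated in UTC, so 11:00 UTC lines up with 7am Eastern only while daylight time is in effect.

```python
# Hypothetical rule names; the cadence comes from the article, the exact
# fields are an illustrative assumption.
SCHEDULES = {
    "fetcher": "cron(0 0/6 * * ? *)",         # minute 0 of every sixth hour
    "scorer": "cron(15 0/6 * * ? *)",         # 15 minutes after each fetcher run
    "digest_builder": "cron(0 11 * * ? *)",   # 11:00 UTC daily
}


def cron_fields(expression: str) -> list[str]:
    """Split 'cron(a b c d e f)' into its six fields."""
    if not (expression.startswith("cron(") and expression.endswith(")")):
        raise ValueError(f"not a cron expression: {expression!r}")
    return expression[len("cron("):-1].split()
```

Because the functions share no call path, moving one schedule never requires touching the other two.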

Pipeline at a glance:

  • Fetcher (every 6h): polls RSS, Hacker News, GitHub Trending, and Google News; deduplicates by URL hash; writes unscored items to DynamoDB.
  • Scorer (every 6h + 15m, AI-powered): pulls up to 80 unscored items, oldest first; applies the 4-criterion rubric (does ANY criterion clear the bar?); writes verdict, criterion, confidence, and reason, leaving each item marked clears_bar=true|false.
  • Digest Builder (daily 7am ET): filters clears_bar=true AND confidence=high AND undelivered; ranks by criterion priority and caps at 5 items; writes today's digest row (or has_items=false).
  • Output: at most 5 items, served via REST + MCP.

Only the scorer step is AI-powered. The intelligence is concentrated in that one place, and everything else around it is plumbing.

The fetcher

Every six hours, the fetcher wakes up, reads sources.json from S3, and polls each source:

  • RSS feeds — parsed directly
  • Hacker News — Algolia API, filtered to 150+ points
  • GitHub Trending — scraped with a 50-star floor
  • Google News — a handful of narrow queries

For each item, it computes url_hash = SHA256(url) and writes to DynamoDB with a conditional put — if that hash exists, the write fails silently. That’s the entire dedup mechanism. No lookup, no in-memory set — the database refuses the duplicate. Each item lives 30 days via DynamoDB TTL, then disappears.
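The dedup idea can be sketched without AWS at all. In the snippet below, `InMemoryTable` and `DuplicateItem` are hypothetical stand-ins: on DynamoDB the same effect would come from a put with a condition such as `attribute_not_exists(url_hash)`, which raises a conditional-check failure when the key already exists. The field names other than `url_hash` and `scored` are assumptions.

```python
import hashlib


class DuplicateItem(Exception):
    """Stand-in for DynamoDB's conditional-check failure."""


class InMemoryTable:
    """Hypothetical stand-in for a table keyed by url_hash."""

    def __init__(self):
        self.rows = {}

    def put_if_absent(self, item):
        # Mirrors a conditional put: refuse the write if the key exists.
        if item["url_hash"] in self.rows:
            raise DuplicateItem(item["url_hash"])
        self.rows[item["url_hash"]] = item


def url_hash(url: str) -> str:
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def store_item(table, url: str, title: str) -> bool:
    """Write one unscored item; return False if the store rejected a duplicate."""
    item = {"url_hash": url_hash(url), "url": url, "title": title, "scored": "false"}
    try:
        table.put_if_absent(item)
    except DuplicateItem:
        return False  # fail silently: the item is already stored
    return True
```

The caller never checks for existence first; the write itself is the check, so two concurrent fetcher runs cannot both insert the same URL.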

A typical run: 30–80 new items, each tagged scored="false".

The scorer

Fifteen minutes after each fetcher run, the scorer wakes up. It pulls items where scored="false" (oldest first, capped at 80) and calls Haiku with a locked prompt — same prompt every time, version-controlled, never edited on the fly.

The prompt asks Haiku to evaluate the item against four criteria. The item clears the bar if any one of them is met:

Example item: "Anthropic releases Claude 4.7" (title + summary + source). Claude Haiku 4.5 reads it and tests it against the four criteria; it passes if ANY one criterion clears the bar.

  • Criterion 1, Capability: model capability, architectural pattern, or tool that changes design decisions
  • Criterion 2, Community: emerging practice stress-tested by practitioners (HN, trending repos)
  • Criterion 3, K12 Policy: regulatory or structural shift in K12 education
  • Criterion 4, Frontier: advances what's known about frontier lab capabilities

Strict JSON verdict: { "clears_bar": true, "criterion": 1, "confidence": "high", "reason": "new model with..." }

Outcome: eligible for digest, or cut (still queryable).

Haiku returns strict JSON: {clears_bar, criterion, confidence, reason}. If it parses, that’s the verdict. If it doesn’t — malformed output, missing field, anything weird — the result collapses to a safe “no.”
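A minimal sketch of that collapse-to-no behavior, assuming the four fields named above. The allowed confidence levels other than "high", the shape of `SAFE_NO`, and the function name are illustrative assumptions rather than the project's actual code.

```python
import json

REQUIRED = ("clears_bar", "criterion", "confidence", "reason")
CONFIDENCE_LEVELS = ("high", "medium", "low")  # assumed; only "high" appears in the text

# The single fallback: anything unparseable becomes a "no".
SAFE_NO = {
    "clears_bar": False,
    "criterion": None,
    "confidence": "low",
    "reason": "unparseable scorer output",
}


def parse_verdict(raw) -> dict:
    """Return the scorer's verdict, or SAFE_NO if the output is malformed."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # not a string, or not valid JSON
        return dict(SAFE_NO)
    if not isinstance(data, dict) or not all(key in data for key in REQUIRED):
        return dict(SAFE_NO)
    criterion = data["criterion"]
    well_formed = (
        isinstance(data["clears_bar"], bool)
        and data["confidence"] in CONFIDENCE_LEVELS
        and isinstance(data["reason"], str)
        # A passing verdict must name one of the four criteria.
        and (not data["clears_bar"] or (type(criterion) is int and 1 <= criterion <= 4))
    )
    if not well_formed:
        return dict(SAFE_NO)
    return {key: data[key] for key in REQUIRED}
```

Every failure path returns the same value, so there is exactly one fallback to reason about when an item goes missing from a digest.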

Err toward silence. The system has exactly one fallback behavior, and it favors empty over wrong. An empty digest tomorrow is fine. A digest full of wrong items is not. This principle generalizes: when an AI system fails, what’s its default? Make sure you chose it deliberately.

The digest builder

Once a day at 7am Eastern, the digest builder queries for items where clears_bar=true, confidence="high", and delivered=false. It ranks by criterion priority (capability and community first, frontier next, K12 policy last), caps at five, and writes a digest row keyed by date.

If nothing cleared the bar, the row says has_items=false. Empty days are normal — the system surfaces them honestly rather than scraping the barrel for filler.
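The filter, rank, and cap steps fit in a few lines. This is a sketch over plain dicts rather than DynamoDB rows; the criterion numbering follows the rubric (1 capability, 2 community, 3 K12 policy, 4 frontier), while the `PRIORITY` table, the function name, and the digest row shape are assumptions.

```python
# Lower number ranks first: capability and community, then frontier, then K12 policy.
PRIORITY = {1: 0, 2: 0, 4: 1, 3: 2}
MAX_ITEMS = 5


def build_digest(items, date):
    """Return the digest row for one day from a list of scored items."""
    eligible = [
        item for item in items
        if item["clears_bar"] and item["confidence"] == "high" and not item["delivered"]
    ]
    # sorted() is stable, so items in the same priority tier keep their input order.
    ranked = sorted(eligible, key=lambda item: PRIORITY[item["criterion"]])
    picked = ranked[:MAX_ITEMS]
    return {"date": date, "has_items": bool(picked), "items": picked}
```

An empty input produces a row with has_items set to False rather than no row at all, which is what lets readers of the digest tell "nothing cleared the bar" apart from "the job did not run".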

MCP and REST read directly from that digest row. No separate delivery step.

How do you know the scorer is any good?

You can read the verdicts and squint. You can spot-check a few items and decide they look reasonable. That’s how most AI features ship — and how you end up with a system that quietly drifts from what you want.

The honest version: build an eval framework. A labeled set of test cases, a way to replay your system against them, and metrics that tell you whether the system’s judgment matches yours.

For niu-news, that meant hand-labeling 67 real items from the fetcher’s history — clears the bar or doesn’t, and which criterion. That’s the golden set: ground truth the scorer is tested against.

The eval loop at a glance:

  • Golden set: 67 hand-labeled items, each with a verdict and criterion ("what I would say")
  • Replay the scorer (Haiku 4.5): same prompt, same model, on each item in the set
  • Compare verdicts, compute metrics: precision, recall, F1, criterion agreement, confidence calibration, run-to-run consistency
  • Judge the reasoning (Sonnet 4.6): did the scorer's stated reason actually justify the verdict? Graded on specificity and faithfulness to the source text
  • Regression gate: F1 ≥ 0.75, reason quality ≥ 80%, consistency ≥ 85%. If any threshold fails, the build fails
  • Outcome: ship the change, or block and investigate

The eval loop has four phases:

Phase 1 — golden set

Load 67 hand-labeled items. Each is a real news item paired with my judgment: clears the bar or doesn’t, and which criterion it best fits.

Phase 2 — measurement

Replay the scorer against every item using the production prompt and model. Compare each verdict to mine, then compute:

  • Precision — of items the scorer passed, how many actually should? Low precision = cluttered digest.
  • Recall — of items that should pass, how many did the scorer catch? Low recall = missed signals.
  • F1 — harmonic mean of both. Punishes you for being bad at either.
  • Criterion agreement — when we both say “clears,” do we agree why? Catches right-verdict-wrong-reason brittleness.
  • Confidence calibration — when the scorer says “high confidence,” is it actually right more often? Calibration failures are quiet and dangerous.
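The metrics above reduce to a handful of counts. The sketch below assumes golden labels and scorer verdicts arrive as parallel lists of dicts with the field names used earlier; the function names are illustrative, and calibration is approximated here as verdict accuracy per confidence level.

```python
from collections import defaultdict


def verdict_metrics(golden, predicted):
    """Precision, recall, F1, and criterion agreement over paired verdicts."""
    tp = fp = fn = same_criterion = 0
    for gold, pred in zip(golden, predicted):
        if pred["clears_bar"] and gold["clears_bar"]:
            tp += 1
            same_criterion += pred["criterion"] == gold["criterion"]
        elif pred["clears_bar"]:
            fp += 1  # scorer passed an item the label rejects
        elif gold["clears_bar"]:
            fn += 1  # scorer cut an item the label accepts
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "criterion_agreement": same_criterion / tp if tp else 0.0,
    }


def accuracy_by_confidence(golden, predicted):
    """Share of correct verdicts at each confidence level the scorer reported."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gold, pred in zip(golden, predicted):
        totals[pred["confidence"]] += 1
        hits[pred["confidence"]] += pred["clears_bar"] == gold["clears_bar"]
    return {level: hits[level] / totals[level] for level in totals}
```

A well-calibrated scorer shows higher accuracy in the "high" bucket than in the others; if the buckets look alike, the confidence field carries no information and filtering on it buys nothing.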

Phase 3 — judge

Verdict comparison tells you whether the scorer got it right. It doesn’t tell you whether the reasoning was defensible.

A second AI — Sonnet 4.6 — reads the scorer’s stated reason on each item and grades it on two axes. Is the reason specific (cites something from the source) or generic (would apply to anything)? Is it faithful (accurately describes the source) or hallucinated?

The judge itself needs calibration. It’s tested against a subset of hand-graded reasons — current agreement: 92.3%.

Phase 4 — regression gate

Everything wraps in a pytest suite that blocks deployment. The thresholds:

| Metric | Threshold | Why this number |
|---|---|---|
| F1 | ≥ 0.75 | Below this, too many wrong verdicts either direction |
| Reason quality | ≥ 80% | Specific and faithful, per the judge |
| Run-to-run consistency | ≥ 85% | Scorer is non-deterministic; tolerance is calibrated |
| Judge “unknown” rate | ≤ 10% | If the judge can’t tell, the eval isn’t measuring |

If any threshold trips, the build fails. The change doesn’t ship.
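One way to express the gate, as a sketch: a table of thresholds and a function that lists every violation. The threshold values come from the table above; the metric key names and `gate_failures` are assumptions. In a pytest suite, a test function would compute the metrics and assert that this list is empty.

```python
# ("min", x) means the metric must be at least x; ("max", x) means at most x.
THRESHOLDS = {
    "f1": ("min", 0.75),
    "reason_quality": ("min", 0.80),
    "consistency": ("min", 0.85),
    "judge_unknown_rate": ("max", 0.10),
}


def gate_failures(metrics: dict) -> list[str]:
    """Return one message per violated threshold; an empty list means ship."""
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics[name]
        passed = value >= limit if kind == "min" else value <= limit
        if not passed:
            failures.append(f"{name}={value:.3f} violates {kind} {limit}")
    return failures
```

Collecting all failures instead of stopping at the first makes a blocked build easier to diagnose, since a single prompt change often moves several metrics at once.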

Why this part matters

You could replace every component of niu-news (different model, different sources, different storage) and it would still work. The eval framework is the one part whose removal would break things first, even though you wouldn’t notice for weeks.

Without evals, every change is a guess. Tweak the prompt to fix one bad verdict, create three new ones. Upgrade the model, lose calibration. Add a criterion, watch precision quietly collapse. No way to tell — there’s no measurement. With evals, every change is a hypothesis you can test.

The golden set encodes my editorial judgment — nobody else’s labels would produce the right answers for me. Generic news-relevance scoring is a solved problem for advertisers. It’s a useless problem for me. The 67 labels bridge that gap, and the eval suite is the part of niu-news that’s most specifically, irreducibly mine.

The stack, annotated

| Layer | Service | What it does | Free tier? |
|---|---|---|---|
| Runtime | AWS Lambda + API Gateway | Runs code on demand; no servers to manage | Yes (1M requests/mo) |
| Schedule | EventBridge | Wakes Lambdas on cron (every 6h, daily 11 UTC) | Yes |
| Item store | DynamoDB | Stores news items, scoring verdicts, daily digests | Yes (25GB) |
| Source registry | S3 | One sources.json file listing feeds and queries | Yes (5GB) |
| Scorer | Claude Haiku 4.5 | Reads each item, applies 4-criterion rubric | Pay-per-use (~$1/mo total) |
| Eval judge | Claude Sonnet 4.6 | Grades the scorer’s reasoning quality | Pay-per-use |
| Interface | MCP + REST | Two doors in: AI tools use MCP, anything else uses REST | Open protocol |
| Auth | Lambda Authorizer | Resolves identity from JWT or x-api-key | Included with Lambda |
| Reliability | SQS DLQ | Catches failed Lambda invocations for inspection | Yes (1M requests/mo) |

Six Lambdas, two tables, one bucket, one queue. The whole thing fits inside the free tier of most AWS services; the only line item that costs real money is Haiku, and it runs about a dollar a month.

Why build this yourself?

You could subscribe to a newsletter. There are good ones. But a newsletter encodes someone else’s editorial judgment, and the gap between their bar and mine is where the wasted hour lives.

niu-news works because the rubric is mine, the sources are mine, and the eval framework is something only I can build. A generic model would surface every Claude release. My rubric cares about which ones change a design decision.

The interesting part isn’t the Lambda or the DynamoDB schema. It’s the small, specific judgment encoded in code, and the measurement loop that keeps it honest.