Library
A personal research library searchable by AI — and a worked example of what makes semantic search actually good.
I have a folder of research papers and textbooks I’ve collected over years of working in EdTech — cognitive theory, measurement sciences, learning sciences, statistics, UX/LX research. niu-library makes that collection searchable by AI. I ask a question in natural language, and Claude reads the passages that actually answer it before responding.
When an enterprise team asks “should we use Vectara or build on Pinecone?” the answer depends on chunking strategy, reranking quality, and hybrid search trade-offs that are invisible from a vendor demo. Building niu-library is how I learned to see them — and the mental model transfers directly to evaluating any RAG platform.
This walkthrough follows one question — “What does the literature say about worked examples in math education?” — from query to answer, through every step in between.
TL;DR
- Why keyword search fails for research libraries — and what replaces it
- Two-stage retrieval: why one pass of vector search isn’t enough
- Parent-child chunking: search small units, read big sections
- The full stack from question to answer, annotated
The problem: keyword search is the wrong tool
Remember Ctrl+F inside a PDF? You remember the idea but not the exact words. You search for “worked examples” and miss the chapter that calls them “fully worked-out solutions.” You search for “cognitive load” and get fifty hits in a reference list before you find a paragraph of actual content.
The traditional fix is full-text indexing — basically Google for your PDFs. Better than Ctrl+F, but it still searches for words. It can’t tell you that “the expertise reversal effect” and “cognitive load increases with prior knowledge” are talking about the same thing.
What you want is to search by meaning. Ask “worked examples in math,” and the system surfaces passages about Sweller, about example-problem pairs, about faded scaffolding — whether or not they use your exact phrasing.
The building blocks
Every question travels through several systems before an answer comes back. Here’s the cast:
| Component | Role | Why this one |
|---|---|---|
| AWS Lambda + API Gateway | Serverless runtime and front door | No servers to manage; free tier covers all usage |
| S3 | PDF storage | Upload triggers the ingestion pipeline automatically |
| Pinecone | Vector database | Matches by meaning, not keywords; supports hybrid dense+sparse search |
| Voyage AI voyage-4 | Embedding model | Text → 1,024-number vectors representing meaning |
| Voyage AI rerank-2.5 | Cross-encoder reranker | Reads each candidate with the query for precise scoring |
| pymupdf4llm | PDF → structured markdown | Preserves headers, lists, paragraphs — splits at natural seams |
| Claude Haiku | Metadata extraction | Reads first pages to extract title, author, year, domain |
| MCP | Integration protocol | Any AI tool connects without custom wiring |
How a question becomes an answer
Here’s what happens when I ask Claude about worked examples.
[Diagram: the query pipeline. Steps with a blue stripe are AI-powered; dashed boxes are inputs and outputs.]
Embedding the question
The question goes to Voyage’s voyage-4 model, which returns a vector of 1,024 numbers — same model, same dimensions as the chunks already sitting in Pinecone.
Asymmetric embeddings. niu-library calls Voyage with input_type="query"; ingestion used input_type="document". Many modern embedding models, Voyage's included, encode the same text slightly differently depending on whether it is a question or a candidate answer, so that a query lands near the passages that answer it rather than near passages that merely resemble it.
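A minimal sketch of that asymmetry, assuming the `voyageai` Python client, whose `embed` method accepts `model` and `input_type` arguments and returns an object with an `embeddings` list. The helper names `embed_documents` and `embed_query` are hypothetical, not niu-library's actual functions; the client is passed in so the helpers can be exercised with a stand-in.

```python
from typing import Any, List

EMBED_MODEL = "voyage-4"  # embedding model named in this walkthrough


def embed_documents(client: Any, texts: List[str]) -> List[List[float]]:
    """Embed chunks at ingestion time, marked as documents."""
    result = client.embed(texts, model=EMBED_MODEL, input_type="document")
    return result.embeddings


def embed_query(client: Any, question: str) -> List[float]:
    """Embed a question at search time, marked as a query."""
    result = client.embed([question], model=EMBED_MODEL, input_type="query")
    return result.embeddings[0]
```

With the real library, `client` would be a `voyageai.Client()` instance, which reads its API key from the `VOYAGE_API_KEY` environment variable.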
Two namespaces in parallel
The library is split into two Pinecone namespaces: papers (research articles, under 50 pages) and books (textbooks and longer texts). Each uses its own chunk size — 512 tokens for papers, 1,024 for books — because a paper’s argument unfolds tightly while a textbook chapter has more room to breathe.
Both namespaces get searched in parallel, each returning its 40 best candidates. The score is a hybrid of two signals:
- Dense vectors — the meaning-based match described above.
- Sparse vectors — Pinecone’s pinecone-sparse-english-v0 model, closer to classical keyword matching but smarter. Catches cases where the exact phrase matters, like “Sweller (1988)” or “split-attention effect.”
Dense catches meaning. Sparse catches specificity. Hybrid scoring uses both.
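One common way to combine the two signals is a weighted sum of a dense dot product and a sparse dot product. The sketch below assumes that scheme; the weight `alpha` and the function names are illustrative, and nothing here says how niu-library actually weights dense against sparse.

```python
from typing import Dict, List


def dense_score(q: List[float], d: List[float]) -> float:
    """Dot product of two dense vectors of equal length."""
    return sum(a * b for a, b in zip(q, d))


def sparse_score(q: Dict[int, float], d: Dict[int, float]) -> float:
    """Dot product over the token ids that both sparse vectors contain."""
    return sum(weight * d[token] for token, weight in q.items() if token in d)


def hybrid_score(
    q_dense: List[float],
    d_dense: List[float],
    q_sparse: Dict[int, float],
    d_sparse: Dict[int, float],
    alpha: float = 0.7,  # assumed weight: 1.0 is pure dense, 0.0 is pure sparse
) -> float:
    """Blend the meaning-based score with the exact-term score."""
    return alpha * dense_score(q_dense, d_dense) + (1 - alpha) * sparse_score(
        q_sparse, d_sparse
    )
```

A passage that shares no exact terms with the query still scores through the dense term; a passage containing a rare exact phrase gets a boost from the sparse term.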
A noise pool of 80 candidates
At this point the system has 80 candidate passages. Some are excellent. Some are tangentially related. Some are off-topic — a software docs page that uses the phrase “worked example,” or a textbook section on math anxiety that mentions worked problems in passing.
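The fan-out that produces this pool can be sketched with a thread pool. The `search` callable is a hypothetical stand-in for whatever wraps the vector-store query (with the query vectors already bound); only the namespace names and the per-namespace limit of 40 come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

NAMESPACES = ("papers", "books")
TOP_K_PER_NAMESPACE = 40

Candidate = Dict[str, object]  # e.g. {"id": ..., "score": ..., "text": ...}


def gather_candidates(
    search: Callable[[str, int], List[Candidate]],
) -> List[Candidate]:
    """Query every namespace concurrently and merge the hits into one pool."""
    with ThreadPoolExecutor(max_workers=len(NAMESPACES)) as executor:
        futures = [
            executor.submit(search, namespace, TOP_K_PER_NAMESPACE)
            for namespace in NAMESPACES
        ]
        candidates: List[Candidate] = []
        for future in futures:
            candidates.extend(future.result())
    return candidates
```

The merged list is deliberately left unsorted: ordering is the reranker's job, not the first stage's.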
This is where naive RAG fails. It takes the top 8 by cosine similarity, hands them to the LLM, and hopes. The LLM cites whatever it receives, even passages with nothing useful to say about the question.
The core problem with single-stage RAG. Embedding similarity is a rough approximation of relevance, not a precise one. Two passages can land near your query for completely different reasons. That’s what the next step fixes.
Two-stage retrieval
This is the pattern that most often separates RAG systems that work from those that don’t. As of early 2026, it is standard practice in production systems.
Why one stage isn’t enough
When you embed a query and find the nearest chunks in vector space, you’re using a bi-encoder. The query and each passage were embedded separately, ahead of time. The comparison is just a distance calculation. Fast. Cheap. Also imprecise.
The imprecision is structural. A 1,024-number vector compresses an entire passage into a single point. Two passages can land near your query for different reasons — one because it answers your question, another because it shares vocabulary. The bi-encoder has no way to tell those apart. It never reads the query and the passage together.
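To make "just a distance calculation" concrete, here is a toy brute-force version of the first stage. It is a sketch only: a real vector database uses an approximate nearest-neighbor index rather than scanning every vector, and the function names are illustrative.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: the angle between two vectors, ignoring length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(
    query_vec: List[float], index: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every stored chunk vector against the query and keep the top k."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Nothing in `cosine` ever sees the words of the query or the passage. It compares two precomputed points, which is why it is fast and also why it cannot tell an answer from a vocabulary match.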
A cross-encoder does. You hand it the query and one passage as a single input. It reads them jointly — attending to how each word in the query relates to each word in the passage — and outputs a single relevance score. Far more accurate. Far more expensive. You can’t run it across millions of chunks. But 80 candidates? Easy.
- Stage 1 (broad, cheap): Bi-encoder embeddings cast a wide net and return a pool of plausible matches. Favors recall.
- Stage 2 (narrow, precise): Cross-encoder reranker reads each candidate alongside the query. Delivers precision.
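Stage 2 can be sketched as a single call to a reranker. This assumes the `voyageai` Python client, whose `rerank` method takes the query, a list of documents, a `model`, and a `top_k`, and returns results exposing `index` and `relevance_score`. The function name `rerank_candidates` and the candidate dict shape are hypothetical.

```python
from typing import Any, Dict, List

RERANK_MODEL = "rerank-2.5"  # reranker named in this walkthrough


def rerank_candidates(
    client: Any, query: str, candidates: List[Dict], top_k: int = 8
) -> List[Dict]:
    """Score each (query, passage) pair jointly and keep the best top_k."""
    texts = [candidate["text"] for candidate in candidates]
    reranking = client.rerank(query, texts, model=RERANK_MODEL, top_k=top_k)
    # Each result points back to a position in `texts` and carries its score.
    return [
        {**candidates[result.index], "rerank_score": result.relevance_score}
        for result in reranking.results
    ]
```

The input is the unsorted pool from stage 1; the output is a short list ordered by the cross-encoder's score, with the original candidate metadata preserved.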
What the quality jump looks like
[Diagram: candidate passages before and after reranking. Each box is a passage. Outlined boxes are actually relevant; muted ones are near-misses the embedding pulled in for surface reasons.]
The reranker doesn’t invent better passages — they were already in the pool of 80. It reorders. The math anxiety chapter got pulled in by surface similarity (“example” appears in the text); the reranker, having actually read it alongside the query, demotes it. Passages ranked twentieth or fortieth by cosine — but that genuinely answer the question — rise to the top.
The difference. Whether a RAG system hallucinates plausibly or quotes the right book often comes down to whether you rerank.
Why the first stage still matters
If reranking is better, why not rerank everything? Cost. The cross-encoder reads each passage with the query — running it against 1,500 chunks would take seconds and burn budget. The bi-encoder embeds everything once at ingestion time; at query time it’s one embedding and a fast geometric lookup.
Cheap-and-broad first, expensive-and-precise second. The first stage’s job isn’t to be right — it’s to be inclusive enough that the right answer is in its pool.
Why chunks have parents
A search hit is a small chunk — 512 or 1,024 tokens, maybe half a page. Enough to score well on relevance, but often not enough to answer. The paragraph saying “worked examples reduce extraneous load” is the right find. The two paragraphs around it — the experimental setup, the implications — are what you actually need to read.
niu-library solves this with a parent-child hierarchy. Each PDF splits into parent chunks (2–4 KB sections, header-aligned) and child chunks (smaller units inside them). Only the children get embedded and searched. Each child carries a parent_id pointing back to its surrounding section.
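A rough sketch of that hierarchy, under simplifying assumptions: parents are cut at markdown headers, and children are fixed windows of whitespace-separated words standing in for tokens. The class names, id format, and `child_words` parameter are illustrative, not niu-library's actual schema.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Parent:
    parent_id: str
    text: str


@dataclass
class Child:
    child_id: str
    parent_id: str  # pointer back to the surrounding section
    text: str


def split_parent_child(
    markdown: str, doc_id: str, child_words: int = 100
) -> Tuple[List[Parent], List[Child]]:
    """Cut header-aligned parents, then slice each parent into small children."""
    # Split just before each markdown header so every parent starts at a heading.
    pieces = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    sections = [piece.strip() for piece in pieces if piece.strip()]
    parents: List[Parent] = []
    children: List[Child] = []
    for p_idx, section in enumerate(sections):
        parent_id = f"{doc_id}#p{p_idx}"
        parents.append(Parent(parent_id, section))
        words = section.split()
        for c_idx, start in enumerate(range(0, len(words), child_words)):
            text = " ".join(words[start:start + child_words])
            children.append(Child(f"{parent_id}-c{c_idx}", parent_id, text))
    return parents, children
```

Only the `Child` texts would be embedded and indexed; the `Parent` texts are stored separately and fetched by `parent_id` when a hit needs more context.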
Search small, read big. The search index holds small, focused units — that’s what gives embeddings their precision. But the LLM reads the larger surrounding section — that’s what makes answers trustworthy. Parent-child chunking gives you both.
In practice, search_docs returns 8 high-precision hits with parent_id references. A follow-up retrieve_parent call expands any of them into the full surrounding section. Claude picks which ones to expand based on what it needs.
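The expansion step might look like the sketch below. The `retrieve_parent` signature and the in-memory `parent_store` dict are assumptions for illustration; how the real tool stores and looks up parent sections is not described here.

```python
from typing import Dict, List


def retrieve_parent(parent_id: str, parent_store: Dict[str, str]) -> str:
    """Return the full section a child chunk was cut from."""
    return parent_store[parent_id]


def expand_hits(
    hits: List[Dict[str, str]], parent_store: Dict[str, str]
) -> List[str]:
    """Expand the chosen hits into parent sections, reading each parent once."""
    seen = set()
    sections: List[str] = []
    for hit in hits:
        parent_id = hit["parent_id"]
        if parent_id not in seen:  # two children often share one parent
            seen.add(parent_id)
            sections.append(retrieve_parent(parent_id, parent_store))
    return sections
```

Deduplicating by `parent_id` matters in practice: when several top hits come from the same section, the model should read that section once rather than several times.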
What’s inside the corpus
The library currently holds 44 documents and roughly 1,500 vectors — papers and textbooks across six domains (AI/ML, learning design, measurement, education, statistics, other). Domain tags come from Haiku classification at ingestion time, with an S3 content fallback for documents whose metadata is too thin. You can filter searches by domain to scope your question.
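Domain scoping maps naturally onto a metadata filter. Pinecone filters use MongoDB-style operators such as `$eq` and `$in`; the sketch below assumes each vector carries a `domain` metadata field, which is a guess at the schema rather than something stated here.

```python
from typing import Dict, List, Optional


def domain_filter(domains: Optional[List[str]]) -> Optional[Dict]:
    """Build a metadata filter for the given domains, or None for no filter."""
    if not domains:
        return None  # search the whole library
    if len(domains) == 1:
        return {"domain": {"$eq": domains[0]}}
    return {"domain": {"$in": list(domains)}}
```

The returned dict would be passed as the `filter` argument of a Pinecone query, so the restriction is applied inside the vector search rather than after it.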
The ingestion pipeline runs automatically when a PDF lands in S3:
- Classify — paper or book, by page count
- Extract — structured markdown via pymupdf4llm; strip references and indexes that would poison search
- Enrich — metadata via Haiku + Semantic Scholar lookups
- Embed — dense (Voyage voyage-4) + sparse (Pinecone’s model)
- Upload — to the matching namespace
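The Classify step decides everything downstream, so it is worth sketching. The 50-page threshold and the chunk sizes come from the namespace description earlier; treating exactly 50 pages as a book, and the names `IngestPlan` and `classify`, are assumptions.

```python
from dataclasses import dataclass

PAPER_MAX_PAGES = 50  # papers are "under 50 pages"; anything longer is a book


@dataclass(frozen=True)
class IngestPlan:
    namespace: str     # Pinecone namespace the vectors are uploaded to
    chunk_tokens: int  # child chunk size used when splitting


def classify(page_count: int) -> IngestPlan:
    """Route a document to a namespace and chunk size by its page count."""
    if page_count < PAPER_MAX_PAGES:
        return IngestPlan(namespace="papers", chunk_tokens=512)
    return IngestPlan(namespace="books", chunk_tokens=1024)
```

For example, a 12-page article would be routed to `papers` with 512-token chunks, and a 400-page textbook to `books` with 1,024-token chunks.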
Four Lambda functions handle the whole flow — MCP server, OAuth, authorizer, vectorizer — all deployed by SAM.
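The entry point of the vectorizer can be sketched as a Lambda handler that reads the S3 event notification. The event layout (`Records`, `s3.bucket.name`, `s3.object.key`) is the standard S3 notification structure; the function names, the `.pdf` check, and the return value are illustrative, and the real pipeline steps are left as a comment.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def pdf_objects(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Extract (bucket, key) pairs for every PDF named in an S3 event."""
    found: List[Tuple[str, str]] = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in notifications (spaces arrive as "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".pdf"):
            found.append((bucket, key))
    return found


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Lambda entry point: find uploaded PDFs and hand them to the pipeline."""
    targets = pdf_objects(event)
    # A real vectorizer would download each PDF here and run
    # classify, extract, enrich, embed, and upload on it.
    return {"pdfs_found": len(targets)}
```

Decoding the key is easy to forget and only fails on filenames with spaces or special characters, which research PDFs often have.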
The stack, annotated
| Layer | Service | What it does | Free tier? |
|---|---|---|---|
| Runtime | AWS Lambda + API Gateway | Runs code on demand, no servers to manage | Yes (1M requests/mo) |
| Storage | S3 | Holds source PDFs; triggers vectorize on upload | Yes (5 GB) |
| Interface | MCP | One protocol any AI tool can plug into | Open standard |
| Auth | Lambda Authorizer + DynamoDB | JWT (OAuth 2.1) or x-api-key — resolves identity | Included |
| Extraction | pymupdf4llm | PDF → structured markdown, preserves headers | Open source |
| Metadata | Claude Haiku | Reads first pages, returns title/author/year/domain | Pay-per-use |
| Dense embedding | Voyage voyage-4 | Text → 1,024 numbers representing meaning | Free tier available |
| Sparse embedding | Pinecone pinecone-sparse-english-v0 | Smart keyword signal for hybrid search | Included with Pinecone |
| Vector store | Pinecone | Stores and searches by meaning | Yes (100K vectors) |
| Reranking | Voyage rerank-2.5 | Cross-encoder relevance scoring on top candidates | Free tier available |
| Deploy | AWS SAM | One command to package and ship all four Lambdas | Free (tooling) |
Why build this yourself?
You can pay for a hosted RAG product. Some are good. But the seams are hidden — chunking strategy, embedding choice, hybrid scoring, two-stage retrieval, parent expansion — all wrapped in a box that returns “the answer.”
Build it yourself and the seams are visible. Swap the reranker, change the chunk size, try a different embedding model, and watch the quality change. You learn what good even means.
That’s the ethos behind every project on this site: build the real thing, then show how every layer works so anyone can follow along.