Welcome to AIEdTalks’ Newsletter!

In today's edition:

  • Why your RAG sucks (and the 3-line fix)

  • The 2-Minute Rule — Stop Hoarding Tiny Tasks

Let’s dive in.

Today’s Edition

AI TOPIC
Why your RAG sucks (and the 3-line fix)

Last month I spent three full days debugging a RAG pipeline for a side project. The LLM kept hallucinating. My first instinct? Swap the model. I tried GPT-4o, Claude, Gemini — all of them.

None of them fixed it.

Turns out I'd been blaming the wrong part of the stack the whole time. Here's what I learned — and the one change that took my answer quality from "meh" to "actually useful."

Your RAG returns 10 chunks. 8 are garbage.

If you've built anything with retrieval-augmented generation, you've seen this: you ask a question, your vector DB returns the top 10 chunks, and the answer is still mediocre.

Here's what nobody tells you: out of those 10 chunks, maybe 2 are actually relevant. The other 8 are semantically similar noise.

Similar ≠ relevant. That's the whole problem.

When I logged every chunk my retriever returned and read them manually, I was horrified. My query was "How do I handle rate limits in the API?" and I was getting chunks about API authentication, API versioning, and API deprecation policies — all "API" content, none of it about rate limits.
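If you want to run the same sanity check on your own pipeline, a tiny audit helper is enough. The `results` list below is toy data standing in for whatever your vector DB client returns — the point is just to print score plus chunk so irrelevant hits jump out at you:

```python
# Quick retrieval audit: dump every chunk your retriever returns for a
# query and read them yourself before blaming the LLM. `results` is toy
# data here; in your pipeline it comes from your vector DB client.
results = [
    (0.91, "API authentication best practices"),
    (0.89, "Handling rate limits with exponential backoff"),
    (0.88, "API versioning and deprecation policy"),
]

def audit(query, results):
    """Print score + chunk preview so irrelevant hits are obvious."""
    lines = [f"QUERY: {query}"]
    for score, text in results:
        lines.append(f"  {score:.2f}  {text[:80]}")
    return "\n".join(lines)

print(audit("How do I handle rate limits in the API?", results))
```

Five minutes of reading that output taught me more than three days of model-swapping.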

The LLM wasn't hallucinating. It was doing its best with garbage input.

Why this happens

Vector search does exactly one thing: it finds chunks whose embeddings sit close to your query's embedding. That's it. It has no idea what you actually need.

A chunk about "API authentication best practices" lives near "API rate limit errors" in embedding space because they share vocabulary. A naive top-k retrieval can't tell them apart. You're trusting a single similarity score to judge a nuanced question.

That's like hiring a librarian who only reads book titles.
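You can see the failure mode with nothing fancier than bag-of-words cosine similarity — a crude stand-in for embeddings, but it fails the same way on shared vocabulary:

```python
# Toy illustration of why "similar" != "relevant". Bag-of-words cosine
# stands in for embedding similarity; real embeddings are denser but
# suffer the same failure mode when two texts share vocabulary.
import math
import re
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two strings."""
    va = Counter(re.findall(r"[a-z]+", a.lower()))
    vb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

query = "how to handle api rate limit errors"
relevant = "api rate limit errors and how to retry"
lookalike = "api authentication errors and how to debug"

# The look-alike chunk shares "api ... errors and how to" with the
# query, so it scores well despite being useless for the question.
print(cosine(query, relevant), cosine(query, lookalike))
```

The irrelevant authentication chunk still clears a similarity score most top-k cutoffs would happily accept.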

The fix: retrieve wide, rerank narrow

Here's the retrieve-then-rerank pattern that changed everything for me:

  1. Retrieve more, not less. Pull top 25–50 chunks instead of top 10. Yes, more noise — but that's fine because of step 2.

  2. Rerank with a cross-encoder. A reranker (Cohere Rerank, BGE Reranker) actually reads your query AND each chunk together, then scores how relevant each one is. Not similar. Relevant.

  3. Pass only the top 3–5 reranked chunks to your LLM.
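The steps above can be sketched in a few lines. The word-overlap scorer below is a deliberately dumb stand-in for a real cross-encoder (Cohere Rerank, a BGE reranker) — but even it shows the filtering effect:

```python
# Retrieve wide, rerank narrow. score_pair() is a toy stand-in for a
# cross-encoder; a real reranker reads query and chunk together and is
# far smarter about relevance.
import re

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score_pair(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words present in the chunk."""
    q = tokenize(query)
    return len(q & tokenize(chunk)) / len(q)

def rerank(query, chunks, top_n=5):
    """Stage 2: score every (query, chunk) pair, keep the best top_n."""
    return sorted(chunks, key=lambda c: score_pair(query, c), reverse=True)[:top_n]

# Stage 1: retrieve wide -- pretend these came back from vector search
# (in practice pull 25-50, not 6).
retrieved = [
    "API authentication best practices and token storage",
    "Handling rate limits: respect the Retry-After header",
    "API versioning policy and deprecation timelines",
    "Rate limits return HTTP 429; back off until the reset window",
    "Rotating API keys safely",
    "Webhook retry behavior on delivery failure",
]

query = "How do I handle rate limits in the API?"
top = rerank(query, retrieved, top_n=2)
# Only the two rate-limit chunks survive; the "API" noise is filtered out.
```

Swap `score_pair` for a real cross-encoder and you have the whole pattern.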

Here's the counterintuitive part: shrinking your context window improves answers. Fewer, better chunks beat more, worse chunks every single time. The "Lost in the Middle" research confirms it — LLMs ignore content buried in long contexts anyway.

Before vs after in my pipeline

  • Before: top 10 from vector search → LLM → roughly 60% answer quality (vibes-based eval, I won't lie)

  • After: top 50 → rerank → top 5 → LLM → roughly 90% answer quality, and faster because the context is smaller

The takeaway

Stop blaming the LLM. Your retrieval layer is almost always the weakest link. Add a reranker before you change anything else — model, prompt, or chunking strategy. It's the single highest-ROI fix in the RAG stack.

Quick picks this week

  1. Cohere Rerank v3 — Plug-and-play reranking API. 3 lines of code, massive quality jump.

  2. BGE Reranker (open source) — Self-hosted option if you're privacy-conscious or cost-sensitive.

  3. Anthropic's Contextual Retrieval — Prepend context to each chunk before embedding. Reportedly cuts retrieval failures by ~49%.

  4. RAGAS — Evaluate your RAG pipeline with real metrics instead of vibes.

  5. "Lost in the Middle" paper (Liu et al.) — The research showing why stuffing long contexts backfires.

Tip of the week

Add a reranker today — it's 3 lines of code.

If you're on LlamaIndex or LangChain, plugging in Cohere Rerank is genuinely trivial: pull 25 chunks, pipe them through the reranker, take the top 5. You'll see the jump immediately. Budget: 5 minutes and about $0.001 per query.
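The "three lines" look roughly like this with Cohere's Python SDK. The `co.rerank` call shape (model name, `query`, `documents`, `top_n`, results with `.index`) follows Cohere's v3 API; the stub client below just mimics that response shape so the sketch runs offline without an API key:

```python
# Sketch of the "pull wide, rerank, keep 5" step with a Cohere-style
# client. In production, co = cohere.Client(api_key); the stub below
# stands in for it so this file runs offline.
from types import SimpleNamespace

def rerank_top_n(co, query, chunks, top_n=5):
    # The three lines: call rerank, read back the winning indices.
    resp = co.rerank(model="rerank-english-v3.0", query=query,
                     documents=chunks, top_n=top_n)
    return [chunks[r.index] for r in resp.results]

class StubCohere:
    """Offline stand-in: 'relevance' = shared-word count with the query."""
    def rerank(self, model, query, documents, top_n):
        q = set(query.lower().split())
        order = sorted(range(len(documents)),
                       key=lambda i: len(q & set(documents[i].lower().split())),
                       reverse=True)
        return SimpleNamespace(results=[SimpleNamespace(index=i)
                                        for i in order[:top_n]])

chunks = ["auth token rotation", "rate limit handling guide",
          "deprecation policy", "rate limit error codes"]
top = rerank_top_n(StubCohere(), "rate limit", chunks, top_n=2)
```

Replace the stub with a real `cohere.Client` and the rest of the code doesn't change.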

Try it right now. You'll never go back.

Reply prompt: What's the most broken part of your AI stack right now? Hit reply — I read every response, and some of them become future editions.

PRODUCTIVITY TUTORIAL

The 2-Minute Rule — Stop Hoarding Tiny Tasks

The Context: David Allen, author of Getting Things Done, has one rule that's probably saved more hours than any productivity app ever will: "If an action takes less than two minutes, do it the moment it shows up." Most of your daily stress isn't from big projects — it's from the dozens of tiny, unfinished things piling up in your brain. Reply to that email. File that receipt. Put the dish away. Each one seems trivial. Together, they're loud.

Step-by-step:

  1. When a new task lands — email, message, request, thought — ask one question: "Will this take less than 2 minutes?"

  2. If yes — do it immediately. Not later. Not after you finish this one thing. Now.

  3. If no — capture it in your task list and move on. Don't half-process it.

The magic: by the end of the day, you've handled 30+ things without them ever entering your "stuff to worry about" pile.

Give it a try!

👋 That’s All Folks!

Before you go, just a few public service announcements:

  • Do you have a topic in mind you'd like us to cover? DM me.

  • Looking to sponsor AIEdTalks’ Newsletter? DM me, and we’ll get back to you asap.

See you soon,

AIEdTalks’ Newsletter Team
