Day 3: Chunking — The Make-or-Break Decision in RAG

Effective RAG systems rely on optimal text chunking. Explore strategies like fixed-size, overlapping, and semantic methods for improved results.

Parathan Thiyagalingam
May 7, 2026 · 4 min read

Today, we zoom in on the step that happens before embedding: chunking. It quietly decides whether your RAG system is amazing or unusable.

You can have the best embedding model and the fanciest vector DB. Bad chunking will still ruin your RAG.

Why Chunk At All?

Why not just embed the entire document as a single giant vector? Three reasons:

  1. Context window limits — LLMs can't read 200-page PDFs in a single prompt.
  2. Retrieval precision — A "refund policy" question needs one paragraph, not the whole handbook.
  3. Embedding quality — Embedding a whole book averages it into vague mush. Smaller pieces = sharper meaning.

So we split. The question is how.

Every chunking strategy fights the same tradeoff:

  1. Too small: Loses context → pronouns lose their referents
  2. Too big: Multiple ideas dilute each other → noisy retrieval
  3. Just right: One coherent idea per chunk

There's no universal "correct" size, but there are sensible defaults.

Strategy 1: Fixed-Size Chunking

The simplest approach. Just split text into equal pieces (e.g., 500 tokens each).

Document: "The quick brown fox jumps over the lazy dog. The dog barked..."

Chunk 1: "The quick brown fox jumps over the lazy do"
Chunk 2: "g. The dog barked..."
↑ disaster
  1. Pro: Dead simple
  2. Con: Cuts mid-sentence, mid-word, mid-thought
Tip: Always count by tokens, not characters. Tokens are how LLMs and embedding models actually perceive text.
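A minimal sketch of fixed-size chunking. For simplicity it splits on whitespace words as a crude stand-in for real model tokens; production code would count tokens with the embedding model's own tokenizer, per the tip above.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into chunks of at most chunk_size tokens each."""
    tokens = text.split()  # crude stand-in for a real tokenizer
    return [
        " ".join(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

doc = "The quick brown fox jumps over the lazy dog. The dog barked loudly."
print(fixed_size_chunks(doc, chunk_size=5))
# → ['The quick brown fox jumps', 'over the lazy dog. The', 'dog barked loudly.']
```

Even with word-level splitting (which at least avoids the mid-word "do / g." disaster above), notice how the sentence is still cut in half.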

Strategy 2: Overlapping Chunks (Sliding Window)

Same as fixed-size, but each chunk overlaps the next.

Doc:     [────────────────────────────────────────]
Chunk 1: [───────────]
Chunk 2:        [───────────]
Chunk 3:               [───────────]
                ↑ overlap ↑

If an important sentence falls right at a boundary, it still appears fully in at least one chunk. Safety net.

Typical settings: 500-token chunks with 50–100 token overlap.

  1. Pro: Preserves context across boundaries
  2. Pro: The best beginner default
  3. Con: Slightly more storage (overlap is embedded twice)
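The sliding window is a small tweak to the fixed-size splitter: advance by chunk_size minus overlap instead of by chunk_size. Again, whitespace words stand in for real tokens in this sketch.

```python
def overlapping_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Sliding-window split: consecutive chunks share `overlap` tokens.

    Assumes chunk_size > overlap, otherwise the window never advances.
    """
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[i : i + chunk_size]))
        if i + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

With toy numbers (chunk_size=4, overlap=2) on the tokens "0" through "9", you get "0 1 2 3", "2 3 4 5", "4 5 6 7", "6 7 8 9": every boundary token appears whole in at least one chunk.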

Strategy 3: Semantic Chunking

Split where the meaning changes, not at arbitrary intervals.

How: break the doc into sentences, embed each one, and start a new chunk whenever similarity to the previous sentence drops sharply (a topic shift).

"The Roman Empire began in 27 BC." ← topic: Rome
"Augustus became its first emperor." ← still Rome
"Meanwhile, in China, the Han dynasty..." ← topic shift! NEW CHUNK
  1. Pro: Highest-quality, most coherent chunks
  2. Con: Slower and more expensive (you're embedding everything just to chunk it)
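Here's a runnable sketch of the idea. To stay self-contained it uses a bag-of-words vector and cosine similarity as a crude stand-in for a real embedding model, and the 0.2 threshold is an arbitrary illustration; a real system would embed each sentence with a proper model and tune the threshold empirically.

```python
import math
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk when similarity to the previous sentence drops below threshold."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append([cur])        # topic shift → new chunk
        else:
            chunks[-1].append(cur)      # same topic → keep growing
    return chunks

sentences = [
    "The Roman Empire began in 27 BC.",
    "The Roman Empire had its first emperor, Augustus.",
    "Meanwhile, in China, the Han dynasty rose to power.",
]
print(semantic_chunks(sentences))
```

The two Rome sentences stay together; the similarity drop at the China sentence triggers a new chunk, mirroring the example above.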

One Bonus You Should Know: Recursive Splitting

The default in libraries like LangChain. It tries to split on natural separators in order of preference: paragraphs → sentences → words → characters.

It's almost as fast as fixed-size and far smarter. Try this before reaching for semantic chunking. It's a sweet spot for most projects.
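A toy version of the idea (not LangChain's actual implementation): try the coarsest separator first, and only recurse to finer ones when a piece is still too big. For brevity it measures chunk size in characters rather than tokens, and it drops the separators it splits on.

```python
def recursive_split(text: str, chunk_size: int,
                    seps=("\n\n", ". ", " ", "")) -> list[str]:
    """Split on paragraphs first, then sentences, words, and finally characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep, *rest = seps
    if sep == "":
        # Last resort: hard character cut (same failure mode as fixed-size).
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

text = "Para one sentence. Another one.\n\nPara two is here."
print(recursive_split(text, chunk_size=25))
```

The first paragraph is too long, so it falls through to sentence-level splitting, while the second paragraph survives intact: natural boundaries are preserved whenever they fit.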

A Practical Starter Recipe

If you're building your first RAG system tomorrow:

Strategy: Recursive splitting (or fixed-size + overlap)
Chunk size: 500 tokens
Overlap: 50–100 tokens
Pre-clean: Strip headers, footers, page numbers
Sanity check: Print 10 random chunks. Do they make sense alone?
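The sanity-check step takes three lines to automate. `sample_chunks` is a hypothetical helper, not part of any library; seeding makes the sample reproducible between runs.

```python
import random

def sample_chunks(chunks: list[str], k: int = 10, seed: int = 0) -> list[str]:
    """Print k random chunks so you can eyeball whether each stands alone."""
    random.seed(seed)  # reproducible sample
    sample = random.sample(chunks, min(k, len(chunks)))
    for chunk in sample:
        print("---")
        print(chunk)
    return sample
```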

Iterate from there based on actual retrieval results.

Common Beginner Mistakes

  1. Splitting by character count instead of tokens. Token counts are what actually matter to the model.
  2. No overlap on dense technical content. Adjacent sentences reference each other constantly. Without overlap, those connections vanish.
  3. Never inspecting actual chunks. Always print samples before embedding. You'll catch most issues in 30 seconds.

Notes:

Why do we chunk documents?

3 reasons:

  1. LLM context windows are limited
  2. Smaller chunks improve retrieval precision
  3. Embeddings degrade in quality on very long text.

Fixed-size vs overlapping vs semantic chunking?

  1. Fixed-size: equal pieces, fast but breaks sentences.
  2. Overlapping: fixed-size with overlap, preserves cross-boundary context. Best beginner default.
  3. Semantic: splits at meaning shifts, highest quality, and more expensive.

What's a sensible default? Recursive splitting (or fixed-size + overlap), ~500 tokens with ~50–100 token overlap.

Why use overlap? So info that lands near a chunk boundary still appears fully in at least one chunk.

Why count tokens instead of characters? Tokens are how LLMs and embedding models actually count text; character counts can wildly mislead.

Day 3 Takeaway

Chunking is where data prep meets retrieval quality. Start simple — fixed-size + overlap or recursive splitting — inspect your actual chunks, and only get fancier if results demand it.

Coming Up on Day 4

In the next session, we'll dig into PDF processing.

See you on Day 4.