Day 4: Chunking Continued - Semantic Chunking, when meaning decides where to split

Smart chunking decides RAG quality. Learn semantic chunking with Sentence Transformers, when to use embedding-based chunking, and the cost tradeoffs that matter.

Parathan Thiyagalingam
May 10, 2026 · 6 min read

This blog post is a daily learning summary of my 40-day RAG class from Syed Jaffer of Parotta Salna.

Small change of plan. Yesterday I said today would be PDF day, but chunking needs one more day. Once you see the code below, you'll get what semantic chunking is and why it's worth the extra day.

Quick Reminder: Why We Even Chunk

Picture this. You ask your RAG bot:

"What's our refund policy for damaged items?"

If your handbook is stored as a single giant chunk, its embedding is the average of all topics in the book, like refunds, hiring, parking, harassment policy, and holidays. That vector points everywhere and nowhere.

Split the same handbook by topic. Now the "refund policy" section has its own sharp embedding, pointing straight at refund questions. Retrieval lights up.

Chunking groups related sentences so each embedding has one clear meaning the search can latch onto.

That's the whole point. Better chunking = better relevancy.
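You can see the dilution for yourself. Here's a minimal sketch using the same sentence-transformers model as the script later in this post; the handbook snippets are made up for illustration.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# One giant chunk mixing every handbook topic vs. a focused refund-only chunk
giant_chunk = (
    "Refunds for damaged items are issued within 14 days. "
    "New hires must complete onboarding in their first week. "
    "Parking permits are assigned by floor. "
    "Company holidays follow the national calendar."
)
focused_chunk = "Refunds for damaged items are issued within 14 days."
query = "What's our refund policy for damaged items?"

q, giant, focused = model.encode([query, giant_chunk, focused_chunk])
print("query vs giant chunk:  ", cosine_similarity([q], [giant])[0][0])
print("query vs focused chunk:", cosine_similarity([q], [focused])[0][0])

The focused chunk should score noticeably higher: the giant chunk's vector is smeared across four unrelated topics.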

The Smart Approach: Semantic Chunking

Day 3 covered fixed-size and overlapping chunking. Both are dumb in a useful way: fast and cheap. But neither actually understands the text.

There's one approach that does: semantic chunking. It uses embeddings to detect where the topic shifts, and cuts there.

Semantic Chunking, in Plain English

Here's the issue with overlapping chunks: sometimes the overlap still doesn't make sense. If the window happens to cut between "Roman Empire" and "Han Dynasty", you've stitched two unrelated topics into one chunk and called it context. The embedding goes back to being mush.

Semantic chunking fixes this by asking a different question:

"Where does the topic actually change?"

The recipe:

  1. Break the document into sentences
  2. Embed each sentence
  3. Walk through them one by one, comparing each to the previous
  4. When the similarity drops below a threshold → that's a topic shift → start a new chunk

No fixed sizes. No arbitrary overlap. The chunks end up whatever shape the meaning takes.

Working Code with Sentence Transformers

sentence-transformers is a free, open-source library that runs embedding models on your laptop. No API key. No per-token cost.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import re

model = SentenceTransformer('all-MiniLM-L6-v2')

text = (
    "The Roman Empire began in 27 BC. "
    "Augustus became its first emperor. "
    "He reformed the army and built roads across Europe. "
    "Meanwhile, in China, the Han dynasty was flourishing. "
    "They invented paper and opened the Silk Road. "
    "Trade between East and West grew rapidly."
)

# 1. Split into sentences
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# 2. Embed each sentence
embeddings = model.encode(sentences)

# 3. Break on topic shifts
threshold = 0.5
chunks = []
current = [sentences[0]]

for i in range(1, len(sentences)):
    sim = cosine_similarity([embeddings[i-1]], [embeddings[i]])[0][0]
    if sim < threshold:
        chunks.append(" ".join(current))
        current = [sentences[i]]
    else:
        current.append(sentences[i])
chunks.append(" ".join(current))

for i, c in enumerate(chunks, 1):
    print(f"Chunk {i}: {c}\n")

Run it, and you get something like the following:

Chunk 1: The Roman Empire began in 27 BC. Augustus became its first emperor. He reformed the army and built roads across Europe.

Chunk 2: Meanwhile, in China, the Han dynasty was flourishing. They invented paper and opened the Silk Road. Trade between East and West grew rapidly.

Two clean chunks. One per topic. No mid-thought cuts. No token count to babysit.

Tip: 0.5 is a starting threshold. Tighter (0.6+) gives smaller, more focused chunks. Looser (0.3) gives fewer, bigger ones. Tune it on your data.
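If you'd rather sweep than guess, a few extra lines show how sensitive your data is to the threshold. This sketch reuses the sentences and embeddings variables from the script above.

# Sweep thresholds and watch how the chunk count changes.
for threshold in (0.3, 0.4, 0.5, 0.6, 0.7):
    n_chunks = 1
    for i in range(1, len(sentences)):
        sim = cosine_similarity([embeddings[i-1]], [embeddings[i]])[0][0]
        if sim < threshold:
            n_chunks += 1
    print(f"threshold={threshold}: {n_chunks} chunks")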

Model Choice

You can run semantic chunking with a local model (like all-MiniLM-L6-v2 above, which is free and runs fine on a laptop) or with a paid API (OpenAI's text-embedding-3-large, Cohere, etc.).

Paid models catch slightly subtler topic shifts. But remember: semantic chunking embeds every sentence of every document just to figure out where to split. A small RAG corpus of 10,000 documents averaging 100 sentences each means 1,000,000 embedding calls before you've embedded a single query.
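Here's that back-of-envelope as code. The batch size is an assumption for illustration, not any provider's actual limit.

# How much embedding work semantic chunking adds before indexing even starts.
docs = 10_000
sentences_per_doc = 100
total_sentences = docs * sentences_per_doc
print(f"{total_sentences:,} sentence embeddings just to find split points")

# Even batched at an assumed 1,000 sentences per request,
# that's 1,000 API round-trips before a single query is embedded.
batch_size = 1_000
print(f"{total_sentences // batch_size:,} batched requests")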

Rule of thumb: start with a local sentence-transformer for chunking. Save paid embeddings for the actual indexing step where they earn their keep.

Which Strategy, When?

A quick guide based on what you're working with:

  1. Internal wiki, FAQs, blog content: Recursive splitting is fine (a minimal sketch follows this list). Don't overthink it.
  2. Dense technical docs, research papers, and legal text: Semantic chunking. Topics shift sharply; you want clean cuts.
  3. Books, podcast transcripts, mixed-topic content: Semantic chunking shines. Overlapping would smear topics together.
  4. Huge knowledge base, retrieval still feels off: Try a paid embedding model for the chunking step — but only after measuring whether the local one is actually the bottleneck.

Start cheap. Measure. Upgrade only if results demand it.
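For reference, here's what recursive splitting looks like under the hood. This is a simplified sketch, not a drop-in for a real library (LangChain's RecursiveCharacterTextSplitter, for instance, also handles overlap and keeps separators). The idea: try the coarsest separator first, fall back to finer ones only when a piece is still too big.

def recursive_split(text, max_chars=500, separators=("\n\n", "\n", ". ", " ")):
    # Small enough, or nothing finer to split on: emit as-is.
    # (Real splitters hard-cut oversized leftovers; we keep it simple.)
    if len(text) <= max_chars or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator didn't occur; try the next finer one.
        return recursive_split(text, max_chars, finer)
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate          # greedily pack pieces into the chunk
        else:
            if current:
                chunks.append(current)
            if len(piece) <= max_chars:
                current = piece          # start a fresh chunk with this piece
            else:
                # The piece alone is oversized: recurse with finer separators.
                chunks.extend(recursive_split(piece, max_chars, finer))
                current = ""
    if current:
        chunks.append(current)
    return chunks

Paragraph breaks first, then lines, then sentences, then words: structure decides the cuts. That's exactly why it works well on clean, structured docs and poorly on ones whose structure doesn't track meaning.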

Common Interview Questions on Chunking

If you're prepping for an LLM or RAG interview, these come up often:

Q: Why not just use the biggest possible chunk size?

Big chunks dilute the embedding. The vector becomes an average of too many ideas, and retrieval gets noisy.

Q: What's the real tradeoff with semantic chunking?

Quality goes up, but cost and time go up, too. You're embedding the whole document just to split it before you ever embed it for storage.

Q: When does overlapping chunking fail?

When adjacent topics are unrelated. The overlap glues two ideas together into one chunk, and the embedding loses focus.

Q: How do you pick the similarity threshold?

No universal answer. Run it on your real docs, eyeball the chunks, and adjust until topic shifts feel natural. Usually somewhere between 0.3 and 0.7.

Q: Can you mix chunking strategies in one pipeline?

Yes, and good systems do, in the same pipeline:

  1. recursive splitting for clean, structured docs
  2. semantic chunking for the messy ones
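A sketch of what that routing could look like. The is_structured heuristic is a made-up placeholder, semantic_chunk just wraps the script from earlier in a function, and recursive_split is the sketch shown above.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import re

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_chunk(doc, threshold=0.5):
    # The script from earlier in this post, wrapped as a function.
    sentences = re.split(r'(?<=[.!?])\s+', doc.strip())
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity([embeddings[i-1]], [embeddings[i]])[0][0]
        if sim < threshold:
            chunks.append(" ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

def is_structured(doc):
    # Placeholder heuristic, not a real classifier: heading-heavy docs
    # tend to be wiki/FAQ-style and split fine on structure alone.
    lines = doc.splitlines()
    return bool(lines) and sum(l.startswith("#") for l in lines) / len(lines) > 0.05

def chunk_document(doc):
    # Route each document to the cheapest strategy that fits it.
    if is_structured(doc):
        return recursive_split(doc, max_chars=500)  # sketch shown earlier
    return semantic_chunk(doc)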

Notes:

Why does chunking actually improve relevancy? It groups related sentences so each embedding represents one idea. Sharper meaning = sharper retrieval.

Local vs. paid embedding models for chunking? Start local with sentence-transformers. It's free and good enough for most documents. Paid APIs catch subtler topic shifts but the per-sentence cost adds up fast.

When should I move from overlapping to semantic chunking? When your documents jump between unrelated topics and the overlap starts welding them together into one noisy chunk.

Is the paid model ever worth it for chunking? Rarely. Only after you've measured that the local model is genuinely missing topic shifts that hurt retrieval. For most projects, it isn't worth it.

Day 4 Takeaway

Semantic chunking lets meaning decide where to split, not a token counter. Sentence-transformers makes it free and local — start there, and only reach for paid embedding APIs if your retrieval results actually demand it.

Coming Up on Day 5

Now we actually get to PDF processing, which is the messy, real-world step where raw documents become clean text ready for chunking. Tables, headers, footers, scanned images, weird encodings... all the things textbooks skip.

See you on Day 5.