Day 8: Sparse Embeddings in RAG – Understanding Token-Based Semantic Retrieval

Day 7 on Dense Embedding — Capturing Semantic Meaning with Vector Representations spent the whole class on the dense half of the embedding world. Today flips to the other side. Sparse embeddings are older, simpler, and still the workhorse behind a huge amount of search in the real world. We will walk through what "sparse" really means, build the smallest possible sparse vector by hand, and then code up Term Frequency from scratch in two flavours, a toy sentence and a real movie dataset.

This blog post is a daily learning summary of my 40 Day RAG class from Syed Jaffer of Parotta Salna.

Terms Used Today

Sparse Embedding: A long vector where almost every position is zero. Each position belongs to one word from a fixed vocabulary.

Vocabulary: The dictionary of words our system knows about. Each word gets a fixed position (index) in every vector.

One-Hot / Binary Encoding: The simplest sparse vector. 1 if the word is present in the text, 0 otherwise. Ignores how many times it appeared.

Token: A single unit of text after splitting. Usually one word, sometimes a sub-word.

Tokenisation: Breaking text into tokens. The first step before counting anything.

Term Frequency (TF): How often a term appears in a document, divided by the total number of terms in that document. A normalised count.

Stopwords: Very common words like the, is, of that carry almost no meaning. Often removed before counting.

Counter: A Python helper from collections that counts how many times each item appears in a list.

Direct Text Matching: Search that only fires when the exact words appear. The classic strength of sparse methods.

1. From Dense Back to Sparse:

Day 7 on Dense Embedding — Capturing Semantic Meaning with Vector Representations closed with the rule that dense embeddings are powerful but not precise. They cluster weight loss with fat loss even when the words are different, which is great for meaning, less great when the user asks for an exact part number.

Sparse embeddings sit on the opposite side. They are precise but not powerful. If the exact token appears, they light up. If it does not, they stay silent. That is exactly the property we want for ISBNs, error codes, SKUs, and any place where "close in meaning" is the wrong answer. Today's class went down one step further. What is the smallest sparse embedding I can build by hand, and where does Term Frequency fit in?

2. What "Sparse" Actually Means:

A small definition first. A sparse embedding is a long vector where almost every value is zero. The vector has one position for every word in our vocabulary, and most positions are empty because most words do not appear in any given sentence.

The simplest way to picture it. Imagine our entire vocabulary is just five words.

vocabulary = {"python": 0, "redis": 1, "fastapi": 2, "docker": 3, "database": 4}

That dictionary is the index map. Every word gets a fixed slot. Position 0 is reserved for python forever, position 1 for redis, and so on. Now take a sentence.

"python fastapi application. fastapi is a nice framework."

To turn it into a sparse vector, we walk through the tokens and mark a 1 in the slot of every word we recognise.

vocabulary = {"python": 0, "redis": 1, "fastapi": 2, "docker": 3, "database": 4}
sentence = "python fastapi application. fastapi is a nice framework."

tokens = sentence.lower().split()
vector = [0] * len(vocabulary)

for token in tokens:
    if token in vocabulary:
        vector[vocabulary[token]] = 1

print("Sparse Vector:", vector)
# Sparse Vector: [1, 0, 1, 0, 0]

That five-long vector is a sparse embedding. The 1 at position 0 says python is here, the 1 at position 2 says fastapi is here, the zeros say the rest of our vocabulary is not. Words like application, nice, framework are completely ignored because they are not in the vocabulary. 2-new This style has two common names. One-hot encoding when only one position is set, binary encoding when multiple positions are set but each is either 0 or 1. Both are sparse, both are direct-text-matching, and both throw away one important piece of information. How often a word appeared.

3. Counting Instead of Just Marking:

The binary vector says fastapi is present. It does not say fastapi appeared twice. For many search tasks, that second piece matters a lot. A document that mentions fastapi ten times is probably more about FastAPI than one that mentions it once.

Python's Counter does the counting for us in one line.

from collections import Counter

sentence = "redis is fast and redis supports caching"
tokens = sentence.lower().split()
token_counts = Counter(tokens)

print(token_counts)
# Counter({'redis': 2, 'is': 1, 'fast': 1, 'and': 1, 'supports': 1, 'caching': 1})

Now we know redis showed up twice and everything else once. This is the bridge between one-hot encoding and Term Frequency. We have gone from "is it there?" to "how many times is it there?".

But raw counts have a problem. A long document will trivially have higher counts than a short one, even if the short one is more focused on the topic. To compare fairly across documents of different sizes, we normalise.

4. Term Frequency, From Scratch:

Term Frequency is the simplest normalisation in the book. Count of the term in this document, divided by the total number of terms in this document.

Screenshot 2026-05-25 at 20.38.16

A small worked example.

from collections import Counter

document = "Redis is an in-memory database database database"
tokens = document.split()
total_terms = len(tokens)

counter = Counter(tokens)
tf_scores = {term: count / total_terms for term, count in counter.items()}

print(tf_scores)
# {'Redis': 0.125, 'is': 0.125, 'an': 0.125,
#  'in-memory': 0.125, 'database': 0.5}

The document is eight tokens long. Database appears four times, so its TF is 4 / 8 = 0.5. The other words appear once each, so each has a TF of 1 / 8 = 0.125. The vector now carries proportion, not just presence.

A small thought to sit with. TF turns a count into a share. Half of this document is the word database. That is a much stronger signal than a raw count of 4 floating without context

5. TF on a Real Dataset:

The notebook in class scaled this exact idea up to an IMDB movie dataset of descriptions. The shape of the code is the same, just wrapped in a small search function. The full notebook is here.

The interesting bits are worth walking through.

import re
from collections import Counter
import pandas as pd

df = pd.read_csv("imdb_processed.csv")
df["Description"] = df["Description"].fillna("")

stopwords = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
             "with", "is", "are", "was", "were", "be", "it", "that", "this",
             "they", "he", "she", "you", "your", "our", ...}  # trimmed

def tokenize(text):
    words = re.findall(r"[a-zA-Z0-9]+", str(text).lower())
    return [w for w in words if w not in stopwords]

doc_tokens = df["Description"].map(tokenize)
doc_lengths = doc_tokens.map(len).replace(0, 1)

tf_docs = []
for tokens, length in zip(doc_tokens, doc_lengths):
    counts = Counter(tokens)
    tf = {term: count / length for term, count in counts.items()}
    tf_docs.append(tf)

What changed from the toy version.

Tokenisation got smarter. Instead of .split(), the class used a regex ([a-zA-Z0-9]+) that pulls out only word-like chunks and drops punctuation. Then it lowercases everything so Database and database count as the same token.
Stopwords got removed. Words like the, is, of appear in almost every movie description and would dominate the TF scores without adding any signal. Dropping them up front cleans the vectors.
Every document got its own TF dictionary. Instead of one fixed vocabulary list, we store a small dictionary per movie, mapping each present term to its TF. This is the practical sparse representation. We never write out the zeros, we just remember the non-zero positions.

The search function then does the smallest possible matching.

def search_movies(query, top_k=5):
    q_tokens = tokenize(query)
    if not q_tokens:
        return []

    q_counts = Counter(q_tokens)
    q_len = len(q_tokens)
    q_tf = {term: count / q_len for term, count in q_counts.items()}

    scores = []
    for idx, doc_tf in enumerate(tf_docs):
        score = 0.0
        for term, q_weight in q_tf.items():
            score += q_weight * doc_tf.get(term, 0.0)
        if score > 0:
            scores.append((score, idx))

    scores.sort(reverse=True)
    return [df.loc[idx, "Movie Name Clean"] for _, idx in scores[:top_k]]

The scoring is a dot product. For each query term, multiply its TF in the query by its TF in the document, sum across all query terms, and that single number is the relevance score. Documents that share more of the query's words, in higher proportions, rank higher. Run a query like "bank heist money" and the list comes back as movies whose descriptions actually use those words. Direct text matching, with weights based on how concentrated those words are in each description.

6. Where Plain TF Falls Short:

TF alone has a blind spot. It treats every word as equally informative. If our dataset is full of movie descriptions, the word movie will appear all over the place. A document mentioning movie ten times scores high on a query for movie, even though that word tells us almost nothing about which film is the right one.

Day 6 on Embeddings — Semantic Similarity, Cosine, and Dense vs Sparse already hinted at the fix. IDF (Inverse Document Frequency). IDF downweights words that appear in many documents and upweights rare ones. A word that shows up in every description is mostly noise, a word that shows up in only five descriptions is probably a strong signal. Multiplying TF by IDF gives us TF-IDF, which is the real workhorse formula behind classical search. BM25 is essentially TF-IDF with a few smoothing tricks so a word repeated 100 times does not score 100× a word repeated once. A small thought to sit with. Sparse retrieval is a chain of small fixes. Binary tells us presence, Counter tells us frequency, TF normalises by length, IDF normalises by rarity, BM25 smooths the curve. Each step patches the weakness of the previous one. We started that chain today.

7. Why This Still Matters in 2026:

Dense embeddings are clearly the headline act in modern RAG, so it is fair to ask why we are still spending a day on TF and sparse vectors. Three reasons.

Exact matches. Dense models cluster iPhone 15 near iPhone 16. Sparse keeps them apart. For part numbers, model IDs, error codes, dates, names, sparse is still the right tool.
Hybrid search. Most production RAG systems combine dense and sparse scores into one ranking. To tune that blend (and to debug it when it goes wrong) we need to understand both sides. We cannot reason about a hybrid we only half-understand.
Interpretability. A sparse score is explainable in plain language. "This document scored high because it contains the words X and Y." A dense score is a number from a neural network with no built-in justification. When something looks off, sparse is much easier to inspect. The code from this section is all on the class GitHub at 40-day-rag-parottasalna/04.embedding/sparse, and it walks through binary, token counts, TF, TF-IDF, BM25, cosine on sparse vectors, and hybrid. Today's post covers the first three files in that folder. The rest we will pick up on the next days.

8. If This Came In An Interview:

What is a sparse embedding? A long vector where almost every value is zero, with one position per word in a fixed vocabulary. A 1 (or a TF score) lights up the positions for words actually present in the text.
How is sparse different from dense? Sparse vectors map directly to vocabulary words and are mostly zero. Dense vectors are mostly non-zero and have no individually interpretable dimensions. Sparse matches exact words, dense matches meaning.
What is one-hot encoding? The simplest sparse representation. A 1 in the slot for each word that appears, 0 everywhere else. Throws away frequency information.
What is Term Frequency? The count of a term in a document divided by the total number of terms in that document. A normalised count that compares fairly across documents of different lengths.
Why divide by document length? A longer document trivially has higher raw counts. Normalising by length turns the count into a proportion, which is comparable across documents.
Why do we remove stopwords before TF? Words like the, is, of appear everywhere and add no signal. Including them inflates the vectors and dilutes the meaningful terms.
Why is TF on its own not enough? It treats all words as equally informative. A word that appears in every document scores just as well as a rare one. IDF fixes this by downweighting common words.
When would you pick sparse over dense in a real system? Exact-match lookups (part numbers, IDs, error codes), interpretable scoring, and as the keyword side of a hybrid pipeline.

9. Summing It Up:

If we remember one thing from today, it is this: a sparse embedding is a long vector with one slot per vocabulary word, mostly zero, that lights up only when its exact words are present. Binary encoding marks presence, Counter records frequency, and Term Frequency turns frequency into a length-normalised proportion. TF alone treats every word as equally important, which is the gap that IDF and BM25 close in the next days. Sparse will never match dense for semantic meaning, but for exact-match retrieval and as one half of a hybrid pipeline, it is still the right tool in 2026.

Coming Up on Day 9 on Sparse Embeddings in RAG – Understanding Token-Based Semantic Retrieval - Part 2

We have built TF from scratch and seen where it stops working. The natural next step is IDF, which fixes the every word weighed the same problem, and then TF-IDF, which combines the two into the formula classical search has used for decades. After that, BM25 and cosine on sparse vectors are the last stops before we get to hybrid retrieval. Tomorrow's post picks up from there.

Resources Worth Bookmarking

For anyone wanting to dig deeper than this post does, the ones I have found useful so far.

Stanford NLP (Manning, Raghavan, Schütze), Introduction to Information Retrieval, free online edition. Chapter 6 is the classic walkthrough of TF, IDF, and TF-IDF, written more carefully than most blog posts.
scikit-learn user guide, "Feature extraction → Text feature extraction". A short, practical tour of CountVectorizer and TfidfVectorizer, with examples that drop straight into a notebook.
Elasticsearch documentation on BM25. A friendly explanation of how the production version of TF-IDF is actually scored and tuned.
The class GitHub at syedjaferk/40-day-rag-parottasalna. Files 01 through 11 march through the entire sparse pipeline, from one-hot up to hybrid OpenSearch. Today's post covers files 01 to 03.

That's all for today. Let's meet up again tomorrow with Day 9.

Thanks for reading.

Cheers!