Day 5: Sliding Chunks, Token Costs & Processing Real PDFs

Before we get into PDF processing, two more chunking strategies are worth meeting first: sliding chunking and token-based chunking. Then we will finally crack open a real PDF.
This blog post is a daily learning summary of my 40-day RAG class from Syed Jaffer of Parotta Salna.
Terms Used Today
- Sliding window: A fixed-size chunk that slides forward by a smaller step, so each chunk overlaps the previous one.
- Step size (stride): How far the window moves forward each time.
- Token-based chunking: Sizing chunks by tokens (what the LLM actually counts) instead of characters or words.
- PyPDFLoader: A LangChain loader that opens a PDF and returns its text page by page.
1. Sliding Chunking:
Sliding chunking is the close cousin of overlapping chunking from Day 3. We pick a window size and a step size. If the step is smaller than the window, chunks overlap. If they are equal, we are back to plain fixed-size with no overlap.
For example, a 500-token window with a 100-token step produces chunks at positions 0–500, 100–600, 200–700, and so on. Every pair of consecutive chunks shares 400 tokens, a safety net for context that falls near a chunk boundary.
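Here is a minimal sketch of the idea in Python. It uses whitespace-split words as stand-in tokens and a hypothetical `doc.txt`; a real pipeline would count tokens with the model's tokenizer instead:

```python
# Sliding-window chunker sketch: fixed window, smaller step, overlapping chunks.
# "Tokens" here are whitespace-split words, purely for illustration.
def sliding_chunks(tokens, window=500, step=100):
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

tokens = open("doc.txt").read().split()  # hypothetical input file
chunks = sliding_chunks(tokens, window=500, step=100)

# With window=500 and step=100, each token lands in up to 5 chunks,
# so the index ends up storing roughly 5x the original token count.
print(len(tokens), "tokens ->", len(chunks), "chunks")
```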
The catch is cost. Heavy overlap means we are storing the same tokens four or five times over. More embeddings, more storage in the vector DB, more money on paid APIs. Use sliding chunking when accuracy matters more than budget (legal, medical, and dense technical text). Avoid it when scaling matters.
2. Token-Based Chunking:
Token-based chunking is less of a separate strategy and more of a discipline. It says, "Measure chunk size in tokens, not characters or words, because tokens are what LLMs and embedding models actually count."
Why does it matter? Two reasons.
- Cost. Paid APIs charge per token. A "1000-character" chunk could be 250 tokens or 700 tokens, depending on the content. Token-based sizing keeps the bill predictable.
- Context windows. Every LLM has a hard token cap. To fit a chunk, the question, and the system prompt inside that window, we have to measure in tokens.
We usually combine token-based sizing with one of the other strategies. The right phrase is "recursive splitting with 500-token chunks and 50-token overlap", not "token-based chunking instead of overlapping".
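As a concrete sketch, here is that combination using tiktoken for counting and LangChain's recursive splitter sized in tokens. The encoding name and import path are assumptions; the splitter module moved across LangChain versions (older releases import it from `langchain.text_splitter`), and the encoding should match your actual model:

```python
import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Count tokens the way an OpenAI-family model would. The encoding name
# "cl100k_base" is an assumption; pick the one matching your model.
enc = tiktoken.get_encoding("cl100k_base")
text = open("doc.txt").read()  # hypothetical input file
print("characters:", len(text), "| tokens:", len(enc.encode(text)))

# Recursive splitting measured in tokens, not characters:
# 500-token chunks with a 50-token overlap, as described above.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_text(text)
print(len(chunks), "chunks")
```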
3. A Quick Word on the Vector DB:
We chunk the text, embed each chunk into a vector, and store it in a vector database (Chroma, FAISS, Pinecone, or pgvector). Each vector is stored alongside metadata, including the original chunk text, the source document, the page number, and a chunk ID. At query time, the user's question is embedded as well, and the vector DB returns the top-K nearest chunks.
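A minimal sketch of that store-and-query loop with Chroma, using toy chunks; note that when no embedding function is supplied, Chroma falls back to its built-in default model:

```python
import chromadb

client = chromadb.Client()  # in-memory client; use PersistentClient for disk
collection = client.create_collection("day5_docs")

# Store each chunk alongside its metadata. Chroma embeds the text with its
# default embedding function since we did not supply a custom one.
collection.add(
    ids=["chunk-0", "chunk-1"],
    documents=["Sliding windows overlap by design...",
               "Token-based sizing keeps costs predictable..."],
    metadatas=[{"source": "notes.pdf", "page": 1},
               {"source": "notes.pdf", "page": 2}],
)

# At query time, the question is embedded too, and the nearest
# top-K chunks come back with their metadata for citation.
results = collection.query(query_texts=["why measure chunks in tokens?"],
                           n_results=2)
print(results["documents"])
print(results["metadatas"])
```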
Two things to keep in mind. More overlap means more chunks, which means a bigger and slower index. And the chunk size affects retrieval precision — too small loses context; too big dilutes meaning. The 500-token sweet spot exists for a reason. We will give vector DBs and retrieval their own days later, but this is enough to know what we are building towards.
4. PDF Processing — The Messy Real-World Step:
PDFs were designed to look right when printed, not to be machine-readable. Headers, footers, columns, and tables all get jumbled when we extract raw text. And if the PDF is scanned, there is no text at all — just images of text that need OCR first.
For most clean-text PDFs, LangChain's PyPDFLoader is the easiest start.
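A minimal loading sketch, with a hypothetical file name; the import path assumes a recent LangChain release (older versions import from `langchain.document_loaders`):

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("report.pdf")  # hypothetical file name
pages = loader.load()  # one Document per page

first = pages[0]
print(first.page_content[:200])  # raw extracted text from page 1
print(first.metadata)            # e.g. {'source': 'report.pdf', 'page': 0}
```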
After splitting, each chunk keeps that source metadata (page number, file path), which flows into the vector DB and lets us cite results back to the original document.
When PyPDFLoader is not enough, there are specialised libraries worth knowing: camelot for extracting tables into pandas DataFrames, pdfplumber for fine-grained layout control, unstructured for messy mixed formats, and Tesseract for OCR on scanned documents. We do not need to memorise these; we just need to know they exist and reach for them only when the default loader mangles the text.

5. Summing It Up:
If we remember one thing from today, it is this: chunking is a trade-off between context, precision, and cost. Sliding gives us safety nets at the cost of token bloat. Token-based sizing keeps the bill predictable. And PDFs, the messiest source of all, usually mean PyPDFLoader plus a careful look at the output, with camelot, pdfplumber, or OCR waiting in reserve for the harder cases.
Coming Up on Day 6
Four days of chunking and one day of PDF extraction is enough preparation. From tomorrow, we move from preparing the data to actually retrieving it. We will look at how a query travels through the system, how the vector DB finds the top-K most relevant chunks, and the small tricks (re-ranking, hybrid search) that decide whether retrieval feels magical or mediocre.
That's all for today. Let's meet up again tomorrow with Day 6.
Thanks for reading.
Cheers!