Disclaimer: I’m not an ML engineer or AI researcher. I’m just a guy who builds things and tries to make his tools work better. Take everything here with that in mind. This is what worked for me, not a rigorous benchmark.
@karpathy recently shared how he builds personal knowledge bases with LLMs. The pattern is simple: dump raw data into a raw/ folder, have an LLM compile it into a markdown wiki, then query it in Obsidian. He mentioned he didn’t need “fancy RAG” because the LLM handles index files well at small scale.
And then he dropped this at the end:
I vibe coded a small and naive search engine over the wiki, which I both use directly, but more often I want to hand it off to an LLM via CLI as a tool for larger queries.
That throwaway line is the whole game once your wiki grows. Here’s why, and how I built it.
The Problem That Shows Up Around 200 Files
I’ve been running this exact pattern. YouTube channels, articles, all distilled into markdown summaries by Claude Code. It works great until you hit around 200 files.
My first attempt at solving the search problem was a VOCABULARY.md file — a hand-maintained list of canonical terms and their synonyms. The idea was: before grepping, Claude reads the vocabulary, expands the query, then searches for all variants at once.
So asking “what are the best ways to grow a SaaS?” used to look like this:
```
Claude reads INDEX.md
  → reads VOCABULARY.md
  → "saas" → also search "subscription software", "software business"
  → "growth" → also search "scaling", "acquire users", "traction"
  → greps summaries/ for all expanded terms
  → finds 20 matching files
  → reads each one
  → by file 12, early context is being compressed
  → answer quality degrades
```
The vocabulary trick helped with recall. It stopped Claude from missing files when a speaker used different words. But it didn’t solve the real problem. You still ended up reading 15–20 files per question, because grep doesn’t rank. It just returns everything that matches.
The bottleneck isn’t missed synonyms. It’s context window pressure. Reading 15 summary files burns 30–50k tokens per question. On a 224-file wiki, that adds up fast.
There was a second problem too: the vocabulary file kept growing into a synonym dictionary, which is exactly what a semantic embedding model is already good at. Maintaining "growth" → "scaling, acquire users, traction" by hand is pointless once you have a model that understands those are the same concept.
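For concreteness, the pattern the vocabulary file implemented was just hand-rolled synonym expansion. A sketch (not my actual VOCABULARY.md — terms taken from the examples above):

```python
# A hand-maintained synonym map that has to grow forever as topics are added.
VOCABULARY = {
    "growth": ["scaling", "acquire users", "traction"],
    "saas": ["subscription software", "software business"],
}

def expand(terms):
    """Return the original terms plus every listed synonym."""
    expanded = set(terms)
    for term in terms:
        expanded.update(VOCABULARY.get(term, []))
    return expanded

print(sorted(expand(["saas", "growth"])))  # 7 grep targets for a 2-word query
```

Every entry in that map is something an embedding model already knows for free.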
The Fix: Two Small Scripts
One builds a local SQLite vector store from your summaries. The other is a CLI Claude calls before reading anything.
The query loop becomes:
```
Claude calls: python3 scripts/rag_search.py "best ways to grow a saas"
  → gets back 5 ranked file paths in a few seconds
  → reads only those 5 files
  → answers
```
Around 6 tool calls instead of 20+. Token cost drops by about 80% per query.
How It Works
```
summaries/            ← your LLM-compiled wiki
rag.db                ← SQLite vector store (gitignored, a few MB)
scripts/
├── rag_index.py      ← builds the vector store
└── rag_search.py     ← CLI search tool
```
The model is all-MiniLM-L6-v2 — about 90MB, runs fully offline on CPU, and downloads once via sentence-transformers. No API, no cloud, no ongoing cost.
rag_index.py
It walks summaries/, splits each file into chunks by section headers so each concept gets its own embedding, then stores them in SQLite.
```python
#!/usr/bin/env python3
"""
Index all summaries/ into a local SQLite vector store.
Run once, then re-run after adding new summaries.

    pip install sentence-transformers numpy
    python3 scripts/rag_index.py
"""
import os, re, sqlite3

import numpy as np
from sentence_transformers import SentenceTransformer

ROOT = os.path.join(os.path.dirname(__file__), "..")
DB_PATH = os.path.join(ROOT, "rag.db")
SUMMARIES_DIR = os.path.join(ROOT, "summaries")
MODEL_NAME = "all-MiniLM-L6-v2"


def strip_frontmatter(text):
    return re.sub(r"^---\n.*?\n---\n?", "", text, flags=re.DOTALL)


def chunk_markdown(text):
    text = strip_frontmatter(text)
    # Split on H2/H3 headers so each concept gets its own embedding.
    parts = re.split(r"\n(?=#{2,3} )", text.strip())
    return [p.strip() for p in parts if len(p.strip()) > 50]


def build_index():
    model = SentenceTransformer(MODEL_NAME)
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id INTEGER PRIMARY KEY,
            path TEXT NOT NULL,
            chunk TEXT NOT NULL,
            embedding BLOB NOT NULL
        )
    """)
    conn.execute("DELETE FROM chunks")  # full rebuild each run

    files = sorted(
        os.path.join(r, f)
        for r, _, fs in os.walk(SUMMARIES_DIR)
        for f in fs if f.endswith(".md")
    )
    print(f"Indexing {len(files)} files...")

    for i, path in enumerate(files):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        chunks = chunk_markdown(text)
        if not chunks:
            continue
        embeddings = model.encode(chunks, show_progress_bar=False)
        conn.executemany(
            "INSERT INTO chunks (path, chunk, embedding) VALUES (?, ?, ?)",
            [(path, c, e.astype(np.float32).tobytes()) for c, e in zip(chunks, embeddings)],
        )
        if (i + 1) % 20 == 0:
            print(f"  {i + 1}/{len(files)}")

    conn.commit()
    total = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    conn.close()
    print(f"Done. {len(files)} files → {total} chunks → rag.db")


if __name__ == "__main__":
    build_index()
```
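To sanity-check the chunker, here is the same splitting logic run on a toy document (the two helper functions are reproduced so the snippet runs standalone; the sample text is made up):

```python
import re

def strip_frontmatter(text):
    return re.sub(r"^---\n.*?\n---\n?", "", text, flags=re.DOTALL)

def chunk_markdown(text):
    text = strip_frontmatter(text)
    parts = re.split(r"\n(?=#{2,3} )", text.strip())
    return [p.strip() for p in parts if len(p.strip()) > 50]

doc = """---
title: demo
---
## Growth tactics
Cold outreach, SEO, and partnerships each compound at different rates.
## Pricing
Usage-based pricing ties revenue directly to the value a customer gets.
"""

for chunk in chunk_markdown(doc):
    print(chunk.splitlines()[0])  # prints "## Growth tactics", then "## Pricing"
```

Frontmatter is dropped, each H2 section becomes one chunk, and anything under 50 characters (stray headers, empty sections) is filtered out.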
rag_search.py
It embeds the query, runs cosine similarity against all chunks, and returns unique file paths ranked by best match.
The search works in three steps. First, the query is turned into a vector, the same way every chunk was vectorized at index time. Then cosine similarity is computed between the query vector and every stored chunk — think of it as measuring the angle between two vectors: the more they point in the same direction, the higher the score. Finally, each file keeps only its best-scoring chunk, and the files are sorted by that score. What comes back is a ranked list of files where at least one section closely matched what you asked.
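If cosine similarity is new to you, here's the whole idea in plain Python with made-up 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions, but the math is identical):

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

growth    = [0.90, 0.10, 0.00]  # toy embedding for a "growth" chunk
scaling   = [0.85, 0.20, 0.05]  # near-synonym: points almost the same way
gardening = [0.10, 0.10, 0.95]  # unrelated topic: nearly orthogonal

print(cosine(growth, scaling))    # high — same concept
print(cosine(growth, gardening))  # low — different concept
```

The embedding model's job is to place text about the same concept in the same direction, so this one number does the work the synonym lists used to do.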
```python
#!/usr/bin/env python3
"""
Semantic search over the wiki.

    python3 scripts/rag_search.py "best ways to grow a saas in 2026"
    python3 scripts/rag_search.py "fibonacci retracement trading" --top 8
"""
import argparse, os, sys, sqlite3

import numpy as np
from sentence_transformers import SentenceTransformer

ROOT = os.path.join(os.path.dirname(__file__), "..")
DB_PATH = os.path.join(ROOT, "rag.db")
MODEL_NAME = "all-MiniLM-L6-v2"


def cosine(a, b):
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / d) if d > 1e-10 else 0.0


def search(query, top_n=5):
    if not os.path.exists(DB_PATH):
        print("ERROR: rag.db not found. Run: python3 scripts/rag_index.py", file=sys.stderr)
        sys.exit(1)

    model = SentenceTransformer(MODEL_NAME)
    q = model.encode([query])[0].astype(np.float32)

    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute("SELECT path, embedding FROM chunks").fetchall()
    conn.close()

    # Keep only the best-scoring chunk per file.
    best = {}
    for path, blob in rows:
        emb = np.frombuffer(blob, dtype=np.float32)
        score = cosine(q, emb)
        if path not in best or score > best[path]:
            best[path] = score

    for path, score in sorted(best.items(), key=lambda x: -x[1])[:top_n]:
        print(f"{score:.3f}  {os.path.relpath(path, ROOT)}")


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("query", nargs="+")
    p.add_argument("--top", type=int, default=5)
    args = p.parse_args()
    search(" ".join(args.query), args.top)
```
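The only non-obvious part of search() is the file-level ranking: a file can match on many chunks, but only its strongest one counts. Isolated with fake scores:

```python
# (path, score) pairs as they'd come out of the chunk loop — fake data.
hits = [("a.md", 0.42), ("b.md", 0.71), ("a.md", 0.63), ("c.md", 0.55)]

best = {}
for path, score in hits:
    if path not in best or score > best[path]:
        best[path] = score  # keep only each file's strongest chunk

ranked = sorted(best.items(), key=lambda x: -x[1])
print(ranked)  # [('b.md', 0.71), ('a.md', 0.63), ('c.md', 0.55)]
```

Max-over-chunks means one highly relevant section is enough to surface a file, even if the rest of it is about something else.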
The CLAUDE.md Change: The Part That Actually Makes It Work
The scripts do nothing if your LLM doesn’t know to use them. You have to update the instruction layer.
Before:
```markdown
## Query Strategy
When asked a question:
1. Read INDEX.md first — identify relevant channels/topics
2. Read VOCABULARY.md — expand query terms to include all synonyms
3. Use Grep with expanded terms — search canonical term AND all synonyms
4. Read matched files
5. Synthesize
```
The vocabulary step was a manual workaround for the fact that grep doesn’t understand meaning. You had to pre-list every possible way someone might say a thing. It helped, but you still ended up reading 15–20 files because everything that matched got pulled in. Grep doesn’t rank, it just returns.
After:
```markdown
## Query Strategy
When asked a question:
1. Run the RAG search first:
   python3 scripts/rag_search.py "your query here" --top 7
   If rag.db is missing, run rag_index.py once to build it.
2. Read ONLY the returned files — do not scan all of summaries/
3. Only read raw content files if a direct quote is needed
4. Synthesize

Only consult VOCABULARY.md if the search returns no results at all
and you suspect the query uses an opaque term (platform acronym,
trading abbreviation, coined phrase the embedding model won't know).
```
VOCABULARY.md still exists, but its scope shrank dramatically. It no longer holds synonym lists — the semantic model handles those automatically. It only contains terms that are genuinely opaque: things like KU for Kindle Unlimited, BSR for bestseller rank, ATR for average true range. Terms a reasonable person wouldn’t recognize, and that the embedding model can’t connect to their meaning because they’re too new, too niche, or invented. The trigger for consulting it is also narrower: not “thin results” but zero results. If the search finds anything, you trust the semantic ranking and read those files.
The critical line is step 2: do not scan all of summaries/. Without that explicit constraint, Claude will grep broadly anyway and defeat the whole point. LLMs need explicit rules, not just hints.
Real Results on My Wiki
224 summary files, tested cold:
```
Query: "best ways to grow a saas in 2026"

0.711  summaries/youtube/starterstory/2025-08-02_how-i-built-a-10k-month-saas-beginner-strategy.md
0.582  summaries/youtube/starterstory/2026-01-21_how-i-grew-my-saas-to-150k-year-with-reddit-and-seo.md
0.580  summaries/youtube/starterstory/2025-08-23_how-i-finally-built-a-10k-month-saas-30-failures.md
0.546  summaries/youtube/starterstory/2025-12-02_how-i-built-it-30k-month-micro-saas-subscribr-breakdown.md
0.522  summaries/youtube/starterstory/2023-03-27_he-built-a-600-000-one-person-business-with-video-editing.md

Query: "building recurring revenue online"

0.555  summaries/youtube/GregIsenberg/2025-07-28_making-with-microsaas-i-might-delete-this.md
0.498  summaries/youtube/danielbarada/2026-01-16_how-to-start-a-one-person-business-in-8-hours-starting-with.md
0.471  summaries/youtube/starterstory/2024-11-22_how-i-built-it-12k-month-micro-saas.md
```
The second query has zero exact keyword matches anywhere — not in the filenames, not in the file content. “Building”, “recurring”, “revenue”, “online” appear in none of them. It found the right files because it understood what I was asking, not because it matched the words.
Why Not Just Use Keyword Search (FTS)?
Keyword search (BM25/FTS) does well on an LLM-generated wiki because the vocabulary is consistent. But the edge cases compound:
- “building an audience” → FTS misses files about “community growth” or “follower acquisition”
- “passive income” → FTS misses “recurring revenue” and “monetization strategy”
- “staying motivated” → FTS misses “discipline systems”, “habit stacking”, “consistency”
The semantic model understands these as the same concept. On a knowledge base that covers varied topics, those misses matter.
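A toy illustration of why those misses happen — plain keyword matching has no notion of meaning (file names and text here are made up):

```python
summaries = {
    "community.md": "community growth comes from consistent follower acquisition",
    "audience.md": "building an audience starts with picking one channel",
}

query = "building an audience"
matches = [path for path, text in summaries.items() if query in text]
print(matches)  # ['audience.md'] — community.md is invisible to keyword search
```

Real FTS engines stem and tokenize rather than substring-match, but the failure mode is the same: no shared words, no hit.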
Maintenance
When to re-index: after any batch of new summaries.
```shell
python3 scripts/rag_index.py
# 224 files → 1012 chunks in ~30 seconds on CPU
```
What gets indexed: only summaries/, not raw transcripts. Summaries are dense and consistently structured, which gives the model cleaner signal.
Storage: rag.db is about 4MB for 224 files. Add *.db to .gitignore — it’s a derived artifact, always rebuildable from your summaries.
The Bigger Picture
Karpathy’s architecture is solid: raw/ → LLM wiki → query in Obsidian. The semantic search layer isn’t a replacement for good index files. It’s a retrieval stage that runs before Claude reads anything, so it only reads what’s relevant.
Without search:

```
grep broadly → 20 files → read all → ~40k tokens → answer
```

With search:

```
rag_search.py → 5 files → read those → ~5k tokens → answer
```
As the wiki grows to 500 or 1,000 files, the grep approach breaks down. The search layer keeps per-query cost flat regardless of size.
Again, I’m not an expert. I vibe coded this to reduce token costs on my own setup, not to build the perfect RAG pipeline. There are probably smarter chunking strategies, better models, hybrid approaches that combine semantic and keyword search. I don’t know what I don’t know. If you’ve done something similar and have thoughts on how to improve this, I’d genuinely love to hear it. Drop a reply or a DM. I’m here to learn.