AI search
Retrieval-augmented generation for the working schlub
2024-02-06 — 2026-04-06
Wherein whole-document cosine similarity and paragraph-level chunked retrieval are each brought to bear upon a small corpus, with learned vector embeddings substituted for classical term-frequency metrics.
This page collects notes on AI search over a bounded corpus (i.e. something small like this blog, not big like the internet), via two worked examples.
The first is a whole-document similarity index (“related posts”); the second is a chunked retrieval search (“find where I wrote about X”). Both rely on vector embeddings, but they make very different architectural choices.
The underlying problem is the same one that classical information retrieval solves — find documents similar to a query — but with learned embeddings replacing hand-crafted metrics like TF-IDF and BM25.
BM25 (“Best Matching 25”) is a term-frequency scoring function from the 1990s that improves on raw TF-IDF by adding document length normalisation and diminishing returns for repeated terms. It’s the default ranking algorithm in Lucene, Elasticsearch, and most traditional search engines. BM25 is fast and surprisingly hard to beat for keyword queries — if someone searches for “particle filter” and a document contains those exact words, BM25 will find it reliably. Where it falls down is semantic similarity: it can’t know that “sequential Monte Carlo” means the same thing, because it only counts words, it doesn’t understand them.
The old tools (Lucene, Xapian, Sphinx) induce a vector space from term frequencies; the new ones induce it from neural networks. The geometry is the same (cosine similarity over high-dimensional vectors), but the learned metrics capture semantic similarity rather than lexical overlap, which is a dramatic improvement in practice. BM25 is still useful as a complementary signal — especially for exact keyword matches — which is why QMD combines both.
For background on vector databases and embeddings see those respective pages.
1 Example 1: In-memory similarity search
Those “suspiciously similar posts” links at the top of the page are generated by a kernel similarity search over the whole corpus, implemented as a plain matrix multiply in numpy. This is a naïve but effective approach, and instructive because it shows how simple the core operation is before we add infrastructure.
1.1 How it works
Each post is embedded as a single vector (1024 dimensions via mxbai-embed-large, run locally through Ollama). The embedding model sees only the first 512 tokens of each post — roughly the title, categories, and a paragraph or two. That’s the entire “understanding” of the post, which is surprisingly sufficient for finding related content.
The similarity computation is a dot product of L2-normalised vectors, which gives cosine similarity: \(S = Q E^\top\).
That’s it. Q is the query matrix (the posts we want neighbours for), E is the full document embedding matrix, and numpy handles the rest. For ~1700 posts with 1024-dimensional embeddings the matrix is about 7 MB. Top-k selection uses np.argpartition, which is \(\mathcal{O}(N)\) per query rather than \(\mathcal{O}(N \log N)\) for a full sort — but at this scale even a full sort would be fast.
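A minimal sketch of the whole operation, self-neighbours excluded (function names are illustrative; the real script adds caching and metadata handling):

```python
import numpy as np

def top_k_similar(E: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest posts for every post.

    E is the (n_posts, n_dims) embedding matrix.
    """
    # L2-normalise rows so the dot product is cosine similarity
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = E @ E.T                    # full pairwise similarity matrix
    np.fill_diagonal(S, -np.inf)   # a post is not its own neighbour
    # argpartition is O(N) per row; sorting only the k survivors is cheap
    idx = np.argpartition(-S, k, axis=1)[:, :k]
    rows = np.arange(S.shape[0])[:, None]
    order = np.argsort(-S[rows, idx], axis=1)
    return np.take_along_axis(idx, order, axis=1)

# toy corpus: 4 "posts" in 3 dimensions
E = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.9, 0.1]])
print(top_k_similar(E, k=1))  # each post's single nearest neighbour
```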
The embeddings live in a compressed .npz file (stored as float16 on disk), alongside per-document metadata (title, content hash, categories). Incremental updates use a blake2s hash of the truncated text, so unchanged posts are skipped on re-index.
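The skip logic amounts to a few lines; a sketch, with the 512-token truncation approximated by a character cutoff:

```python
import hashlib

def content_key(text: str, max_chars: int = 512 * 4) -> str:
    """Hash the truncated post text; unchanged posts keep the same key."""
    truncated = text[:max_chars]  # crude char-based stand-in for 512 tokens
    return hashlib.blake2s(truncated.encode("utf-8")).hexdigest()

def needs_reembed(text: str, cache: dict) -> bool:
    """True if the post changed (or is new) since the last index run."""
    return cache.get("hash") != content_key(text)

cache = {"hash": content_key("Hello world post body")}
print(needs_reembed("Hello world post body", cache))  # unchanged post: skip
print(needs_reembed("Edited post body", cache))       # changed post: re-embed
```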
The output is one JSON file per page in related/, which the client-side template fetches to render the “suspiciously similar” links.
1.2 Embedding model choice
I auditioned two models: nomic-ai/nomic-embed-text-v1.5 (8192 token context) and mixedbread-ai/mxbai-embed-large-v1 (512 token context). Counter-intuitively, mxbai gave more intuitively correct results despite seeing far less of each post. I suspect this is because for similarity (as opposed to retrieval), the title and opening paragraph carry most of the topical signal, and a shorter context forces the model to focus on that signal rather than diluting it with the body text.
I’m curious about the SPECTER2 embeddings, which are apparently good for scientific text, but the API is rather different so I didn’t hot-swap them in for testing.
1.3 Why this works at blog scale
The entire approach — embed everything, dump to a flat numpy array, brute-force cosine distance — is viable because ~2000 documents is tiny. There’s no approximate nearest neighbour index, no vector database, no sharding. The full pairwise similarity matrix is 2000×2000, which fits in L2 cache.
The script is open source. You can download it from similar_posts_static_site.py.
However, the version that runs this site has many improvements I did not include there, sorry. Nag me on GitHub if you want the latest version.
2 Example 2: Local semantic search with QMD
The similarity index described above answers “what posts are related to this one?” but it doesn’t answer “where did I write about topic X?” — a retrieval problem rather than a similarity problem. These are architecturally different: similarity works best with whole-document embeddings (as I do with mxbai-embed-large), while retrieval benefits from sub-document chunking so we can find the right passage inside a long post.
QMD (by Tobi Lütke) is a local CLI search engine that combines BM25 full-text search, vector semantic search, and LLM reranking — all running locally via node-llama-cpp with GGUF models. It brings its own embedding model rather than reusing the Ollama mxbai-embed-large I use for similarity, which is arguably correct: different models for different tasks.
2.1 Installation
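Assuming the package name implied by the npx invocation below, a global install would be:

```shell
npm install -g @tobilu/qmd
```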
Or run without installing via npx @tobilu/qmd.
2.2 Indexing the blog
Register the blog content directories as a collection and generate embeddings. Note the explicit --mask flag — QMD defaults to **/*.md and silently ignores unrecognised flags, so without --mask it will only find plain markdown files. Restricting to **/*.qmd ensures only Quarto source content is indexed.
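A sketch of the shape of the commands. The collection subcommand and `--name` flag are my guesses from the CLI's general shape; the `--mask` flag is the important part, and `qmd --help` is the authority for the rest:

```shell
# register the Quarto sources as a collection; --mask is required,
# otherwise only **/*.md is picked up
qmd collection add ~/blog/posts --name blog --mask "**/*.qmd"

# chunk at paragraph boundaries and embed via the bundled GGUF model
qmd embed
```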
The embed step runs a local GGUF embedding model via llama.cpp. It chunks documents at paragraph boundaries by default (the paragraph chunking strategy), which is the right granularity for prose blog posts. For source code we could use the ast strategy which does AST-aware chunking, but that’s not relevant here.
2.3 Search
QMD exposes three search tiers:
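As I understand the CLI, the tiers map to subcommands roughly like this (verify against `qmd --help`):

```shell
qmd search "particle filter"    # BM25 keyword search: fast, exact-match
qmd vsearch "particle filter"   # vector search: semantic similarity
qmd query "particle filter"     # hybrid: BM25 + vectors + LLM reranking
```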
The hybrid query mode is usually what I want. We can scope to the blog collection with -c blog, get JSON output with --json, or retrieve the full document body of a result with qmd get "#docid".
2.4 Opening results in VS Code
As of the latest trunk build, QMD emits clickable OSC 8 terminal hyperlinks in its search results. In a modern terminal (iTerm2, Kitty, WezTerm, Ghostty), each result path is a clickable link that opens the file at the matching line in your editor — like <a href> in HTML, but in the terminal. The URI scheme is configurable, so it works with VS Code (vscode://file/), Cursor (cursor://file/), Zed, etc.
For older terminals or cases where OSC 8 isn’t supported, a fish shell wrapper still works:
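A minimal wrapper might look like the following. The JSON field name is an assumption; inspect the actual `qmd query --json` output for the real shape:

```fish
function qmds --description 'qmd query, open top hit in VS Code'
    # assumes results are a JSON array with a `file` field per hit
    set -l hit (qmd query --json $argv | jq -r '.[0].file')
    if test -n "$hit"
        code "$hit"
    end
end
```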
As of early 2026, there is AFAICT no good VS Code extension for semantic search over prose markdown. The extensions that exist — Zilliz Semantic Code Search, sturdy-dev/semantic-code-search — are designed around code structure (AST-based chunking of functions and classes). They list Markdown as a supported filetype, but they don’t chunk prose at paragraph boundaries or handle frontmatter, math blocks, or callouts.
The markdown knowledge base extensions (Foam, Markdown Memo) handle wikilinks and graph visualisation but don’t do vector search at all.
So for now, a CLI tool + shell wrapper is the state of the art for “semantic search my 2000 prose files and open the result in my editor.”
2.5 MCP server for Claude integration
QMD also runs as an MCP server, which means it can be wired into Claude Desktop, Claude Code, or Cursor. It exposes query, get, multi_get, and status as MCP tools, so I can ask Claude “search my blog for posts about kernel methods” and it will use QMD behind the scenes.
Claude Code — register globally with:
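A sketch, assuming `qmd mcp` starts the server (check `qmd --help` for the actual subcommand):

```shell
claude mcp add qmd --scope user -- qmd mcp
```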
Or scope it to the blog repo by adding a .mcp.json to the repo root:
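Something like the following, with the same assumption that `qmd mcp` is the server entry point:

```json
{
  "mcpServers": {
    "qmd": {
      "command": "qmd",
      "args": ["mcp"]
    }
  }
}
```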
Claude Desktop — add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS), merging into the existing mcpServers object.
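Again assuming `qmd mcp` is the server subcommand, the entry to merge looks like:

```json
{
  "mcpServers": {
    "qmd": { "command": "qmd", "args": ["mcp"] }
  }
}
```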
2.6 Architecture notes
The key design decisions that make QMD a good fit here:
- Chunking is paragraph-level by default, which is the right grain for retrieval over prose. My existing index-blog pipeline embeds whole documents because it’s solving a different problem (post-level similarity).
- BM25 + vectors + reranking is the standard three-stage retrieval pipeline from the information retrieval literature. Having all three locally is nice.
- SQLite-backed — no separate vector database server to manage. My existing similarity pipeline uses numpy .npz files and per-doc JSON, which is a different approach but shares the same virtue: everything is local files, no infrastructure.
- MCP and CLI from one tool — no need to run two separate services to get both terminal and Claude access.
- Separate from the similarity pipeline — the two systems serve different purposes with different optimal chunking strategies, so keeping them independent is the right call even though it means two embedding models in memory.
2.7 Alternative: txtai for composable pipelines
QMD is opinionated — it bundles its own embedding model, chunker, BM25 index, and reranker into one tool. That’s a strength when we want to get running quickly, but a limitation when its choices don’t suit the corpus.
txtai takes the opposite approach: it’s a Python framework where each stage of the pipeline — chunking, embedding, indexing, retrieval, reranking — is a swappable component. We could plug in Ollama with mxbai-embed-large (reusing the same model as the similarity pipeline), write a custom chunker that understands .qmd frontmatter and fenced code blocks, and choose our own ANN backend. It also exposes a REST API, so wrapping it as an MCP server would be straightforward.
The tradeoff is assembly time: QMD took five minutes to set up; a txtai pipeline would take an afternoon. But if QMD’s regex paragraph chunker starts mangling our math blocks or YAML frontmatter, having the option to swap in something that understands our document format is worth knowing about.
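To give a feel for the assembly, here is a sketch of a txtai pipeline with a custom frontmatter-aware chunker. I have not run this exact configuration; the model path, chunker, and document shape are placeholders:

```python
from txtai import Embeddings

# hybrid = dense vectors + BM25; content = store the text in SQLite
embeddings = Embeddings(
    path="mixedbread-ai/mxbai-embed-large-v1",  # reuse the similarity model
    hybrid=True,
    content=True,
)

def chunk_qmd(text: str) -> list[str]:
    """Placeholder chunker: strip YAML frontmatter, split on blank lines."""
    if text.startswith("---"):
        _, _, text = text.partition("\n---\n")
    return [p for p in text.split("\n\n") if p.strip()]

docs = {"post1.qmd": "---\ntitle: x\n---\n\nParticle filters...\n\nMore prose."}
embeddings.index(
    (f"{path}#{i}", chunk, None)
    for path, text in docs.items()
    for i, chunk in enumerate(chunk_qmd(text))
)
print(embeddings.search("sequential Monte Carlo", 3))
```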
3 When we outgrow brute force
Both examples above work because the corpus is small — ~2000 documents, a few thousand chunks. At that scale the choice of infrastructure barely matters; everything fits in memory and every query is fast. Here’s what changes as we scale up, and how the RAG pipeline decomposes into separately-scalable stages.
3.1 Decomposing the retrieval pipeline
A vector search system has three stages:
- Chunking — split documents into passages. For retrieval we want sub-document chunks (paragraphs, sections); for similarity we might embed whole documents. The right chunk size depends on the task, not the corpus size.
- Embedding — map each chunk to a vector. This is a one-time batch job (plus incremental updates). The cost scales linearly with corpus size and is dominated by model inference time. My blog takes a few minutes to embed from scratch via Ollama on a laptop; at 1M+ documents we’d want an embedding API or distributed GPU inference.
- Indexing and retrieval — given a query vector, find the nearest chunks. This is where scale matters most. At 2000 documents, brute-force cosine distance is fine. At 100k+ documents, we need an approximate nearest-neighbour (ANN) index — see vector databases for the options and tradeoffs.
Both examples on this page stop here — they return documents or passages, which is all we need for “related posts” or “find where I wrote about X”.
Retrieval-augmented generation (RAG) adds a fourth stage: feed the retrieved chunks to an LLM as context and generate a synthesised answer. This doesn’t change the retrieval infrastructure (we still only pass the top-k results to the LLM), but it does mean chunk quality directly determines generation quality — garbage in, garbage out. RAG is the standard architecture behind “chat with your documents” products and enterprise search assistants.
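The fourth stage is mostly prompt plumbing. A toy sketch with retrieval over fake embeddings and the LLM call left as a placeholder:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, E: np.ndarray, chunks: list[str], k: int = 2) -> list[str]:
    """Brute-force top-k retrieval over L2-normalised chunk embeddings."""
    scores = E @ query_vec
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

def build_prompt(question: str, context: list[str]) -> str:
    """Stuff the retrieved chunks into the LLM context window."""
    blocks = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return f"Answer using only the sources below.\n\n{blocks}\n\nQuestion: {question}"

# toy corpus with fake 2-d embeddings (already normalised)
chunks = ["Particle filters approximate the posterior.",
          "Bread rises because of yeast."]
E = np.array([[1.0, 0.0], [0.0, 1.0]])
prompt = build_prompt("What do particle filters do?",
                      retrieve(np.array([0.9, 0.1]), E, chunks, k=1))
# answer = llm(prompt)   # any local or hosted model goes here
print(prompt)
```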
3.2 Infrastructure spectrum
At the small end (this blog), the “stack” is:
Ollama → numpy .npz → Q @ E.T → JSON files
At the large end (production RAG), we’d see something like:
Embedding API → Vector database (Pinecone, Qdrant, Weaviate) → ANN retrieval → Reranker → LLM
The Algolia search that powers the search box at the top of the page presumably uses similar technology under the hood. However, it’s run by a third party who serves the content from their own servers, so I can’t really speak to what they’re doing behind the scenes.
The middle ground — say, 10k–100k documents — is where tools like QMD, ChromaDB, and LanceDB live. They provide an ANN index and persistence without requiring us to run a separate database server.
If you’re wondering whether you need a vector database: if your corpus fits in a single numpy array and queries take less than a second, you don’t. The infrastructure is there to solve problems you might not have yet. See vector databases for when and why you might.
4 Interesting embeddings
4.1 Generic text embeddings
- Nomic embeddings: Introducing Nomic Embed: A Truly Open Embedding Model are small, fast, and open. I trialled them for this blog and they were OK but not amazing, even though they have a big context window.
- mxbai embeddings: Open Source Strikes Bread - New Fluffy Embedding Model | Mixedbread These embeddings are generated by a relatively large model that uses a relatively small number of tokens. Counter-intuitively, they were great at classifying the text of this blog, even though they only look at the first 512 tokens.
4.2 Specialised for scientific text
SPECTER2: Adapting scientific document embeddings to multiple fields and task formats:
Models like SPECTER and SciNCL are adept at embedding scientific documents as they are specifically trained so that papers close to each other in the citation network are close in the embedding space as well. For each of these models, the input paper text is represented by a combination of its title and abstract. SPECTER, released in 2020, supplies embeddings for a variety of our offerings at Semantic Scholar - user research feeds, author name disambiguation, paper clustering, and many more! Along with SPECTER, we also released SciDocs - a benchmark of 7 tasks for evaluating the efficacy of scientific document embeddings. SciNCL, which came out last year, improved upon SPECTER by relying on nearest-neighbour sampling rather than hard citation links to generate training examples.
This model and its ilk are truly targeted at research discovery, and they are so good at it that we might argue they have “solved” the knowledge topology problem for scientific papers.
I implemented a search engine for ICLR 2025 using the SPECTER2 embeddings and I was impressed with the quality of the results. Note the API is a little different from the default huggingface API used by mxbai et al.; we need to use the “adapters” library.
5 Tools
ChromaDB is a vector database with a focus on search and retrieval. I used it to store vector embeddings for the “similar posts” feature on this site, and I can report it was incredibly simple for my use case, scaling comfortably to at least thousands of documents. It’s based on SQLite.
6 Internet search
6.1 Commercial services searching the internet
See internet search.
6.2 Free/FOSS-ish internet search
Like Perplexity, but local.
- nilsherzig/LLocalSearch: LLocalSearch is a completely locally running search aggregator using LLM Agents. The user can ask a question and the system will use a chain of LLMs to find the answer. The user can see the progress of the agents and the final answer. No OpenAI or Google API keys are needed.
- nashsu/FreeAskInternet: FreeAskInternet is a completely free, private, and locally running search aggregator and answer generator using multiple LLMs, with no GPU needed. The user can ask a question and the system will run a multi-engine search, feed the combined results to an LLM, and generate an answer grounded in the search results. It’s all free to use.
6.3 Web scraping for RAG pipelines
Firecrawl is a web scraping and content extraction service that converts web pages into clean markdown — the format you’d want for feeding into a RAG pipeline or building a search index over web content. It exposes an MCP server (docs) with tools for scraping single pages, batch processing, site crawling, and web search, so an LLM agent can pull in web content as context. Supports both cloud and self-hosted deployment.
This is complementary to local search tools like QMD — Firecrawl gets the content off the web; QMD (or txtai, or your own pipeline) indexes it locally.
