AI search
Retrieval-augmented generation for the working schlub
2024-02-06 — 2026-04-06
Wherein whole-document cosine similarity and paragraph-level chunked retrieval are each brought to bear upon a small corpus, with learned vector embeddings substituted for classical term-frequency metrics.
This page collects notes on AI search over a bounded corpus (i.e. something small like this blog, not big like the internet), via two worked examples.
The first is a whole-document similarity index (“related posts”); the second is a chunked retrieval search (“find where I wrote about X”). Both rely on vector embeddings, but they make very different architectural choices.
The underlying problem is the same one that classical information retrieval solves — find documents similar to a query — but with learned embeddings replacing hand-crafted metrics like TF-IDF and BM25.
BM25 (“Best Matching 25”) is a term-frequency scoring function from the 1990s that improves on raw TF-IDF by adding document length normalisation and diminishing returns for repeated terms. It’s the default ranking algorithm in Lucene, Elasticsearch, and most traditional search engines. BM25 is fast and surprisingly hard to beat for keyword queries — if someone searches for “particle filter” and a document contains those exact words, BM25 will find it reliably. Where it falls down is semantic similarity: it can’t know that “sequential Monte Carlo” means the same thing, because it only counts words, it doesn’t understand them.
The old tools (Lucene, Xapian, Sphinx) induce a vector space from term frequencies; the new ones induce it from neural networks. The geometry is the same (cosine similarity over high-dimensional vectors), but the learned metrics capture semantic similarity rather than lexical overlap, which is a dramatic improvement in practice. BM25 is still useful as a complementary signal — especially for exact keyword matches — which is why QMD combines both.
For background on vector databases and embeddings see those respective pages.
1 Example 1: In-memory similarity search
Those “suspiciously similar posts” links at the top of the page are generated by a kernel similarity search over the whole corpus, implemented as a plain matrix multiply in numpy. This is a naïve but effective approach, and instructive because it shows how simple the core operation is before we add infrastructure.
1.1 How it works
Each post is embedded as a single vector (1024 dimensions via mxbai-embed-large, run locally through Ollama). The embedding model sees only the first 512 tokens of each post — roughly the title, categories, and a paragraph or two. That’s the entire “understanding” of the post, which is surprisingly sufficient for finding related content.
The similarity computation is a dot product of L2-normalised vectors, which gives cosine similarity: \(S = Q E^\top\).
That’s it. Q is the query matrix (the posts we want neighbours for), E is the full document embedding matrix, and numpy handles the rest. For ~1700 posts with 1024-dimensional embeddings the matrix is about 7 MB. Top-k selection uses np.argpartition, which is \(\mathcal{O}(N)\) per query rather than \(\mathcal{O}(N \log N)\) for a full sort — but at this scale even a full sort would be fast.
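A minimal sketch of the whole operation, self-neighbours excluded (function names are illustrative; the real script adds caching and metadata handling):

```python
import numpy as np

def top_k_similar(E: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest posts for every post.

    E is the (n_posts, n_dims) embedding matrix.
    """
    # L2-normalise rows so the dot product is cosine similarity
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = E @ E.T                    # full pairwise similarity matrix
    np.fill_diagonal(S, -np.inf)   # a post is not its own neighbour
    # argpartition is O(N) per row; sorting only the k survivors is cheap
    idx = np.argpartition(-S, k, axis=1)[:, :k]
    rows = np.arange(S.shape[0])[:, None]
    order = np.argsort(-S[rows, idx], axis=1)
    return np.take_along_axis(idx, order, axis=1)

# toy corpus: 4 "posts" in 3 dimensions
E = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.9, 0.1]])
print(top_k_similar(E, k=1))  # each post's single nearest neighbour
```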
The embeddings live in a compressed .npz file (stored as float16 on disk), alongside per-document metadata (title, content hash, categories). Incremental updates use a blake2s hash of the truncated text, so unchanged posts are skipped on re-index.
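The skip logic amounts to a few lines; a sketch, with the 512-token truncation approximated by a character cutoff:

```python
import hashlib

def content_key(text: str, max_chars: int = 512 * 4) -> str:
    """Hash the truncated post text; unchanged posts keep the same key."""
    truncated = text[:max_chars]  # crude char-based stand-in for 512 tokens
    return hashlib.blake2s(truncated.encode("utf-8")).hexdigest()

def needs_reembed(text: str, cache: dict) -> bool:
    """True if the post changed (or is new) since the last index run."""
    return cache.get("hash") != content_key(text)

cache = {"hash": content_key("Hello world post body")}
print(needs_reembed("Hello world post body", cache))  # unchanged post: skip
print(needs_reembed("Edited post body", cache))       # changed post: re-embed
```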
The output is one JSON file per page in related/, which the client-side template fetches to render the “suspiciously similar” links.
1.2 Embedding model choice
I auditioned two models: nomic-ai/nomic-embed-text-v1.5 (8192 token context) and mixedbread-ai/mxbai-embed-large-v1 (512 token context). Counter-intuitively, mxbai gave more intuitively correct results despite seeing far less of each post. I suspect this is because for similarity (as opposed to retrieval), the title and opening paragraph carry most of the topical signal, and a shorter context forces the model to focus on that signal rather than diluting it with the body text.
I’m curious about the SPECTER2 embeddings, which are apparently good for scientific text, but the API is rather different so I didn’t hot-swap them in for testing.
1.3 Why this works at blog scale
The entire approach — embed everything, dump to a flat numpy array, brute-force cosine distance — is viable because ~2000 documents is tiny. There’s no approximate nearest neighbour index, no vector database, no sharding. The full pairwise similarity matrix is 2000×2000, which fits in L2 cache.
The script is open source. You can download it from similar_posts_static_site.py.
However, the version that runs this site has many improvements I did not include there, sorry. Nag me on GitHub if you want the latest version.
2 Example 2: Local semantic search with QMD
The similarity index described above answers “what posts are related to this one?” but it doesn’t answer “where did I write about topic X?” — a retrieval problem rather than a similarity problem. These are architecturally different: similarity works best with whole-document embeddings (as I do with mxbai-embed-large), while retrieval benefits from sub-document chunking so we can find the right passage inside a long post.
QMD (by Tobi Lütke) is a local CLI search engine that combines BM25 full-text search, vector semantic search, and LLM reranking — all running locally via node-llama-cpp with GGUF models. It brings its own embedding model rather than reusing the Ollama mxbai-embed-large I use for similarity, which is arguably correct: different models for different tasks.
2.1 Installation
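Assuming the package name implied by the npx invocation below, a global install would be:

```shell
npm install -g @tobilu/qmd
```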
Or run without installing via npx @tobilu/qmd.
2.2 Indexing the blog
Register the blog content directories as a collection and generate embeddings. Note the explicit --mask flag — QMD defaults to **/*.md and silently ignores unrecognised flags, so without --mask it will only find plain markdown files. Restricting to **/*.qmd ensures only Quarto source content is indexed.
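A sketch of the shape of the commands. The collection subcommand and `--name` flag are my guesses from the CLI's general shape; the `--mask` flag is the important part, and `qmd --help` is the authority for the rest:

```shell
# register the Quarto sources as a collection; --mask is required,
# otherwise only **/*.md is picked up
qmd collection add ~/blog/posts --name blog --mask "**/*.qmd"

# chunk at paragraph boundaries and embed via the bundled GGUF model
qmd embed
```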
The embed step runs a local GGUF embedding model via llama.cpp. It chunks documents at paragraph boundaries by default (the paragraph chunking strategy), which is the right granularity for prose blog posts. For source code we could use the ast strategy which does AST-aware chunking, but that’s not relevant here.
2.3 Search
QMD exposes three search tiers:
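As I understand the CLI, the tiers map to subcommands roughly like this (verify against `qmd --help`):

```shell
qmd search "particle filter"    # BM25 keyword search: fast, exact-match
qmd vsearch "particle filter"   # vector search: semantic similarity
qmd query "particle filter"     # hybrid: BM25 + vectors + LLM reranking
```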
The hybrid query mode is usually what I want. We can scope to the blog collection with -c blog, get JSON output with --json, or retrieve the full document body of a result with qmd get "#docid".
2.4 Opening results in VS Code
As of the latest trunk build, QMD emits clickable OSC 8 terminal hyperlinks in its search results. In a modern terminal (iTerm2, Kitty, WezTerm, Ghostty), each result path is a clickable link that opens the file at the matching line in your editor — like <a href> in HTML, but in the terminal. The URI scheme is configurable, so it works with VS Code (vscode://file/), Cursor (cursor://file/), Zed, etc.
For older terminals or cases where OSC 8 isn’t supported, a fish shell wrapper still works:
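A minimal wrapper might look like the following. The JSON field name is an assumption; inspect the actual `qmd query --json` output for the real shape:

```fish
function qmds --description 'qmd query, open top hit in VS Code'
    # assumes results are a JSON array with a `file` field per hit
    set -l hit (qmd query --json $argv | jq -r '.[0].file')
    if test -n "$hit"
        code "$hit"
    end
end
```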
As of early 2026, there is AFAICT no good VS Code extension for semantic search over prose markdown. The extensions that exist — Zilliz Semantic Code Search, sturdy-dev/semantic-code-search — are designed around code structure (AST-based chunking of functions and classes). They list Markdown as a supported filetype, but they don’t chunk prose at paragraph boundaries or handle frontmatter, math blocks, or callouts.
The markdown knowledge base extensions (Foam, Markdown Memo) handle wikilinks and graph visualisation but don’t do vector search at all.
So for now, a CLI tool + shell wrapper is the state of the art for “semantic search my 2000 prose files and open the result in my editor.”
2.5 MCP server for Claude integration
QMD also runs as an MCP server, which means it can be wired into Claude Desktop, Claude Code, or Cursor. It exposes query, get, multi_get, and status as MCP tools, so I can ask Claude “search my blog for posts about kernel methods” and it will use QMD behind the scenes.
Claude Code — register globally with:
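A sketch, assuming `qmd mcp` starts the server (check `qmd --help` for the actual subcommand):

```shell
claude mcp add qmd --scope user -- qmd mcp
```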
Or scope it to the blog repo by adding a .mcp.json to the repo root:
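Something like the following, with the same assumption that `qmd mcp` is the server entry point:

```json
{
  "mcpServers": {
    "qmd": {
      "command": "qmd",
      "args": ["mcp"]
    }
  }
}
```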
Claude Desktop — add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS), merging into the existing mcpServers object.
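Again assuming `qmd mcp` is the server subcommand, the entry to merge looks like:

```json
{
  "mcpServers": {
    "qmd": { "command": "qmd", "args": ["mcp"] }
  }
}
```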
2.6 Architecture notes
The key design decisions that make QMD a good fit here:
- Chunking is paragraph-level by default, which is the right grain for retrieval over prose. My existing index-blog pipeline embeds whole documents because it’s solving a different problem (post-level similarity).
- BM25 + vectors + reranking is the standard three-stage retrieval pipeline from the information retrieval literature. Having all three locally is nice.
- SQLite-backed — no separate vector database server to manage. My existing similarity pipeline uses numpy .npz files and per-doc JSON, which is a different approach but shares the same virtue: everything is local files, no infrastructure.
- MCP and CLI from one tool — no need to run two separate services to get both terminal and Claude access.
- Separate from the similarity pipeline — the two systems serve different purposes with different optimal chunking strategies, so keeping them independent is the right call even though it means two embedding models in memory.
2.7 Alternative: txtai for composable pipelines
QMD is opinionated — it bundles its own embedding model, chunker, BM25 index, and reranker into one tool. That’s a strength when we want to get running quickly, but a limitation when its choices don’t suit the corpus.
txtai takes the opposite approach: it’s a Python framework where each stage of the pipeline — chunking, embedding, indexing, retrieval, reranking — is a swappable component. We could plug in Ollama with mxbai-embed-large (reusing the same model as the similarity pipeline), write a custom chunker that understands .qmd frontmatter and fenced code blocks, and choose our own ANN backend. It also exposes a REST API, so wrapping it as an MCP server would be straightforward.
The tradeoff is assembly time: QMD took five minutes to set up; a txtai pipeline would take an afternoon. But if QMD’s regex paragraph chunker starts mangling our math blocks or YAML frontmatter, having the option to swap in something that understands our document format is worth knowing about.
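To give a feel for the assembly, here is a sketch of a txtai pipeline with a custom frontmatter-aware chunker. I have not run this exact configuration; the model path, chunker, and document shape are placeholders:

```python
from txtai import Embeddings

# hybrid = dense vectors + BM25; content = store the text in SQLite
embeddings = Embeddings(
    path="mixedbread-ai/mxbai-embed-large-v1",  # reuse the similarity model
    hybrid=True,
    content=True,
)

def chunk_qmd(text: str) -> list[str]:
    """Placeholder chunker: strip YAML frontmatter, split on blank lines."""
    if text.startswith("---"):
        _, _, text = text.partition("\n---\n")
    return [p for p in text.split("\n\n") if p.strip()]

docs = {"post1.qmd": "---\ntitle: x\n---\n\nParticle filters...\n\nMore prose."}
embeddings.index(
    (f"{path}#{i}", chunk, None)
    for path, text in docs.items()
    for i, chunk in enumerate(chunk_qmd(text))
)
print(embeddings.search("sequential Monte Carlo", 3))
```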
3 When we outgrow brute force
Both examples above work because the corpus is small — ~2000 documents, a few thousand chunks. At that scale the choice of infrastructure barely matters; everything fits in memory and every query is fast. Here’s what changes as we scale up, and how the RAG pipeline decomposes into separately-scalable stages.
3.1 Decomposing the retrieval pipeline
A vector search system has three stages:
- Chunking — split documents into passages. For retrieval we want sub-document chunks (paragraphs, sections); for similarity we might embed whole documents. The right chunk size depends on the task, not the corpus size.
- Embedding — map each chunk to a vector. This is a one-time batch job (plus incremental updates). The cost scales linearly with corpus size and is dominated by model inference time. My blog takes a few minutes to embed from scratch via Ollama on a laptop; at 1M+ documents we’d want an embedding API or distributed GPU inference.
- Indexing and retrieval — given a query vector, find the nearest chunks. This is where scale matters most. At 2000 documents, brute-force cosine distance is fine. At 100k+ documents, we need an approximate nearest-neighbour (ANN) index — see vector databases for the options and tradeoffs.
Both examples on this page stop here — they return documents or passages, which is all we need for “related posts” or “find where I wrote about X”.
Retrieval-augmented generation (RAG) adds a fourth stage: feed the retrieved chunks to an LLM as context and generate a synthesised answer. This doesn’t change the retrieval infrastructure (we still only pass the top-k results to the LLM), but it does mean chunk quality directly determines generation quality — garbage in, garbage out. RAG is the standard architecture behind “chat with your documents” products and enterprise search assistants.
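The fourth stage is mostly prompt plumbing. A toy sketch with retrieval over fake embeddings and the LLM call left as a placeholder:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, E: np.ndarray, chunks: list[str], k: int = 2) -> list[str]:
    """Brute-force top-k retrieval over L2-normalised chunk embeddings."""
    scores = E @ query_vec
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

def build_prompt(question: str, context: list[str]) -> str:
    """Stuff the retrieved chunks into the LLM context window."""
    blocks = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return f"Answer using only the sources below.\n\n{blocks}\n\nQuestion: {question}"

# toy corpus with fake 2-d embeddings (already normalised)
chunks = ["Particle filters approximate the posterior.",
          "Bread rises because of yeast."]
E = np.array([[1.0, 0.0], [0.0, 1.0]])
prompt = build_prompt("What do particle filters do?",
                      retrieve(np.array([0.9, 0.1]), E, chunks, k=1))
# answer = llm(prompt)   # any local or hosted model goes here
print(prompt)
```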
3.2 Infrastructure spectrum
At the small end (this blog), the “stack” is:
Ollama → numpy .npz → Q @ E.T → JSON files
At the large end (production RAG), we’d see something like:
Embedding API → Vector database (Pinecone, Qdrant, Weaviate) → ANN retrieval → Reranker → LLM
The Algolia search that powers the search box at the top of the page presumably uses similar technology under the hood. However, it’s run by a third party who serves the content from their own servers, so I can’t really speak to what they’re doing behind the scenes.
The middle ground — say, 10k–100k documents — is where tools like QMD, ChromaDB, and LanceDB live. They provide an ANN index and persistence without requiring us to run a separate database server.
If you’re wondering whether you need a vector database: if your corpus fits in a single numpy array and queries take less than a second, you don’t. The infrastructure is there to solve problems you might not have yet. See vector databases for when and why you might.
4 Interesting embeddings
4.1 Generic text embeddings
- Nomic embeddings: Introducing Nomic Embed: A Truly Open Embedding Model are small, fast, and open. I trialled them for this blog and they were OK but not amazing, even though they have a big context window.
- mxbai embeddings: Open Source Strikes Bread - New Fluffy Embedding Model | Mixedbread These embeddings are generated by a relatively large model that uses a relatively small number of tokens. Counter-intuitively, they were great at classifying the text of this blog, even though they only look at the first 512 tokens.
4.2 Specialised for scientific text
SPECTER2: Adapting scientific document embeddings to multiple fields and task formats:
Models like SPECTER and SciNCL are adept at embedding scientific documents as they are specifically trained so that papers close to each other in the citation network are close in the embedding space as well. For each of these models, the input paper text is represented by a combination of its title and abstract. SPECTER, released in 2020, supplies embeddings for a variety of our offerings at Semantic Scholar - user research feeds, author name disambiguation, paper clustering, and many more! Along with SPECTER, we also released SciDocs - a benchmark of 7 tasks for evaluating the efficacy of scientific document embeddings. SciNCL, which came out last year, improved upon SPECTER by relying on nearest-neighbour sampling rather than hard citation links to generate training examples.
This model and its ilk are truly targeted at research discovery, and they are so good at it that we might argue they have “solved” the knowledge topology problem for scientific papers.
I implemented a search engine for ICLR 2025 using the SPECTER2 embeddings and I was impressed with the quality of the results. Note the API is a little different from the default huggingface API used by mxbai et al.; we need to use the “adapters” library.
5 Tools
ChromaDB is a vector database with a focus on search and retrieval. I used it to store vector embeddings for the “similar posts” feature on this site, and I can report it was incredibly simple for my use case, scaling comfortably to at least thousands of documents. It’s based on SQLite.
6 Internet search
6.1 Commercial services searching the internet
See internet search.
6.2 Free/FOSS-ish internet search
Like Perplexity, but local.
- nilsherzig/LLocalSearch: LLocalSearch is a completely locally running search aggregator using LLM Agents. The user can ask a question and the system will use a chain of LLMs to find the answer. The user can see the progress of the agents and the final answer. No OpenAI or Google API keys are needed.
- nashsu/FreeAskInternet: FreeAskInternet is a completely free, private, and locally running search aggregator and answer generator using multiple LLMs, with no GPU needed. The user can ask a question and the system will run a multi-engine search, feed the combined results to an LLM, and generate an answer grounded in the search results. It’s all free to use.
6.3 Web scraping for RAG pipelines
Firecrawl is a web scraping and content extraction service that converts web pages into clean markdown — the format you’d want for feeding into a RAG pipeline or building a search index over web content. It exposes an MCP server (docs) with tools for scraping single pages, batch processing, site crawling, and web search, so an LLM agent can pull in web content as context. Supports both cloud and self-hosted deployment.
This is complementary to local search tools like QMD — Firecrawl gets the content off the web; QMD (or txtai, or your own pipeline) indexes it locally.
