AI search

Retrieval-augmented generation for the working schlub

February 6, 2024 — April 18, 2025

boring
computers are awful together
faster pussycat
incentive mechanisms
mind
NLP
provenance
search
wonk

Placeholder on AI search. Retrieval-augmented generation etc. Vector databases.

Theory and practice.

1 This blog has AI indexing

Those “suspiciously similar posts” links at the top of the page are generated by an AI model that indexes the text and gives each post a topic embedding. This is a naïve but easy and effective way to find similar posts.

Non-obvious discovery: I auditioned two embedding models for the task, nomic-ai/nomic-embed-text-v1.5 and mixedbread-ai/mxbai-embed-large-v1. The latter gave more intuitively correct results even though it ignores most of the post, looking only at the first 512 tokens, which is basically the title, categories, and a paragraph or two. nomic's 8192-token window covers most of a post, but that seemed to produce generally worse results. I am also curious about the SPECTER2 embeddings, which are apparently good for scientific text, but they have a rather different API so I didn't hot-swap them in for testing.
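Concretely, the recipe looks something like the sketch below. This is not the actual site script, just a minimal illustration using the mxbai model via the standard sentence-transformers API; the post slugs and texts are made-up stand-ins.

```python
# Sketch of "similar posts": embed each post once, then rank neighbours
# by cosine similarity. Slugs and texts are hypothetical stand-ins.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

posts = {
    "ai-search": "AI search. Retrieval-augmented generation, vector databases...",
    "nlp": "Natural language processing. Language models, tokenisation...",
    "search": "Search. Indexing, ranking, relevance...",
}
slugs = list(posts)
# With normalised embeddings, the dot product *is* cosine similarity.
emb = model.encode([posts[s] for s in slugs], normalize_embeddings=True)
sims = emb @ emb.T

for i, slug in enumerate(slugs):
    order = np.argsort(-sims[i])  # indices in descending similarity
    neighbours = [slugs[j] for j in order if j != i][:2]
    print(slug, "->", neighbours)
```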

The Algolia search that provides the search box at the top of the page is presumably similar in terms of the tech that makes it go. However, it is run by a third party who serves the content from their own servers, so I cannot really speak to what they are doing behind the scenes.

The script is open source; you can download it as similar_posts_static_site.py.

2 Retrieval-augmented generation theory

TBD

3 Interesting embeddings

3.1 Generic text embeddings

3.2 Specialised for scientific text

SPECTER2: Adapting scientific document embeddings to multiple fields and task formats:

Models like SPECTER and SciNCL are adept at embedding scientific documents as they are specifically trained so that papers close to each other in the citation network are close in the embedding space as well. For each of these models, the input paper text is represented by a combination of its title and abstract. SPECTER, released in 2020, supplies embeddings for a variety of our offerings at Semantic Scholar - user research feeds, author name disambiguation, paper clustering, and many more! Along with SPECTER, we also released SciDocs - a benchmark of 7 tasks for evaluating the efficacy of scientific document embeddings. SciNCL, which came out last year, improved upon SPECTER by relying on nearest-neighbour sampling rather than hard citation links to generate training examples.

This model and its ilk are squarely targeted at research discovery, and they are good enough at it that you might argue they have "solved" the knowledge-topology problem for scientific papers.

I implemented a search engine for ICLR 2025 using the SPECTER2 embeddings and was impressed with the quality of the results. Note that the API is a little different from the default Hugging Face API used by mxbai et al.; you need the adapters library.
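For reference, the usage looks roughly like this, following the allenai/specter2 model card (the example paper is a stand-in, and the details may drift across adapters library versions):

```python
# Sketch of SPECTER2 document embedding via the adapters library.
from transformers import AutoTokenizer
from adapters import AutoAdapterModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoAdapterModel.from_pretrained("allenai/specter2_base")
# The "allenai/specter2" adapter is the proximity adapter, intended
# for retrieval and document-similarity tasks.
model.load_adapter("allenai/specter2", source="hf",
                   load_as="specter2", set_active=True)

papers = [{"title": "A stand-in paper title",
           "abstract": "A stand-in abstract."}]
# SPECTER-style input: title [SEP] abstract.
text_batch = [p["title"] + tokenizer.sep_token + (p.get("abstract") or "")
              for p in papers]
inputs = tokenizer(text_batch, padding=True, truncation=True,
                   return_tensors="pt", max_length=512)
output = model(**inputs)
# The [CLS] token embedding serves as the document embedding.
embeddings = output.last_hidden_state[:, 0, :]
```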

4 Tools

ChromaDB is a vector database with a focus on search and retrieval. I used it to store the vector embeddings behind the "similar posts" feature on this site, and I can report it was incredibly simple for my use case, and scales well to thousands of documents at least. It is backed by SQLite.
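The whole round trip is a few lines. A minimal sketch, assuming a reasonably recent chromadb (the PersistentClient API, 0.4+); the ids and three-dimensional toy vectors stand in for real posts and embeddings:

```python
import chromadb

# Persistent local client; data lands in an on-disk, SQLite-backed store.
client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection(name="posts")

# Store precomputed embeddings alongside ids and metadata.
collection.add(
    ids=["ai-search", "nlp"],
    embeddings=[[0.1, 0.2, 0.3], [0.2, 0.1, 0.4]],  # toy vectors
    metadatas=[{"title": "AI search"}, {"title": "NLP"}],
)

# Nearest neighbours to a query vector.
results = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=2)
print(results["ids"], results["distances"])
```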

5 Incoming
