Placeholder notes on AI search: retrieval-augmented generation, vector databases, and adjacent tooling. Theory and practice.
1 This blog has AI indexing
Those “suspiciously similar posts” links at the top of the page are generated by an AI model that indexes the text and gives each post a topic embedding. This is a naïve but easy and effective way to find similar posts.
Non-obvious discovery: I auditioned two embedding models for the task, nomic-ai/nomic-embed-text-v1.5 and mixedbread-ai/mxbai-embed-large-v1. The latter gave more intuitively correct results even though it ignores most of each post, looking only at the first 512 tokens, which is basically the title, categories, and a paragraph or two. nomic sees most of the post, with an 8192-token context, but that seemed to produce generally worse results. I am curious about the SPECTER2 embeddings, which are reportedly good for scientific text, but they have a rather different API so I didn't hot-swap them in for testing.
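For concreteness, here is a minimal sketch of this kind of similar-posts computation, assuming the sentence-transformers package; the post contents and variable names are illustrative, not the actual site script:

```python
# Embed each post, then link it to its nearest neighbour by cosine similarity.
from sentence_transformers import SentenceTransformer
import numpy as np

posts = {  # illustrative stand-ins for real post sources
    "rag-notes.md": "Retrieval-augmented generation, theory and practice ...",
    "vector-db.md": "Vector databases for search and retrieval ...",
    "gp-regression.md": "Gaussian process regression in practice ...",
}

# mxbai silently truncates inputs to its 512-token window.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
# To audition nomic instead:
#   SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
# and prefix each document with "search_document: ", per its model card.

names = list(posts)
vecs = model.encode([posts[n] for n in names], normalize_embeddings=True)

# With unit-normalised vectors, cosine similarity reduces to a dot product.
sims = vecs @ vecs.T
np.fill_diagonal(sims, -np.inf)  # a post is not "similar" to itself
for i, name in enumerate(names):
    print(name, "->", names[int(sims[i].argmax())])
```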
The Algolia search that provides the search box at the top of the page is presumably similar in terms of the technology that makes it go. However, it is run by a third party who serves the content cleverly from their servers, so I cannot really speak to what they are doing behind the scenes.
The script is open source; you can download it as similar_posts_static_site.py.
2 Retrieval-augmented generation theory
TBD
3 Interesting embeddings
3.1 Generic text embeddings
- Nomic embeddings (Introducing Nomic Embed: A Truly Open Embedding Model) are small, fast, and open. I trialled them for this blog and they were OK but not amazing, even though they have a big 8192-token context window.
- mxbai embeddings (Open Source Strikes Bread - New Fluffy Embedding Model | Mixedbread) come from a relatively large model with a relatively small context. Counter-intuitively, they were great at classifying the text of this blog even though they only see the first 512 tokens; a quick way to check how much of a post survives that truncation is sketched after this list.
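Since truncation is doing a lot of work in that comparison, it is worth checking how many tokens of a post actually fit. A sketch using the model's own tokenizer via the transformers library; the file path is illustrative:

```python
# How much of a post survives mxbai's 512-token truncation?
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mixedbread-ai/mxbai-embed-large-v1")
post_text = open("some_post.md").read()  # illustrative path

ids = tok.encode(post_text)
seen = min(1.0, 512 / len(ids))
print(f"{len(ids)} tokens; the model sees roughly the first {seen:.0%}")
```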
3.2 Specialised for scientific text
SPECTER2: Adapting scientific document embeddings to multiple fields and task formats:
Models like SPECTER and SciNCL are adept at embedding scientific documents as they are specifically trained so that papers close to each other in the citation network are close in the embedding space as well. For each of these models, the input paper text is represented by a combination of its title and abstract. SPECTER, released in 2020, supplies embeddings for a variety of our offerings at Semantic Scholar - user research feeds, author name disambiguation, paper clustering, and many more! Along with SPECTER, we also released SciDocs - a benchmark of 7 tasks for evaluating the efficacy of scientific document embeddings. SciNCL, which came out last year, improved upon SPECTER by relying on nearest-neighbour sampling rather than hard citation links to generate training examples.
This model and its ilk are truly targeted at research discovery, and they are so good at it that you might argue they have “solved” the knowledge topology problem for scientific papers.
I implemented a search engine for ICLR 2025 using the SPECTER2 embeddings and was impressed with the quality of the results. Note the API is a little different from the default huggingface API used by mxbai et al.; you need the adapters library.
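A sketch of that, following the allenai/specter2 model card: the base model comes from transformers, the task-specific adapter from the adapters package. The paper list is illustrative:

```python
# SPECTER2 base model plus the "proximity" adapter, per the model card.
import torch
from transformers import AutoTokenizer
from adapters import AutoAdapterModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoAdapterModel.from_pretrained("allenai/specter2_base")
model.load_adapter("allenai/specter2", source="hf",
                   load_as="proximity", set_active=True)

papers = [  # illustrative metadata
    {"title": "Attention Is All You Need", "abstract": "The dominant ..."},
    {"title": "BERT", "abstract": "We introduce a new language ..."},
]
# SPECTER2 embeds title [SEP] abstract.
texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(texts, padding=True, truncation=True,
                   return_tensors="pt", max_length=512)
with torch.no_grad():
    out = model(**inputs)
embeddings = out.last_hidden_state[:, 0, :]  # CLS-token embedding per paper
```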
4 Tools
ChromaDB is a vector database with a focus on search and retrieval. I used it to store the vector embeddings behind the "similar posts" feature on this site, and I can report it was incredibly simple for my use case and scales well to thousands of documents at least. It is backed by SQLite.
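The whole round trip is a few lines. A minimal sketch; the collection name and toy two-dimensional vectors are illustrative, and real code would pass in actual model output:

```python
# Store embeddings in a persistent (SQLite-backed) ChromaDB collection,
# then query for nearest neighbours.
import chromadb

client = chromadb.PersistentClient(path="./chroma")
posts = client.get_or_create_collection("posts")

posts.add(
    ids=["rag-notes.md", "vector-db.md"],
    embeddings=[[0.1, 0.9], [0.2, 0.8]],  # real code: model.encode(...)
    documents=["Retrieval-augmented generation ...", "Vector databases ..."],
)

hits = posts.query(query_embeddings=[[0.15, 0.85]], n_results=2)
print(hits["ids"][0])  # ids of the nearest posts, best first
```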
5 Internet search
5.1 Commercial services searching the internet
See internet search.
5.2 Free/FOSS-ish internet search
That is, something like Perplexity, but running locally.
- nilsherzig/LLocalSearch: a completely locally running search aggregator using LLM agents. The user can ask a question, the system uses a chain of LLMs to find the answer, and the user can watch the agents' progress along the way. No OpenAI or Google API keys are needed.
- nashsu/FreeAskInternet: a completely free, private, and locally running search aggregator and answer generator using multiple LLMs, with no GPU needed. The user can ask a question and the system will run a multi-engine search, feed the results to an LLM, and generate an answer grounded in them. It's all free to use.