AI search

Retrieval-augmented generation for the working schlub

February 6, 2024 — April 18, 2025

boring
computers are awful together
faster pussycat
incentive mechanisms
mind
NLP
provenance
search
wonk

Placeholder on AI search. Retrieval-augmented generation etc. Vector databases.

Theory and practice.

1 This blog has AI indexing

Those “suspiciously similar posts” links at the top of the page are generated by an AI model that indexes the text and gives each post a topic embedding. This is a naïve but easy and effective way to find similar posts.

Non-obvious discovery: I auditioned two embedding models for the task, nomic-ai/nomic-embed-text-v1.5 and mixedbread-ai/mxbai-embed-large-v1. The latter gave more intuitively correct results even though it ignores most of the post, looking only at the first 512 tokens, which is basically the title, categories, and a paragraph or two. nomic's 8192-token window covers most of a post, but that seemed to produce generally worse results. I am also curious about the SPECTER2 embeddings, which are apparently good for scientific text, but they have a rather different API so I didn't hot-swap them in for testing.
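Concretely, the recipe looks something like the sketch below. This is not the actual site script, just a minimal illustration using the mxbai model via the standard sentence-transformers API; the post slugs and texts are made-up stand-ins.

```python
# Sketch of "similar posts": embed each post once, then rank neighbours
# by cosine similarity. Slugs and texts are hypothetical stand-ins.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

posts = {
    "ai-search": "AI search. Retrieval-augmented generation, vector databases...",
    "nlp": "Natural language processing. Language models, tokenisation...",
    "search": "Search. Indexing, ranking, relevance...",
}
slugs = list(posts)
# With normalised embeddings, the dot product *is* cosine similarity.
emb = model.encode([posts[s] for s in slugs], normalize_embeddings=True)
sims = emb @ emb.T

for i, slug in enumerate(slugs):
    order = np.argsort(-sims[i])  # indices in descending similarity
    neighbours = [slugs[j] for j in order if j != i][:2]
    print(slug, "->", neighbours)
```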

The Algolia search that provides the search box at the top of the page is presumably similar in terms of the tech that makes it go. However, it is run by a third party who serves the content from their own servers, so I cannot really speak to what they are doing behind the scenes.

The script is open source; you can download it as similar_posts_static_site.py.

2 Retrieval-augmented generation theory

TBD

3 Interesting embeddings

3.1 Generic text embeddings

3.2 Specialised for scientific text

SPECTER2: Adapting scientific document embeddings to multiple fields and task formats:

Models like SPECTER and SciNCL are adept at embedding scientific documents as they are specifically trained so that papers close to each other in the citation network are close in the embedding space as well. For each of these models, the input paper text is represented by a combination of its title and abstract. SPECTER, released in 2020, supplies embeddings for a variety of our offerings at Semantic Scholar - user research feeds, author name disambiguation, paper clustering, and many more! Along with SPECTER, we also released SciDocs - a benchmark of 7 tasks for evaluating the efficacy of scientific document embeddings. SciNCL, which came out last year, improved upon SPECTER by relying on nearest-neighbour sampling rather than hard citation links to generate training examples.

This model and its ilk are squarely targeted at research discovery, and they are good enough at it that you might argue they have "solved" the knowledge-topology problem for scientific papers.

I implemented a search engine for ICLR 2025 using the SPECTER2 embeddings and was impressed with the quality of the results. Note that the API is a little different from the default Hugging Face API used by mxbai et al.; you need the adapters library.
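For reference, the usage looks roughly like this, following the allenai/specter2 model card (the example paper is a stand-in, and the details may drift across adapters library versions):

```python
# Sketch of SPECTER2 document embedding via the adapters library.
from transformers import AutoTokenizer
from adapters import AutoAdapterModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoAdapterModel.from_pretrained("allenai/specter2_base")
# The "allenai/specter2" adapter is the proximity adapter, intended
# for retrieval and document-similarity tasks.
model.load_adapter("allenai/specter2", source="hf",
                   load_as="specter2", set_active=True)

papers = [{"title": "A stand-in paper title",
           "abstract": "A stand-in abstract."}]
# SPECTER-style input: title [SEP] abstract.
text_batch = [p["title"] + tokenizer.sep_token + (p.get("abstract") or "")
              for p in papers]
inputs = tokenizer(text_batch, padding=True, truncation=True,
                   return_tensors="pt", max_length=512)
output = model(**inputs)
# The [CLS] token embedding serves as the document embedding.
embeddings = output.last_hidden_state[:, 0, :]
```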

4 Tools

ChromaDB is a vector database with a focus on search and retrieval. I used it to store the vector embeddings behind the "similar posts" feature on this site, and I can report it was incredibly simple for my use case, and scales well to thousands of documents at least. It is backed by SQLite.
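The whole round trip is a few lines. A minimal sketch, assuming a reasonably recent chromadb (the PersistentClient API, 0.4+); the ids and three-dimensional toy vectors stand in for real posts and embeddings:

```python
import chromadb

# Persistent local client; data lands in an on-disk, SQLite-backed store.
client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection(name="posts")

# Store precomputed embeddings alongside ids and metadata.
collection.add(
    ids=["ai-search", "nlp"],
    embeddings=[[0.1, 0.2, 0.3], [0.2, 0.1, 0.4]],  # toy vectors
    metadatas=[{"title": "AI search"}, {"title": "NLP"}],
)

# Nearest neighbours to a query vector.
results = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=2)
print(results["ids"], results["distances"])
```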

5 Incoming
