Research discovery, synthesis planning

Has someone answered that question I have not worked out how to ask yet?

2019-01-22 — 2025-05-04

academe

collective knowledge

faster pussycat

how do science

institutions

mind

networks

provenance

sociology

wonk

Recommender systems for academics are hard. I suspect they’re tougher than regular ones because the content should be new and hard to relate to existing stuff. Finding connections is a publishable result in itself. We could imagine this problem as a hard-nosed applied version of the vaguely woo-woo idea of knowledge topology. As such it might have been effectively technically solved by recent advances in topic embeddings and retrieval-augmented generation models.

Interactions with peer review systems are complicated. Could this integrate with peer review in a useful way? Can we have services like Canopy, Pinterest, or keen for scientific knowledge? How can we balance recall and precision for the needs of academics?

The information environment is challenging. I am fond of Elizabeth Van Nostrand’s summary:

assessing a work often requires the same skills/knowledge you were hoping to get from said work. You can’t identify a good book in a field until you’ve read several. But improving your starting place does save time, so I should talk about how to choose a starting place.

One difficulty is that this process is heavily adversarial. A lot of people want you to believe a particular thing, and a larger set don’t care what you believe as long as you find your truth via their Amazon affiliate link […] The latter group fills me with anger and sadness; at least the people trying to convert you believe in something (maybe even the thing they’re trying to convince you of). The link farmers are just polluting the commons.

My paraphrase: Knowledge discovery would likely be intrinsically difficult in a hypothetical beneficent world with great sharing mechanisms, but the economics of the attention economy, advertising, and weaponised media mean we should be wary of the mechanisms currently available.

If I accept this, then the corollary is that my scattershot approach to link sharing might detract from the value of this blog to the wider world.

1 Theory

José Luis Ricón, a.k.a. Nintil, wonders about A better Google Scholar based on his experience trying to create a better Meta Scholar for Syntopic reading. Robin Hanson, of course, has much to say on potentially better mechanism design for scientific discovery. I have qualms about his implied cash rewards system crowding out reputational awards; I think there is merit in using non-cash currency in that particular economy, but I’m open to being persuaded.

Of course, LLM-based methods are prominent now. Modern information-retrieval theory to come.

2 “Deep Research” projects

TODO: mention elicit, openai, perplexity etc

2.1 FutureHouse Platform

FutureHouse Platform: Superintelligent AI Agents for Scientific Discovery

Fresh off the rack and interesting-looking. It not only synthesises existing literature but also does gap research, identifies comparative advantage, etc.

2.2 Elicit

Elicit: The AI Research Assistant exploits large language models to solve this problem:

Elicit uses language models to help you automate research workflows, like parts of literature review.

Elicit can find relevant papers without perfect keyword match, summarise takeaways from the paper specific to your question, and extract key information from the papers.

While answering questions with research is the main focus of Elicit, there are also other research tasks that help with brainstorming, summarisation, and text classification.

2.3 OpenAI Deep Research

Introducing Deep Research from OpenAI:

An agent that uses reasoning to synthesise large amounts of online information and complete multi-step research tasks for you.

2.4 Perplexity AI

2.5 Consensus

Consensus: AI-powered Academic Search Engine

Consensus is an academic search engine powered by AI. Students and researchers at over 5,000 universities worldwide already research with Consensus. We partner with libraries, labs, and universities to provide the best academic research tools to students and faculty.

3 Tooling

3.1 AI2 models

Figure 2: Overview of the OpenScholar pipeline: OpenScholar leverages a large-scale datastore consisting of 45 million papers and uses a custom-trained retriever, reranker, and 8B parameter language model to answer questions based on up-to-date scientific literature (from Asai et al. 2024)

Ai2’s OpenScholar (“Scientific literature synthesis with retrieval-augmented language models”; source code available) provides infrastructure for search via vector embeddings specialised for science papers.

Dissatisfied with the crappy search for ICLR, I built a quick semantic search using this Ai2 system. It was really easy, and really good! You will find it here if you want to install it on your machine: danmackinlay/openreview_finder. It takes about 10 minutes to download and index the ICLR 2025 papers.
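The core of this kind of semantic search is a nearest-neighbour lookup in embedding space. Here is a minimal sketch, assuming the paper embeddings have already been computed by some model. The three-dimensional vectors, the titles and the function names are made up for illustration; none of this is taken from openreview_finder or from an Ai2 model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k titles whose embeddings are closest to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]


# Hypothetical toy index mapping title -> embedding.
TOY_INDEX = {
    "Diffusion models for images": [0.9, 0.1, 0.0],
    "Score-based generative modelling": [0.8, 0.2, 0.1],
    "Factor graphs and message passing": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(top_k([1.0, 0.0, 0.0], TOY_INDEX))
```

A real index would swap the brute-force sort for an approximate nearest-neighbour structure, but the ranking idea is the same.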

Scientific progress hinges on our ability to find, synthesise, and build on relevant knowledge from the scientific literature. However, the exponential growth of this literature—with millions of papers now published each year—has made it increasingly difficult for scientists to find the information they need or even stay abreast of the latest findings in a single subfield.

To help scientists effectively navigate and synthesise scientific literature, we introduce Ai2 OpenScholar—a collaborative effort between the University of Washington and the Allen Institute for AI. OpenScholar is a retrieval-augmented language model (LM) designed to answer user queries by first searching for relevant papers in the literature and then generating responses grounded in those sources. Below are some examples:

On ScholarQABench, our new benchmark of open-ended scientific questions, OpenScholar-8B sets the state of the art on factuality and citation accuracy. For instance, on biomedical research questions, GPT-4o hallucinated more than 90% of the scientific papers that it cited, whereas OpenScholar-8B—by construction—remains grounded in real retrieved papers. To evaluate the effectiveness of OpenScholar in a real-world setup, we recruited 20 scientists working in computer science, biomedicine, and physics, and asked them to evaluate OpenScholar responses against expert-written answers. Across these three scientific disciplines, OpenScholar-8B’s responses were considered more useful than expert-written answers for the majority of questions.

OpenScholar is a research prototype, and it is just our first step toward building AI systems that can effectively assist scientists and accelerate scientific discovery. To support research in this direction, we have open-sourced all of our code, LM, retriever and re-ranker checkpoints, retrieval index, and data, including the training data for our language model and retriever, our OpenScholar datastore of academic papers, and the evaluation data in ScholarQABench. To our knowledge, this is the first open release of a complete pipeline for a scientific assistant LM—from data to training recipes to model checkpoints—and we’re excited to see how the community builds upon it.

Check out the Ai2 OpenScholar Demo at openscholar.allen.ai!

While the original OpenScholar model/code/data/results cover multiple scientific domains, this demo is currently limited to questions and papers about computer science. We hope to expand support to other scientific fields soon. We are also preparing to publicly release the retrieval service that backs our demo as a separate public API, which will provide full-text search over open-access papers available through Ai2’s Semantic Scholar API.

Our OpenScholar-8B (OS-8B) system comprises the following components:

OpenScholar Datastore: A collection of more than 45M papers from Semantic Scholar and ~250M corresponding passage embeddings. The underlying data comes from an updated version of peS2o (Soldaini et al., 2024) that consists of papers up to October 2024.

Specialised Retrievers and Rerankers: These tools are trained specifically to identify relevant passages from our scientific literature datastore.

Specialised 8B Language Model: An 8B-parameter LM optimised for scientific literature synthesis tasks, balancing performance with computational efficiency. To train this, we fine-tune Llama 3.1 8B on synthetic data generated from our iterative self-feedback generation pipeline, described below.

Iterative Self-Feedback Generation: At inference, we use iterative self-feedback to refine model outputs through natural language feedback. Each iteration involves additionally retrieving more papers, allowing us to improve quality and close citation gaps.

Our datastore, retriever and reranking models, and self-feedback generation pipeline can also be applied on top of other off-the-shelf LMs. Below, we discuss results on both OS-8B, which uses our specialised 8B model, and on OS-GPT4o, which uses GPT-4o as the base LM.
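To make the control flow concrete, here is a rough sketch of a retrieve, rerank, generate, self-feedback loop as I read the description above. It is not OpenScholar's code: every function name and signature is my own assumption, and the toy stand-ins at the bottom exist only so the loop can run without a model or a datastore.

```python
def answer_with_self_feedback(query, retrieve, rerank, generate, critique, max_rounds=3):
    """Draft an answer, then refine it with feedback-driven extra retrieval."""
    passages = rerank(query, retrieve(query))
    draft = generate(query, passages, feedback=None)
    for _ in range(max_rounds):
        feedback = critique(query, draft, passages)
        if feedback is None:  # the critic has nothing left to fix
            break
        # Each refinement round retrieves additional passages guided by the feedback.
        extra = rerank(query, retrieve(feedback))
        passages = passages + [p for p in extra if p not in passages]
        draft = generate(query, passages, feedback=feedback)
    return draft, passages


# --- Hypothetical toy components, standing in for real models ---
CORPUS = [
    "factor graphs unify inference",
    "belief propagation on factor graphs",
    "diffusion models generate images",
]


def toy_retrieve(text):
    """Keyword overlap in place of a trained dense retriever."""
    words = set(text.lower().split())
    return [p for p in CORPUS if words & set(p.split())]


def toy_rerank(query, passages):
    """Order passages by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(passages, key=lambda p: len(q & set(p.split())), reverse=True)


def toy_generate(query, passages, feedback=None):
    """Pretend LM: just report how many sources it was given."""
    return f"{len(passages)} sources cited"


def toy_critique(query, draft, passages):
    """Pretend critic: ask for more evidence while fewer than two passages are cited."""
    return "belief propagation" if len(passages) < 2 else None


if __name__ == "__main__":
    print(answer_with_self_feedback(
        "unify inference", toy_retrieve, toy_rerank, toy_generate, toy_critique
    ))
```

Passing the components in as callables mirrors the claim that the datastore, retriever and feedback loop can sit on top of different base LMs.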

Ai2 ScholarQA is meant to satisfy literature searches that require insights from multiple relevant documents, and synthesise those insights into a comprehensive report. After receiving a query, the system first queries the index for the top k passages. These passages are further re-ranked with a pretrained transformer model and the top 50 candidates are retained for further processing. The answer generation is a 3-step process driven by prompts to an LLM:

Quote extraction: The top re-ranked passages are fed to the LLM to select the most relevant quotes for answering the user query. This improves the precision of the candidate passages and reduces context overload in subsequent steps.

Answer outline and clustering: The quotes are then provided to the LLM to generate a plan, which includes section headers for the report along with the relevant quotes to be included in each section. The format of each section can be either a paragraph or a bulleted list. Paragraph sections can convey nuanced relations between different papers, while bulleted list sections enumerate closely related papers, such as models, datasets, or interactive systems for the same tasks.

Report generation: The section headers and quotes are finally used to generate the report. The report is generated one section at a time conditioned on the text from previously generated sections. The section text is accompanied by a TLDR summary at the top along with attribution to the quotes and their papers for further analysis.
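The three prompt-driven steps above can be sketched as a small pipeline. This is my own guess at the shape, not ScholarQA's implementation: `llm`, `rerank_score`, the task names and the plan format are all hypothetical, and `fake_llm` is a stub so the example runs without any model.

```python
def scholar_qa_report(query, passages, llm, rerank_score, keep=50):
    """Rerank passages, then run quote extraction, outlining and section writing."""
    top = sorted(passages, key=lambda p: rerank_score(query, p), reverse=True)[:keep]
    # Step 1: quote extraction narrows the passages to the relevant quotes.
    quotes = llm("extract_quotes", query=query, passages=top)
    # Step 2: the outline assigns quotes to sections, each a paragraph or a list.
    plan = llm("outline", query=query, quotes=quotes)
    # Step 3: sections are written one at a time, conditioned on earlier ones.
    sections = []
    for header, fmt, section_quotes in plan:
        text = llm(
            "write_section",
            header=header,
            fmt=fmt,
            quotes=section_quotes,
            previous=list(sections),
        )
        sections.append((header, text))
    return sections


def fake_llm(task, **kw):
    """Hypothetical stand-in for an LLM call, dispatching on a task name."""
    if task == "extract_quotes":
        return [p for p in kw["passages"] if "graph" in p]
    if task == "outline":
        return [
            ("Background", "paragraph", kw["quotes"][:1]),
            ("Methods", "list", kw["quotes"][1:]),
        ]
    if task == "write_section":
        return (
            f"{kw['header']} ({kw['fmt']}): {len(kw['quotes'])} quote(s), "
            f"{len(kw['previous'])} prior section(s)"
        )
    raise ValueError(f"unknown task: {task}")


def overlap_score(query, passage):
    """Toy reranker score: number of words shared with the query."""
    return len(set(query.split()) & set(passage.split()))


if __name__ == "__main__":
    docs = ["factor graph intro", "graph neural nets", "unrelated text"]
    for header, text in scholar_qa_report("factor graph", docs, fake_llm, overlap_score):
        print(header, "->", text)
```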

4 Discovery projects

4.1 Scholar Inbox

Scholar Maps by Scholar Inbox

Scholar Inbox is a personal paper recommender which enables researchers to stay up-to-date with the most relevant progress in their field based on their personal research interests. Scholar Inbox is free of charge and daily indexes all of arXiv, bioRxiv, medRxiv and ChemRxiv as well as several open access proceedings in computer science.

Background here at Autonomous Vision Blog: Scholar Inbox.

This is an amazing service. They have some very thoughtful details, such as allowing you to filter by the next conference you are attending etc.

4.2 SciLire

SciLire: Science Literature Review AI Tool from CSIRO.

Just saw a tech demo of this tool by my very own colleagues; it looks interesting for high-speed AI-augmented literature review.

4.3 ResearchRabbit

ResearchRabbit:

Spotify for Papers: Just like in Spotify, you can add papers to collections. ResearchRabbit learns what you love and improves its recommendations!

Personalised Digests: Keep up with the latest papers related to your collections! If we’re not confident something’s relevant, we don’t email you—no spam!

Interactive Visualisations: Visualise networks of papers and co-authorships. Use graphs as new “jumping off points” to dive even deeper!

Explore Together: Collaborate on collections, or help kickstart someone’s search process! And leave comments as well!

4.4 scite

scite: see how research has been cited

Citations are classified by a deep learning model that is trained to identify three categories of citation statements: those that provide contrasting or supporting evidence for the cited work, and others, which mention the cited study without providing evidence for its validity. Citations are classified by rhetorical function, not positive or negative sentiment.

Citations are not classified as supporting or contrasting by positive or negative keywords.

A Supporting citation can have a negative sentiment and a Contrasting citation can have a positive sentiment. Sentiment and rhetorical function are not correlated.

Supporting and Contrasting citations do not necessarily indicate that the exact set of experiments was performed. For example, if a paper finds that drug X causes phenomenon Y in mice and a subsequent paper finds that drug X causes phenomenon Y in yeast but both come to this conclusion with different experiments—this would be classified as a supporting citation, even though identical experiments were not performed.

Citations that simply use the same method, reagent, or software are not classified as supporting. To identify methods citations, you can filter by the section.

For full technical details including exactly how we do classification, what classifications and classification confidence mean, please read our recent publication describing how scite was built (Nicholson et al. 2021).
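For downstream use, what matters is the shape of the output rather than the classifier itself. Here is a small sketch of how classified citation statements might be stored, filtered by section and tallied. The record fields, the label names and the `tally` helper are my assumptions for illustration, not scite's data model or API.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """The three rhetorical functions described above."""
    SUPPORTING = "supporting"
    CONTRASTING = "contrasting"
    MENTIONING = "mentioning"


@dataclass(frozen=True)
class CitationStatement:
    citing_doi: str  # hypothetical identifier of the citing paper
    section: str     # section of the citing paper containing the statement
    kind: Kind       # label assigned by some upstream classifier


def tally(statements, exclude_sections=()):
    """Count statements per kind, skipping the named sections (e.g. methods)."""
    return Counter(s.kind for s in statements if s.section not in exclude_sections)


if __name__ == "__main__":
    example = [
        CitationStatement("10.1/a", "results", Kind.SUPPORTING),
        CitationStatement("10.1/b", "methods", Kind.MENTIONING),
        CitationStatement("10.1/c", "discussion", Kind.CONTRASTING),
    ]
    print(tally(example, exclude_sections={"methods"}))
```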

4.5 Connected Papers

Connected Papers

To create each graph, we analyse an order of ~50,000 papers and select the few dozen with the strongest connections to the origin paper.

In the graph, papers are arranged according to their similarity. That means that even papers that do not directly cite each other can be strongly connected and very closely positioned. Connected Papers is not a citation tree.

Our similarity metric is based on the concepts of Co-citation and Bibliographic Coupling. According to this measure, two papers that have highly overlapping citations and references are presumed to have a higher chance of discussing a related subject matter.

Our algorithm then builds a Force Directed Graph to visually cluster similar papers together and push less similar papers away from each other. Upon node selection, we highlight the shortest path from each node to the origin paper in similarity space.

Our database is connected to the Semantic Scholar Paper Corpus (licensed under ODC-BY). Their team has done an amazing job of compiling hundreds of millions of published papers across many scientific fields.
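Co-citation and bibliographic coupling are easy to compute from a citation graph. Connected Papers does not spell out its exact formula here, so the sketch below is an assumption on my part: it averages the Jaccard overlap of two papers' reference lists (coupling) with the Jaccard overlap of the sets of papers citing them (co-citation). The toy graph and all names are hypothetical.

```python
def citers(refs, paper):
    """Set of papers whose reference lists include `paper`."""
    return {p for p, cited in refs.items() if paper in cited}


def jaccard(x, y):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def similarity(refs, a, b):
    """Average of bibliographic coupling and co-citation overlap (assumed metric)."""
    coupling = jaccard(refs.get(a, set()), refs.get(b, set()))
    co_citation = jaccard(citers(refs, a), citers(refs, b))
    return (coupling + co_citation) / 2


# Toy citation graph: paper -> set of papers it references.
REFS = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"A", "B"},
    "D": {"A"},
    "X": set(),
    "Y": set(),
    "Z": set(),
}

if __name__ == "__main__":
    # A and B never cite each other, yet they score as similar.
    print(similarity(REFS, "A", "B"))
```

This shows the point made above: two papers can end up close together without any direct citation between them.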

Also:

You can use Connected Papers to:

Get a visual overview of a new academic field: Enter a typical paper and we’ll build you a graph of similar papers in the field. Explore and build more graphs for interesting papers you find — soon you’ll have a real, visual understanding of the trends, popular works, and dynamics of the field you’re interested in.

Make sure you haven’t missed an important paper: In some fields like Machine Learning, so many new papers are published it’s hard to keep track. With Connected Papers you can just search and visually discover important recent papers. No need to keep lists.

Create the bibliography for your thesis: Start with the references that you will definitely want in your bibliography and use Connected Papers to fill in the gaps and find the rest!

Discover the most relevant prior and derivative works: Use our Prior Works view to find important ancestor works in your field of interest. Use our Derivative Works view to find literature reviews of the field, as well as recently published State of the Art that followed your input paper.

4.6 papr

papr — “tinder for preprints”

We all know the peer review system is hopelessly overmatched by the deluge of papers coming out. papr reviews use the wisdom of the crowd to quickly filter papers considered interesting and accurate. Add your quick judgements about papers to thousands of other scientists’ reviews around the world.

You can use the app to keep track of interesting papers and share them with your friends. Spend 30 minutes quickly sorting through the latest literature, and papr will keep track of those papers for future review.

With papr, you can filter to only see papers that match your interests, keyword matches, or papers highly rated by others. Ensure your literature review is productive and efficient.

I appreciate the quality problem is important, but I am unconvinced by their topic keywords idea. Quality is only half the problem for me, and the topic-filtering problem looks harder.

4.7 Daily papers

Daily Papers seems similar to arxiv-sanity, but it is more actively maintained and less coherently explained. Its paper rankings seem to incorporate… Twitter hype?

Keep track of arXiv papers and the tweet mini-commentaries that your friends are discussing on Twitter.

Somehow, some researchers have time for Twitter. The opinions of such multitasking prodigies are probably worthy of note.

Journal / Author Name Estimator identifies potential collaborators and journals by semantic similarity search on the abstract
Grant matchmaker suggests people with similar grants inside the USA’s NIH

4.8.3 arxiv sanity

~~Arxiv-sanity~~

Aimed to prioritise the arXiv paper-publishing firehose so you can discover papers of interest to you, at least if those interests are in machine learning. Now defunct.

Arxiv Sanity Preserver

Built by @karpathy to accelerate research. Serving the last 26179 papers from cs.[CV|CL|LG|AI|NE]/stat.ML

Includes Twitter-hype sorting, TF-IDF clustering, and other such basic but important baby steps towards web2.0 style information consumption.

The servers are overloaded of late, possibly because of the unfavourable scaling of all the SVMs it uses, the continued growth of Arxiv, or epidemic addiction to intermittent variable rewards amongst machine learning researchers. That last reason is why I have opted out of checking for papers.

I could run my own installation — it is open source — but the download and processing requirements are prohibitive. Arxiv is big and fast.
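For reference, the TF-IDF part is simple enough to sketch in pure Python. This is not arxiv-sanity's code, and it leaves out the SVM step entirely; the tokenisation, the unsmoothed `log(n / df)` weighting and the toy abstracts are all my own simplifying assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document to a sparse {term: tf * idf} dictionary."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar(docs, i):
    """Index of the document most similar to docs[i] under TF-IDF cosine."""
    vecs = tfidf_vectors(docs)
    scores = [(cosine(vecs[i], v), j) for j, v in enumerate(vecs) if j != i]
    return max(scores)[1]


ABSTRACTS = [
    "variational inference for factor graphs",
    "message passing on factor graphs",
    "image generation with diffusion models",
]

if __name__ == "__main__":
    print(most_similar(ABSTRACTS, 0))
```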

5 Paper analysis/annotation

Academic reading workflow problem?

6 Finding copies

Unpaywall and oaDOI seem to be indices of non-paywalled preprints of paywalled articles. oaDOI is a website; Unpaywall is a browser extension. CiteSeer is another such index. There are also shadow libraries.
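Unpaywall also has a REST API, so the lookup can be scripted. The sketch below is written from my memory of the v2 API: the URL pattern and the `best_oa_location`, `url_for_pdf` and `url` field names should be checked against the current documentation. The helper names are my own, and `find_copy` needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API = "https://api.unpaywall.org/v2/"  # assumed base URL of the v2 API


def unpaywall_url(doi, email):
    """Build the lookup URL; the API asks for a contact email as a parameter."""
    return f"{API}{quote(doi)}?email={quote(email)}"


def best_open_copy(record):
    """Pick a URL for an open copy from an Unpaywall-style record, if any."""
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf") or location.get("url")


def find_copy(doi, email):
    """Query the API for a DOI and return the best open URL, or None."""
    with urlopen(unpaywall_url(doi, email)) as response:
        return best_open_copy(json.load(response))


if __name__ == "__main__":
    # Hand-written record in the assumed response shape, so no network is needed.
    sample = {"best_oa_location": {"url": "https://example.org/landing", "url_for_pdf": None}}
    print(best_open_copy(sample))
```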

7 Trend analysis

I’ll be honest, I wanted this because a reviewer claimed that something was “outdated,” and I got angry, not only because it is a useless criticism (“incorrect” is valid) but also because I suspected that the reviewer, living in their own filter bubble, was wrong about how hip the topic was. So I wasted an hour or two plotting the number of papers on the topic over time, and then I realised that I needed to do some mindfulness meditation instead.

The topic, by the way, was factor graphs. The plots I generated before going off to meditate are an interesting test case for the various tools.
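The counting behind such a plot is tiny. A sketch, assuming we already have the publication year of each paper matching the topic and of each paper in some reference corpus. I normalise by the yearly total because raw counts mostly track the overall growth of the literature. The function name and the numbers in the example are invented.

```python
from collections import Counter


def yearly_share(topic_years, all_years):
    """Fraction of each year's papers that are on the topic."""
    topic = Counter(topic_years)
    total = Counter(all_years)
    return {year: topic[year] / total[year] for year in sorted(total)}


if __name__ == "__main__":
    # Invented data: 2 of 4 papers in 2019 and 1 of 3 in 2020 match the topic.
    print(yearly_share([2019, 2019, 2020], [2019] * 4 + [2020] * 3))
```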

8 Incoming

8.1 Incoming AI-assist

9 Reading groups and co-learning

A great way to get things done. How can we make reading together easier?

The Journal Club is a web-based tool designed to help organise journal clubs, aka reading groups. A journal club is a group of people coming together at regular intervals, e.g., weekly, to critically discuss research papers. The Journal Club makes it easy to keep track of information about the club’s meeting time and place as well as the list of papers coming up for discussion, papers that have been discussed in previous meetings, and papers proposed by club members for future discussion.

10 References

Asai, He, Shao, et al. 2024. “OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs.”

Beltagy, Lo, and Cohan. 2019. “SciBERT: A Pretrained Language Model for Scientific Text.”

Borondo, Borondo, Rodriguez-Sickert, et al. 2014. “To Each According to Its Degree: The Meritocracy and Topocracy of Embedded Markets.” Scientific Reports.

Cohan, Feldman, Beltagy, et al. 2020. “SPECTER: Document-Level Representation Learning Using Citation-Informed Transformers.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Coscia and Vandeweerdt. 2022. “Posts on Central Websites Need Less Originality to Be Noticed.” Scientific Reports.

Gao, Ju, Jiang, et al. 2025. “A Semantic Search Engine for Mathlib4.”

Kang, Zhang, Jiang, et al. 2024. “Taxonomy-Guided Semantic Indexing for Academic Paper Search.”

Nicholson, Mordaunt, Lopez, et al. 2021. “Scite: A Smart Citation Index That Displays the Context of Citations and Classifies Their Intent Using Deep Learning.” Quantitative Science Studies.

Shen, Lin, Zhang, et al. 2023. “RTVis: Research Trend Visualization Toolkit.”

Singh, D’Arcy, Cohan, et al. 2022. “SciRepEval: A Multi-Format Benchmark for Scientific Document Representations.”

Wang, Fu, Du, et al. 2023. “Scientific Discovery in the Age of Artificial Intelligence.” Nature.