AI Agents for scientific knowledge discovery and generation

Outsourcing knowledge of base reality to bots

2023-01-22 — 2025-10-19

Wherein a survey of AI agents for scientific discovery is presented, and OpenScholar’s 45 million‑paper datastore with citation‑backed retrieval is noted as enabling evidence‑grounded synthesis.

academe
collective knowledge
faster pussycat
how do science
institutions
mind
networks
provenance
sociology
wonk

A list of attempts and approaches to make knowledge management and discovery, for science in particular, work via generative AI.

1 FutureHouse Platform

Fresh off the rack — it looks interesting. It synthesizes existing literature and identifies research gaps and areas of comparative advantage.

2 AI2

Figure 2: Overview of the OpenScholar pipeline: OpenScholar leverages a large-scale datastore consisting of 45 million papers and uses a custom-trained retriever, reranker, and 8B parameter language model to answer questions based on up-to-date scientific literature (from Asai et al. 2024)

Ai2 OpenScholar: Scientific literature synthesis with retrieval-augmented language models provides infrastructure for search via vector embeddings specialised for science papers.

Dissatisfied with the crappy search for ICLR, I built a quick semantic search using Ai2’s system. It was really easy and worked really well! You can find it here: danmackinlay/openreview_finder. It takes about 10 minutes to download and index the ICLR 2025 papers.
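For concreteness, a minimal version of that kind of semantic index looks something like the sketch below. This is not the actual openreview_finder code; it assumes the sentence-transformers library and the SPECTER paper-embedding model (Cohan et al. 2020), with a toy in-memory corpus standing in for the real paper dump.

```python
# Minimal semantic search over paper abstracts (sketch, not the actual
# openreview_finder implementation). Assumes sentence-transformers and
# the SPECTER model for scientific documents.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("allenai-specter")

papers = [
    {"title": "Paper A", "abstract": "We study retrieval-augmented generation ..."},
    {"title": "Paper B", "abstract": "A benchmark for scientific agents ..."},
]

# SPECTER expects "title [SEP] abstract"-style inputs.
corpus = [p["title"] + " [SEP] " + p["abstract"] for p in papers]
embeddings = model.encode(corpus, normalize_embeddings=True)

def search(query: str, k: int = 5):
    """Return the k papers whose embeddings are closest to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q  # cosine similarity, since vectors are unit-norm
    top = np.argsort(-scores)[:k]
    return [(papers[i]["title"], float(scores[i])) for i in top]

print(search("citation-grounded literature synthesis"))
```

In the real tool the corpus is the set of ICLR 2025 submissions pulled from OpenReview, and the index is persisted rather than rebuilt per query.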

Ai2’s current stack for scientific workflows seems to centre on Asta and OpenScholar, two complementary efforts that lean on large scholarly corpora to keep answers grounded and citable. Asta acts as a broader platform (agents, benchmarks, developer resources), while OpenScholar focuses on retrieval-augmented, citation-backed literature synthesis with strong evaluation results.

2.1 Asta: USPs in practice

  • Agentic research ecosystem: an AI research assistant, the AstaBench suite, and developer resources designed for transparency, reproducibility, and open development (blog post).
  • Trust-first framing: launched as a “standard for trustworthy AI agents in science,” emphasizing verifiable, source-traceable outputs and an open framework.
  • Practical capabilities: the assistant offers LLM-powered paper discovery, literature synthesis with clickable citations, and early-stage data analysis (in beta for partners).
  • Benchmarks with receipts: AstaBench provides rigorous, holistic evaluation of scientific agents, aiming to clarify performance beyond anecdotes.
  • Scholar QA has been folded into Asta (now “Summarize literature”), signalling a single, unified assistant surface.

2.2 OpenScholar: USPs in practice

(Asai et al. 2024)

  • Purpose-built RAG for science: OpenScholar answers research questions by retrieving passages from a very large scientific datastore and synthesizing citation-backed responses, with an iterative self-feedback loop to improve quality (a simplified sketch of this loop follows the list).
  • Scale of sources: The OpenScholar DataStore (OSDS) spans tens of millions of open-access papers and hundreds of millions of passage embeddings, enabling broad coverage across domains.
  • Evaluation signal: As far as we can tell, OpenScholar reports higher correctness than larger closed-source models on ScholarQABench while keeping citation accuracy at expert-like levels and reducing fabricated references.
  • Open ecosystem: The code, models, datastore, and a public demo are open-sourced, which makes it straightforward to inspect and extend the full pipeline.
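To make the retrieve-synthesize-refine pattern concrete, here is a heavily simplified sketch of such a loop. The function names and the feedback criterion are placeholders, not OpenScholar's actual interfaces; see Asai et al. (2024) for the real pipeline.

```python
# Sketch of a retrieval-augmented answer loop with self-feedback, in the
# spirit of OpenScholar (Asai et al. 2024). `retrieve`, `rerank`, and `llm`
# are hypothetical stand-ins, not OpenScholar's real API.

def retrieve(query: str, k: int = 20) -> list[str]:
    """Return k candidate passages from a paper datastore (placeholder)."""
    ...

def rerank(query: str, passages: list[str], k: int = 8) -> list[str]:
    """Keep the k passages most relevant to the query (placeholder)."""
    ...

def llm(prompt: str) -> str:
    """Call a language model (placeholder)."""
    ...

def answer(query: str, rounds: int = 2) -> str:
    passages = rerank(query, retrieve(query))
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    draft = llm(f"Answer with citations [n] to the passages.\n{context}\n\nQ: {query}")
    for _ in range(rounds):
        # Self-feedback: ask the model what is missing, retrieve more, revise.
        feedback = llm(f"List unsupported claims or gaps in:\n{draft}")
        if "none" in feedback.lower():
            break
        extra = rerank(feedback, retrieve(feedback))
        context += "\n\n" + "\n\n".join(extra)
        draft = llm(
            f"Revise the answer using only the passages.\n{context}\n\n"
            f"Q: {query}\nDraft: {draft}"
        )
    return draft
```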

2.3 How they fit together

Asta provides the broader scaffolding (agents, benchmarks, standards) for trustworthy scientific AI, while OpenScholar is a specialised literature-synthesis system that plugs into that ecosystem and supports evidence-grounded workflows. Both emphasize citing sources, enabling reproducible, inspectable reasoning steps, and providing useful open tech-stack tools for researchers and developers.

3 Elicit

Elicit: The AI Research Assistant uses large language models to automate parts of research workflows:

Elicit uses language models to help you automate research workflows, like parts of literature review.

Elicit can find relevant papers without perfect keyword match, summarise takeaways from the paper specific to your question, and extract key information from the papers.

While answering questions with research is the main focus of Elicit, there are also other research tasks that help with brainstorming, summarisation, and text classification.

4 Consensus

Search - Consensus: AI Search Engine for Research

Consensus is the AI-powered academic search engine

Search & analyze 200M+ peer-reviewed research papers

4.1 OpenAI Deep Research

Introducing Deep Research from OpenAI.

An agent that uses reasoning to synthesise large amounts of online information and complete multi-step research tasks for you.

4.2 Perplexity AI

It also offers a deep research feature.

5 Tool Universe

Once the setup is complete, the AI scientist operates as follows: given a user instruction or task, it formulates a plan or hypothesis, employs the tool finder in ToolUniverse to identify relevant tools, and iteratively applies these tools to gather information, conduct experiments, verify hypotheses, and request human feedback when necessary. For each required tool call, the AI scientist generates arguments that conform to the ToolUniverse protocol, after which ToolUniverse executes the tool and returns the results for further reasoning.
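A stripped-down version of that loop might look like the following. The tool-finder, execution, and parsing calls here are hypothetical placeholders, not ToolUniverse's actual protocol, which the project's own documentation specifies.

```python
# Sketch of the plan -> find tool -> call tool -> reason loop described above.
# `find_tools`, `execute`, and the JSON reply format are hypothetical
# placeholders, not the real ToolUniverse API.
import json

def find_tools(task_description: str) -> list[dict]:
    """Return tool specs relevant to the task (placeholder for the tool finder)."""
    ...

def execute(tool_name: str, arguments: dict) -> dict:
    """Run a tool with schema-conformant arguments and return its result (placeholder)."""
    ...

def llm(prompt: str) -> str:
    """Call the underlying language model (placeholder)."""
    ...

def parse_tool_call(decision: str) -> tuple[str, dict]:
    """Parse the model's tool choice; assumes it replied with JSON (placeholder)."""
    call = json.loads(decision)
    return call["tool"], call["arguments"]

def run_agent(instruction: str, max_steps: int = 10) -> str:
    plan = llm(f"Propose a plan or hypothesis for: {instruction}")
    notes = []
    for _ in range(max_steps):
        tools = find_tools(plan)
        # The model picks one tool and fills in arguments matching its schema.
        decision = llm(
            f"Plan: {plan}\nFindings: {notes}\nTools: {tools}\n"
            "Pick a tool and JSON arguments, or say DONE."
        )
        if decision.strip().startswith("DONE"):
            break
        tool_name, arguments = parse_tool_call(decision)
        notes.append(execute(tool_name, arguments))
    return llm(f"Summarise what was learned about: {instruction}\nFindings: {notes}")
```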

6 SciLire

SciLire: Science Literature Review AI Tool from CSIRO.

I just saw a tech demo from my colleagues. It looks promising for high-speed, AI-augmented literature review. On the other hand, CSIRO — like most Australian tech projects — isn’t well resourced, so my optimism is tempered.

6.1 ResearchRabbit

ResearchRabbit:

  • Spotify for Papers: Just like in Spotify, you can add papers to collections. ResearchRabbit learns what you love and improves its recommendations!
  • Personalised Digests: Keep up with the latest papers related to your collections! If we’re not confident something’s relevant, we don’t email you—no spam!
  • Interactive Visualisations: Visualise networks of papers and co-authorships. Use graphs as new “jumping off points” to dive even deeper!
  • Explore Together: Collaborate on collections, or help kickstart someone’s search process! And leave comments as well!

6.2 scite

scite: See how research has been cited

Citations are classified by a deep learning model trained to identify three categories of citation statement: those providing supporting evidence for the cited work, those providing contrasting evidence, and those that merely mention the cited study without weighing in on its validity. The classification is by rhetorical function, not by positive or negative sentiment.

  • Citations are not classified as supporting or contrasting by positive or negative keywords.
  • A Supporting citation can have a negative sentiment and a Contrasting citation can have a positive sentiment. Sentiment and rhetorical function are not correlated.
  • Supporting and Contrasting citations do not necessarily indicate that the exact set of experiments was performed. For example, if a paper finds that drug X causes phenomenon Y in mice and a subsequent paper finds that drug X causes phenomenon Y in yeast but both come to this conclusion with different experiments—this would be classified as a supporting citation, even though identical experiments were not performed.
  • Citations that simply use the same method, reagent, or software are not classified as supporting. To identify methods citations, you can filter by the section.
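As a toy illustration of that three-way rhetorical classification (not scite's actual model or API), a fine-tuned text classifier could be applied to individual citation statements along these lines; the checkpoint name below is hypothetical.

```python
# Toy three-way citation-statement classifier (supporting / contrasting /
# mentioning). Sketch only: "my-org/citation-intent-model" is a hypothetical
# fine-tuned checkpoint, not scite's model.
from transformers import pipeline

classifier = pipeline("text-classification", model="my-org/citation-intent-model")

statements = [
    "Our results replicate the effect reported by Smith et al. (2020).",
    "In contrast to Smith et al. (2020), we observe no such effect.",
    "We follow the preprocessing pipeline of Smith et al. (2020).",
]

for s in statements:
    # With a suitably trained model, the expected labels would be
    # supporting, contrasting, and mentioning, respectively.
    print(classifier(s))
```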

For full technical details, including exactly how we do classification and what the classifications and classification confidence mean, please read our recent publication describing how scite was built (Nicholson et al. 2021).

7 Incoming

8 References

Asai, He, Shao, et al. 2024. “OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs.”
Beltagy, Lo, and Cohan. 2019. “SciBERT: A Pretrained Language Model for Scientific Text.”
Borondo, Borondo, Rodriguez-Sickert, et al. 2014. “To Each According to Its Degree: The Meritocracy and Topocracy of Embedded Markets.” Scientific Reports.
Channing, and Ghosh. 2025. “AI for Scientific Discovery Is a Social Problem.”
Cohan, Feldman, Beltagy, et al. 2020. “SPECTER: Document-Level Representation Learning Using Citation-Informed Transformers.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Coscia, and Vandeweerdt. 2022. “Posts on Central Websites Need Less Originality to Be Noticed.” Scientific Reports.
Gao, Ju, Jiang, et al. 2025. “A Semantic Search Engine for Mathlib4.”
Kang, Zhang, Jiang, et al. 2024. “Taxonomy-Guided Semantic Indexing for Academic Paper Search.”
Nicholson, Mordaunt, Lopez, et al. 2021. “Scite: A Smart Citation Index That Displays the Context of Citations and Classifies Their Intent Using Deep Learning.” Quantitative Science Studies.
Shen, Lin, Zhang, et al. 2023. “RTVis: Research Trend Visualization Toolkit.”
Singh, D’Arcy, Cohan, et al. 2022. “SciRepEval: A Multi-Format Benchmark for Scientific Document Representations.”
Wang, Fu, Du, et al. 2023. “Scientific Discovery in the Age of Artificial Intelligence.” Nature.