Improving peer review

Incentives for truth seeking at the micro scale and how they might be improved

2020-05-16 — 2026-02-09

Wherein review credits are required for submission, reviewer assignment is randomized for fairness, and disputes are adjudicated by bounded AI arbitration, whilst journals and ML venues are contrasted.

academe
agents
collective knowledge
economics
faster pussycat
game theory
how do science
incentive mechanisms
institutions
mind
networks
provenance
sociology
wonk

On designing peer review systems for validating scientific output.

Reputation systems, collective decision making, groupthink management, Bayesian elicitation, and other mechanism considerations for trustworthy evaluation of science, a.k.a. our collective knowledge of reality itself.

1 Understanding the Review Process

If we come from “classic” journals and wander into ML conferences, the ground rules really are different. If we’re coming from outside research, the whole thing may seem bizarre. Let’s unpack the two main variants of peer review that show up in my corner of the world; I’m not claiming this is representative of every scientific field, and it probably has little to do with the humanities.

1.1 Classic journals

Most journals run rolling submissions with editor triage (desk rejections), assignment to volunteer referees (often 2–3), and potentially multiple rounds over months to a year. Blinding varies by field (single- vs double-blind). A growing trend is transparent peer review, where reviewer reports (and sometimes author replies and editorial decisions) are published alongside the article—now standard at Nature for accepted papers, building on earlier opt-in pilots.

1.2 ML conferences

Figure 1: Vesalius pioneers scientific review by peering (Credit: the University of Basel)

For example: NeurIPS, ICLR. The benchmark-heavy world of machine learning has a slightly different implied scientific method than the average empirical science field, so the details of peer review may differ (Hardt 2025).

The review process is different too. Big ML conferences run in batch, deadline-driven cycles with thousands of submissions. After a call for papers, there’s bidding/assignment, double-blind reviewing (3–6 reviewers), an author response window, and a discussion moderated by an area chair who apparently does not sleep for the duration of the discussion period. Final decisions are aggregated (read: declared) by senior chairs. Timelines are measured in weeks, not months. In ICLR, the whole thing plays out in public on OpenReview (public comments + author revisions during the discussion period), while NeurIPS runs standard double-blind review with AC-led discussions and rebuttals. See also current venue pages for concrete expectations—for example, NeurIPS reviewer responsibilities for a flavour.

TIL that OpenReview.net was based on a workshop paper proposal: Soergel, Saunders, and McCallum (2013):

Across a wide range of scientific communities, there is growing interest in accelerating and improving the progress of scholarship by making the peer review process more open. Multiple new publication venues and services are arising, especially in the life sciences, but each represents a single point in the multi-dimensional landscape of paper and review access for authors, reviewers and readers. In this paper, we introduce a vocabulary for describing the landscape of choices regarding open access, formal peer review, and public commentary. We argue that the opportunities and pitfalls of open peer review warrant experimentation in these dimensions, and discuss desiderata of a flexible system. We close by describing OpenReview.net, our web-based system in which a small set of flexible primitives support a wide variety of peer review choices, and which provided the reviewing infrastructure for the 2013 International Conference on Learning Representations. We intend this software to enable trials of different policies, in order to help scientific communities explore open scholarship while addressing legitimate concerns regarding confidentiality, attribution, and bias.

1.3 Hybrids and experiments

  • TMLR (Transactions on Machine Learning Research). Rolling submissions with journal-style reviewing aimed at shorter, conference-shaped ML papers; fast turnaround, double-blind, and public reviews on OpenReview once accepted. In practice, it sits between a conference and a traditional journal in cadence and openness.
  • eLife’s “Reviewed Preprint” (since 2023). eLife reviews preprints and publishes public reviews and an editorial assessment; it doesn’t issue accept/reject decisions in the classic sense. This explicitly separates evaluation from gatekeeping.
  • F1000Research (and F1000). Publishes first, then reviews; peer review is fully open (reviewer names and reports), and versions can be updated in response. It’s the clearest example of post-publication, transparent review at scale.

2 Theories of Peer Review

These ecosystems imply different incentives and failure modes. Conferences optimize for speed and triage under load (good for timely results; vulnerable to assignment noise, miscalibration, and strategic behaviour). Journals optimize for depth and iteration (good for careful revisions; vulnerable to long delays and opacity, though transparency is improving).

Researchers have long attempted to understand the dynamics of the reviewing process through mathematical models (e.g., Cole, Jr, and Simon (1981); Lindsey (1988); Whitehurst (1984)). These models often explore how chance, bias, and reviewer reliability affect outcomes.

NeurIPS has made a habit of testing its own review process meta-scientifically, and of trying to improve it. This empirical turn has shifted the focus towards evidence-based interventions (see Nihar B. Shah et al. 2016; Ragone et al. 2013) and towards analyses of the 2014 NIPS experiment. I am focusing on ML research here, but I think some of the ideas will carry over.

The Unjournal does some post-publication review that I’m super interested in:

The Unjournal is making research better by evaluating what really matters. We aim to make rigorous research more impactful and impactful research more rigorous.

The academic journal system is out of date, discourages innovation, and encourages rent-seeking.

The Unjournal is not a journal. We don’t publish research. Instead, we commission (and pay for) open, rigorous expert evaluation of publicly-hosted research. We make it easier for researchers to get feedback and credible ratings of their work, so they can focus on doing better research rather than journal-shopping.

We currently focus on quantitative work that informs global priorities, especially in economics, policy, and social science. We focus on what’s practically important to researchers, policy-makers, and the world.

Their process is documented online: recommended reading for anyone interested in peer review reform.

3 The Parts of Peer Review Mechanisms

A fundamental difficulty in peer review (as in, well, every human system) is aligning the incentives of the participants with the goals of the system. Reviewing is often time-consuming and poorly compensated, either financially or professionally. Furthermore, social dynamics and status incentives—such as the potential social cost of rejecting a prominent author’s work—often shape reviewer behaviour more strongly than abstract commitments to quality.

Many attempts to improve peer review focus on process tweaks that fail to address this underlying payoff matrix. However, recent work in mechanism design suggests several structural approaches that might reorient incentives towards more effortful and truthful evaluations. These often involve (A) linking the roles of author and reviewer, (B) introducing randomness to deter manipulation, and (C) using sophisticated methods to elicit and score judgments.

The following sections explore these proposals and others derived from the literature that could be considered for an experimental journal design.

3.1 Linking Author and Reviewer Roles

One significant incentive problem is free-riding: authors benefit from the review system without contributing adequately to it.

One proposed solution involves formally linking the roles of author and reviewer. This is often framed technically as “admission control” in a one-sided market where everyone participates in both capacities. The core idea is to tie an author’s eligibility or priority for submission to their recent contributions to the review process. Recent models suggest that standard practices (like random matching) can lead to arbitrarily poor reviewer effort. In contrast, Admission Control (AC) mechanisms can theoretically achieve high welfare if effort is observable, and maintain significant benefits even when effort signals are noisy (Nihar B. Shah 2025; Xiao, Dörfler, and Schaar 2014). This concept is already influencing policy at some venues, which may desk-reject submissions from authors known to be irresponsible reviewers.

Implementation Considerations:

A practical implementation could involve a system of review credits. Submitting a paper might require a deposit of credits earned through recent, substantive reviewing. This avoids cash accounting but requires a mechanism to measure review quality, which can be noisy, and must accommodate new authors entering the community.

Alternatively, instead of hard bans for non-compliance, a journal might use stochastic gating. Papers from authors who have not met their review obligations could be routed to larger panels or require higher consensus for acceptance. This approach is softer and aligns with the AC model’s variants for noisy effort detection.
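To make this concrete, here is a minimal sketch of a credit ledger with stochastic gating, in Python. Everything in it is a placeholder of my own invention: the credit prices, the gating probability, and the panel-size policy are illustrative parameters, not a worked-out mechanism.

```python
import random
from dataclasses import dataclass, field

SUBMISSION_COST = 2    # credits debited per submission (made-up policy)
CREDIT_PER_REVIEW = 1  # credits earned per substantive review

@dataclass
class ReviewLedger:
    """Toy admission control: authors earn credits by reviewing and spend
    them when submitting; shortfalls trigger stochastic gating, not bans."""
    credits: dict = field(default_factory=dict)

    def record_review(self, reviewer_id: str, quality_ok: bool = True) -> None:
        # Only credit reviews that pass whatever (noisy) quality check we trust.
        if quality_ok:
            self.credits[reviewer_id] = self.credits.get(reviewer_id, 0) + CREDIT_PER_REVIEW

    def route_submission(self, author_id: str, rng: random.Random) -> dict:
        """Return a routing decision for the submission, not a verdict on it."""
        balance = self.credits.get(author_id, 0)
        if balance >= SUBMISSION_COST:
            self.credits[author_id] = balance - SUBMISSION_COST
            return {"panel_size": 3, "accept_threshold": 6.0}
        # Stochastic gating: short-on-credit authors face a larger panel and a
        # higher bar with some probability, rather than a hard desk rejection.
        if rng.random() < 0.7:
            return {"panel_size": 5, "accept_threshold": 6.5}
        return {"panel_size": 3, "accept_threshold": 6.0}

ledger = ReviewLedger()
ledger.record_review("alice")
ledger.record_review("alice")
rng = random.Random(0)
print(ledger.route_submission("alice", rng))  # paid in credits: standard panel
print(ledger.route_submission("bob", rng))    # no credits: routed stochastically
```

The point of returning a routing decision rather than a verdict is that non-compliant authors still get reviewed, just under heavier scrutiny, which matches the softer admission-control variants discussed above.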

This looks punitive, but the theoretical grounding suggests that linking roles deters low-effort participation and can improve the precision of decisions for high-quality papers, leading to overall welfare improvements. This kind of system probably makes sense coupled with safeguards against identity fraud and collusion rings, which have been documented in some fields (Littman 2021).

FWIW I always wonder why my reviewer rank is not as well publicised as my citation count.

3.2 Strategic Use of Randomness

The process of assigning reviewers to papers is crucial, yet vulnerable. If the assignment process is too predictable, it can be manipulated. Authors might attempt to influence assignments to secure favourable reviewers, and reviewers might engage in “bidding” strategies to review specific papers. Furthermore, deterministic assignment processes can make it difficult to conduct clean experiments on the review process itself and may inadvertently compromise anonymity.

Introducing controlled randomness into reviewer assignment and process design can act as a prophylactic against these issues (Y. E. Xu et al. 2024; Jecmen et al. 2022; Stelmakh, Shah, and Singh 2021).

Specific Approaches:

Modern methods for randomized assignment aim to maintain high match quality (based on expertise) while incorporating randomness to improve robustness against manipulation and facilitate A/B testing of policies. NeurIPS piloted such an approach in 2024, reporting similar assignment quality with added benefits (NeurIPS 2024 postmortem).
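As a toy illustration of the quality/robustness trade-off, here is a sketch that samples reviewers for each paper with probability proportional to exp(similarity / temperature) while respecting load caps. To be clear, this is not the capped-marginal optimization of Jecmen et al. (2020) or the NeurIPS 2024 pilot; it only shows how a temperature knob trades a little match quality for unpredictability.

```python
import numpy as np

def randomized_assignment(similarity: np.ndarray, revs_per_paper: int,
                          max_load: int, temperature: float = 0.2,
                          seed: int = 0) -> list[tuple[int, list[int]]]:
    """Toy randomized assignment: for each paper, sample reviewers without
    replacement with probability proportional to exp(similarity/temperature),
    skipping reviewers at their load cap. Lower temperature = greedier."""
    rng = np.random.default_rng(seed)
    n_papers, n_reviewers = similarity.shape
    load = np.zeros(n_reviewers, dtype=int)
    assignment = []
    for p in rng.permutation(n_papers):  # random paper order
        chosen: list[int] = []
        for _ in range(revs_per_paper):
            cands = np.array([r for r in range(n_reviewers)
                              if load[r] < max_load and r not in chosen])
            logits = similarity[p, cands] / temperature
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            r = int(rng.choice(cands, p=probs))
            chosen.append(r)
            load[r] += 1
        assignment.append((int(p), chosen))
    return assignment

# Fake similarities: 6 papers, 8 reviewers, 3 reviews per paper, load cap 3.
S = np.random.default_rng(1).random((6, 8))
print(randomized_assignment(S, revs_per_paper=3, max_load=3))
```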

In multi-phase review designs, randomly splitting the reviewer pool—saving a random subset for later phases—has been shown to be near-optimal for maintaining assignment similarity and enabling clean experiments (Jecmen et al. 2022).

Beyond randomness, the assignment objective itself can be reconsidered. Instead of maximizing total similarity (which might leave some papers with poor matches), one might aim for max-min review quality (“help the worst-off paper”). Algorithms like PeerReview4All attempt to optimize for fairness while maintaining competitive accuracy (Stelmakh, Shah, and Singh 2021).
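For flavour, here is a greedy heuristic for that max-min objective: repeatedly hand the currently worst-off paper its best still-available reviewer. PeerReview4All itself uses a more careful flow-based construction with guarantees that this sketch lacks.

```python
import numpy as np

def greedy_maxmin_assignment(similarity: np.ndarray, revs_per_paper: int,
                             max_load: int) -> dict[int, list[int]]:
    """Greedy 'help the worst-off paper' heuristic (assumes enough capacity:
    n_reviewers * max_load >= n_papers * revs_per_paper)."""
    n_papers, n_reviewers = similarity.shape
    load = np.zeros(n_reviewers, dtype=int)
    assigned: dict[int, list[int]] = {p: [] for p in range(n_papers)}
    quality = np.zeros(n_papers)  # running total similarity per paper
    for _ in range(n_papers * revs_per_paper):
        # The paper that still needs reviewers and is currently worst served.
        needy = [p for p in range(n_papers) if len(assigned[p]) < revs_per_paper]
        p = min(needy, key=lambda q: quality[q])
        # Its best reviewer with spare capacity who isn't already on the paper.
        cands = [r for r in range(n_reviewers)
                 if load[r] < max_load and r not in assigned[p]]
        r = max(cands, key=lambda r: similarity[p, r])
        assigned[p].append(r)
        load[r] += 1
        quality[p] += similarity[p, r]
    return assigned

S = np.random.default_rng(1).random((6, 8))
print(greedy_maxmin_assignment(S, revs_per_paper=3, max_load=3))
```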

While randomness might slightly reduce the quality of the single “best match” in some cases, the gains in robustness to collusion and the ability to evaluate the system often outweigh this cost (Nihar B. Shah et al. 2016; Nihar B. Shah 2025).

3.3 Post-publication review and open discussion

TBD

3.4 Eliciting Better Judgments through Forecasting and Scoring

Traditional peer review often relies on simple scoring rubrics, which may not capture the nuance of a reviewer’s judgment or incentivize truthful reporting. An alternative approach is to ask reviewers for probabilistic forecasts about observable future outcomes and evaluate these forecasts using proper scoring rules and the other machinery of Bayesian elicitation.

Proper scoring rules (like the log or Brier score) are designed such that a reviewer maximizes their expected score by reporting their true beliefs (Gneiting and Raftery 2007). For example, reviewers could be asked to forecast the probability that an independent audit will replicate the main claim, or the probability that an arbiter will flag a major error.
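A minimal sketch of the bookkeeping, assuming a hypothetical forecast question (“will an independent audit replicate the main claim?”) that resolves to 0 or 1:

```python
import math

def brier_score(p: float, outcome: int) -> float:
    """Squared-error loss for a probabilistic forecast; lower is better."""
    return (p - outcome) ** 2

def log_score(p: float, outcome: int, eps: float = 1e-6) -> float:
    """Negative log-likelihood of the realized outcome; lower is better."""
    p = min(max(p, eps), 1 - eps)
    return -math.log(p if outcome == 1 else 1 - p)

# Hypothetical forecasts: probability that an independent audit replicates
# the paper's main claim.
forecasts = {"rev_A": 0.9, "rev_B": 0.4}
audit_replicated = 1  # observed later
for rev, p in forecasts.items():
    print(rev, round(brier_score(p, audit_replicated), 3),
          round(log_score(p, audit_replicated), 3))
```

Both rules are proper (written here as losses), so a reviewer minimizes expected loss by reporting their actual belief rather than hedging towards what they expect the panel to conclude.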

Over time, as outcomes are observed, this approach allows the system to calibrate reviewers and identify those who are consistently informative. This can be combined with machine learning techniques to aggregate criteria-to-decision mappings, reducing the impact of idiosyncratic weights (Noothigattu, Shah, and Procaccia 2021).

Another interesting mechanism involves eliciting information from authors themselves. The “You are the best reviewer of your own paper” mechanism proposes eliciting an author-provided ranking among their own submissions. When combined with reviewer scores, this information seems to improve decision quality (Su 2021), which surprises me, because the incentives seem off.

The primary challenge in implementing these methods is defining outcomes that can be observed within a reasonable timeframe (e.g., an internal audit verdict rather than long-term citations).

3.5 Calibration and aggregation of ratings

A persistent challenge in evaluating reviews is that reviewers use scoring rubrics (i.e. scores from 1 to 10 or whatever) differently. Although reviewers all use the same nominal scale when assessing a paper, their scores are often not well calibrated to one another. Some reviewers may be systematically harsher or more lenient. This is a familiar problem in survey methodology, where respondents’ scores are, at best, internally consistent.

Simply averaging the scores can be problematic, as it “bakes in” this miscalibration. Furthermore, as we know from social choice theory, simple averaging may not be the optimal way to aggregate different perspectives. The literature suggests several approaches that might improve the aggregation process:

  • Community-learned aggregation: Instead of assuming a fixed relationship between aspect scores (e.g., novelty, rigour) and the overall recommendation, we can learn a mapping (e.g., L(1,1)-style) that reflects the community’s preferences. This helps reduce the impact of individual reviewers’ idiosyncratic weighting schemes (Noothigattu, Shah, and Procaccia 2021).
  • Calibration with confidence: Asking reviewers to provide not just a score but also a confidence level can help in calibration. By analyzing confidence across the panel graph, it may be possible to adjust for differences in scale usage and leniency (MacKay et al. 2017).
  • Least-squares calibration: This framework offers a flexible approach to correcting for bias and noise beyond simple linear miscalibration (Tan et al. 2021). TODO: revisit this one — it sounds fun.

A simpler approach might involve collecting confidence signals and learning a post-hoc transformation to map each reviewer’s scores to a shared scale.
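One simple version of that post-hoc transformation is an additive model, observed score ≈ paper quality + reviewer offset, fitted by least squares. This is cruder than the calibration schemes cited above, but it conveys the shape of the computation:

```python
import numpy as np

def calibrate_scores(scores: list[tuple[int, int, float]],
                     n_papers: int, n_reviewers: int) -> np.ndarray:
    """Fit score[p, r] ~ quality[p] + offset[r] by least squares and return
    estimated paper qualities with each reviewer's constant harshness or
    leniency removed."""
    rows, y = [], []
    for p, r, s in scores:
        row = np.zeros(n_papers + n_reviewers)
        row[p] = 1.0             # paper-quality coefficient
        row[n_papers + r] = 1.0  # reviewer-offset coefficient
        rows.append(row)
        y.append(s)
    theta, *_ = np.linalg.lstsq(np.vstack(rows), np.array(y), rcond=None)
    offsets = theta[n_papers:]
    # Identified only up to a constant shift, so report qualities on the
    # average reviewer's scale.
    return theta[:n_papers] + offsets.mean()

# (paper, reviewer, raw score): reviewer 1 is systematically harsher.
obs = [(0, 0, 7.0), (0, 1, 5.0), (1, 0, 6.0), (1, 1, 4.0), (2, 0, 8.0), (2, 1, 6.5)]
print(calibrate_scores(obs, n_papers=3, n_reviewers=2))
```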

3.6 Eliciting Truth without Ground Truth

Rewarding high-quality reviews is difficult when there is no objective “ground truth” against which to measure them. Peer-prediction mechanisms, such as the Bayesian Truth Serum (Prelec 2004) and Peer Truth Serum (Radanovic, Faltings, and Jurca 2016), offer a theoretical way to incentivize honest and effortful reports by comparing reviewers’ reports against each other. These mechanisms have inspired concrete designs for peer review, including proposals for auction-funded reviewer payments scored via peer prediction (Srinivasan and Morgenstern 2023).
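To show the mechanics, here is a toy version of the Peer Truth Serum payment rule: agreeing with a randomly chosen peer pays off in inverse proportion to how common that answer was expected to be. The real mechanism’s incentive guarantees rest on assumptions about shared priors and belief updates that this sketch ignores.

```python
import random

def pts_payment(report: str, peer_report: str, prior: dict[str, float]) -> float:
    """Toy Peer Truth Serum payment: 1/prior[x] if your report x matches a
    randomly chosen peer's report, minus 1, so parroting the common answer
    earns little and rare-but-corroborated answers earn a lot."""
    match = 1.0 if report == peer_report else 0.0
    return match / prior[report] - 1.0

# Hypothetical binary signal: does the paper have a major flaw?
prior = {"flaw": 0.2, "no_flaw": 0.8}
reports = {"rev_A": "flaw", "rev_B": "flaw", "rev_C": "no_flaw"}

rng = random.Random(0)
for rev, x in reports.items():
    peer = rng.choice([r for r in reports if r != rev])
    print(rev, "vs", peer, "->", round(pts_payment(x, reports[peer], prior), 2))
```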

However, these mechanisms introduce significant conceptual and UX complexity and can be brittle under conditions of collusion or correlated errors. For these reasons, they may be challenging to implement initially. A lighter alternative might be to combine forecast scoring with post-hoc audits to generate observable outcomes.

3.7 LLM-Assisted Arbitration

The dialogue between authors and reviewers, particularly during rebuttals, is productive and helpful in principle, but inconsistent in practice. Meta-reviewers may struggle to adjudicate complex technical disputes efficiently.

A recent proposal suggests inserting a bounded-round arbitration step using AI adjudicators (Allen-Zhu and Xu 2025). The argument is that arbitration (analysing an existing dispute) may be cognitively easier than full reviewing (generating a novel critique), and that current language models may already have sufficient competence for this task (L1 or L2 competence) in some domains. Cf. debate protocols and other interesting scalable-oversight-style alignment ideas.

In such a protocol, after authors submit a structured rebuttal to reviewer comments, an AI arbitrator (using a standardized prompt and fixed turns) evaluates specific contested points and issues a finding with a confidence score and citations to the text. Human meta-reviewers then consider this finding alongside the original human reviews. Key design features include bounded rounds of interaction, ensuring authors get the last word (to counter the advantages reviewers gain from anonymity), and logging all arbitrator chats for auditability (Allen-Zhu and Xu 2025).
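Here is a skeleton of what such a protocol could look like in code, with a stubbed-out arbitrator call. The function names, round limit, and return format are my own placeholders, not anything specified in Allen-Zhu and Xu (2025); a real implementation would swap in an actual model client and a standardized prompt.

```python
from dataclasses import dataclass, field

MAX_ROUNDS = 2  # bounded interaction

@dataclass
class ArbitrationRecord:
    point_id: str
    transcript: list = field(default_factory=list)  # full audit log
    finding: str = ""
    confidence: float = 0.0

def call_arbitrator(prompt: str) -> dict:
    """Stub for a language-model call (hypothetical). Expected to return a
    finding, a confidence, and citations into the manuscript text."""
    return {"finding": "reviewer claim unsupported", "confidence": 0.6,
            "citations": ["Sec. 4.2"]}

def arbitrate(point_id: str, reviewer_claim: str, author_rebuttal: str) -> ArbitrationRecord:
    rec = ArbitrationRecord(point_id)
    context = f"Claim: {reviewer_claim}\nRebuttal: {author_rebuttal}"
    for rnd in range(MAX_ROUNDS):
        result = call_arbitrator(f"Round {rnd}: evaluate the contested point.\n{context}")
        rec.transcript.append({"round": rnd, "prompt": context, "result": result})
        rec.finding, rec.confidence = result["finding"], result["confidence"]
    # Authors get the last word: logged for the meta-reviewer, not re-arbitrated.
    rec.transcript.append({"round": "final", "author_last_word": author_rebuttal})
    return rec  # handed to the human meta-reviewer alongside the human reviews
```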

The potential benefits include increased consistency, immediate logic checks, and a searchable audit trail. A neutral arbitrator may also lower the social costs (“face costs”) for reviewers and authors to admit errors. However, challenges include model variance, domain gaps (e.g., in mathematics-heavy areas), and the need for careful governance.

See also AI agents in science for more hands-on ideas about AI roles in the parts of research workflows other than review.

There are many variants on the theme of AI-assisted review, including AI-generated initial reviews, AI-summarized discussions, and AI-flagged statistical checks.

See also

(Couto et al. 2024; Kim, Lee, and Lee 2025; Kuznetsov et al. 2024; Liang et al. 2024; R. Liu and Shah 2023; Lu et al. 2024; Ye et al. 2024)

3.8 Addressing Collusion and Identity Fraud

The integrity of the peer review process is threatened by various forms of manipulation. Documented issues include collusion rings (where authors agree to review each other’s papers favourably) and targeted assignment gaming (Littman 2021). Identity theft of reviewer accounts has also become a concern; a recent report uncovered 94 fake reviewer profiles on OpenReview (Nihar B. Shah 2025).

Several design features can help mitigate these risks:

  1. Randomness in assignments significantly reduces the probability that targeted attempts to influence assignments will succeed (Jecmen et al. 2020; Y. E. Xu et al. 2024).
  2. Cycle-free assignment constraints prohibit simple reciprocal arrangements (e.g., cycles of length ≥ 2 among author-reviewer pairs) (Boehmer, Bredereck, and Nichterlein 2022); a toy check for the length-2 case is sketched after this list.
  3. Enhanced identity verification for reviewers (e.g., ORCID, verifiable employment, or one-time checks) can help combat fake profiles.
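The toy check promised above: given a proposed assignment and authorship records, flag length-2 reciprocal arrangements. Real systems enforce this as a constraint inside the assignment optimizer, but the detection logic itself is simple.

```python
def reciprocal_pairs(assignment: dict[str, list[str]],
                     authors: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Flag pairs of people who are assigned to review each other's papers."""
    reviews = set()  # (reviewer, author) pairs implied by the assignment
    for paper, reviewers in assignment.items():
        for reviewer in reviewers:
            for author in authors[paper]:
                reviews.add((reviewer, author))
    return sorted({tuple(sorted((a, b))) for (a, b) in reviews
                   if (b, a) in reviews and a != b})

authors = {"paper1": ["alice"], "paper2": ["bob"], "paper3": ["carol"]}
assignment = {"paper1": ["bob"], "paper2": ["alice"], "paper3": ["bob"]}
print(reciprocal_pairs(assignment, authors))  # [('alice', 'bob')]
```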

More advanced proposals include pseudonymous persistent reviewer IDs with public reputation scores, linked to admission-control credits, although these are more complex to implement.

3.9 Automated Screening and Checks

Automated tools can supplement the human review process by screening for common issues.

  • Statistical auditing: Tools like Statcheck can automatically parse manuscripts and flag basic statistical inconsistencies; a natural experiment suggests that building such checks into the workflow can lead to large reductions in reporting errors (Nuijten and Wicherts 2024). A toy version of this kind of check is sketched after this list.
  • Reproducibility checks: For computational work, journals can require authors to submit an “auditability pack” (e.g., environment details, code, tests) and run basic execution checks upon submission.
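The toy version mentioned above: recompute two-sided p-values from reported t statistics and flag mismatches. This is a crude imitation of the idea behind Statcheck, not the tool itself, and the regex covers only one common reporting pattern.

```python
import re
from scipy import stats

# Matches reports like "t(24) = 2.53, p = .018".
T_REPORT = re.compile(r"t\((\d+)\)\s*=\s*(-?\d+\.?\d*),\s*p\s*=\s*(\.\d+|\d\.\d+)")

def check_t_reports(text: str, tol: float = 0.005) -> list[dict]:
    """Recompute two-sided p-values from reported t statistics and flag
    mismatches beyond a rounding tolerance. Triage only, never a verdict."""
    flags = []
    for df, t, p in T_REPORT.findall(text):
        recomputed = 2 * stats.t.sf(abs(float(t)), int(df))
        if abs(recomputed - float(p)) > tol:
            flags.append({"reported": f"t({df})={t}, p={p}",
                          "recomputed_p": round(recomputed, 4)})
    return flags

print(check_t_reports("We found an effect, t(24) = 2.53, p = .018."))  # consistent
print(check_t_reports("We found an effect, t(24) = 2.53, p = .30."))   # flagged
```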

It’s (morally speaking) important to treat the output of these tools as a triage signal for editors and reviewers, rather than a basis for automatic rejection, as false positives are expected. Whether these tools produce fewer false positives than humans is an empirical question.

3.10 Anonymity and Bias

The debate over anonymity in peer review (single-blind vs. double-blind vs. open review) is ongoing, with various trade-offs well-known by now:

  • Empirical evidence suggests that double-blind review can reduce prestige bias in some settings (Okike et al. 2016).
  • Open review (signed) has mixed evidence regarding its impact on review quality and can reduce reviewer willingness to participate (van Rooyen et al. 1999).
  • Concerns about herding in discussions (where early comments overly influence later ones) were not supported by a large randomized controlled trial at ICML, which found no evidence that the first discussant’s stance determined the outcome (Stelmakh, Rastogi, Shah, et al. 2023).
  • Observational evidence suggests a potential citation bias, where reviewers may respond more favorably when their own work is cited by the authors (Stelmakh, Rastogi, Liu, et al. 2023).

Given these findings, a hybrid approach might work. We could maintain double-blind conditions during initial reviews, then reveal identities after initial scores are submitted. This “middle ground” aims to balance reducing bias with the benefits of contextual reading and COI checks (Nihar B. Shah 2025).

4 Designing an Experimental Journal

Thought experiment: what would we do if we launched a hypothetical new green-field journal, assuming time and money were no object?

Translating these theoretical proposals and experimental findings into the design of a new journal requires careful consideration of implementation costs and the specific goals of the journal. A pragmatic approach might involve layering several interventions that are compatible with a small editorial team and limited budget, and running incremental experiments to evaluate their impact.

Based on the literature surveyed, several components appear promising for an initial pilot:

  1. Incentive Alignment via Admission Control: Implementing a review credit system to link author submissions with review contributions. Using stochastic penalties (e.g., heavier scrutiny for non-compliant authors) rather than hard bans is crucial when effort measurement is noisy.
  2. Robust and Fair Assignment: Adopting randomized assignment methods with a fairness objective can improve robustness to manipulation and ensure equitable distribution of review quality (Stelmakh, Shah, and Singh 2021; Y. E. Xu et al. 2024). This requires a one-time engineering effort.
  3. Improved Evaluation Metrics: Moving beyond simple averaging by incorporating calibrated aggregation and forecast questions with proper scoring rules. Collecting aspect scores and confidence levels allows for post-hoc calibration (Noothigattu, Shah, and Procaccia 2021; MacKay et al. 2017). Forecasts can be rewarded with review credits (converted to benefits like fee waivers) rather than cash.
  4. Hybrid Anonymity: Employing a “middle-ground reveal” —blind through initial reviews, then revealing identities—to balance bias reduction with contextual checks (Nihar B. Shah 2025).
  5. Supplemental Checks: Integrating automated triage for statistical or reproducibility checks (Nuijten and Wicherts 2024). Potentially introducing a bounded AI arbitration step to assist meta-reviewers in resolving specific disputes (Allen-Zhu and Xu 2025).
  6. Anti-Collusion Measures: Enforcing cycle-free constraints and randomized assignments (Jecmen et al. 2020; Boehmer, Bredereck, and Nichterlein 2022).
  7. Cash Payments for Reviews. Direct cash payments or cryptocurrency mechanisms (the “DOGE 2.0” vision of Allen-Zhu and Xu (2025)) involve complex accounting and policy issues. Starting with a credits-for-benefits system sounds more practical, but cash is immediately understandable and fungible, so it might be worth piloting in a small setting.
  8. Lotteries for Acceptance. Pure lotteries for acceptance, while used by some funding agencies with mixed results (Feliciani, Luo, and Shankar 2024; M. Liu et al. 2020), may be controversial for journals. Improving incentives and calibration should probably take precedence, although a lottery within a narrow “tie band” might be worthwhile.

5 Incoming


6 References

Aksoy, Yanik, and Amasyali. 2023. Reviewer Assignment Problem: A Systematic Review of the Literature.” Journal of Artificial Intelligence Research.
Allen-Zhu, and Xu. 2025. DOGE: Reforming AI Conferences and Towards a Future Civilization of Fairness and Justice.” SSRN Scholarly Paper.
Becerril, Bjørnshauge, Bosman, et al. 2021. The OA Diamond Journals Study, Part 2: Recommendations.” Copyright, Fair Use, Scholarly Communication, Etc.
Boehmer, Bredereck, and Nichterlein. 2022. Combating Collusion Rings Is Hard but Possible.” In Proceedings of the AAAI Conference on Artificial Intelligence.
Bosman, Frantsvåg, Kramer, et al. 2021. The OA Diamond Journals Study. Part 1: Findings.”
Budish, Che, Kojima, et al. 2009. “Implementing Random Assignments: A Generalization of the Birkhoff-von Neumann Theorem.” In Cowles Summer Conference.
Charlin, and Zemel. 2013. The Toronto Paper Matching System: An Automated Paper-Reviewer Assignment System.”
Charlin, Zemel, and Boutilier. 2011. A Framework for Optimizing Paper Matching.” In UAI2011.
Cole, Jr, and Simon. 1981. Chance and Consensus in Peer Review.” Science.
Cousins, Payan, and Zick. 2023. Into the Unknown: Assigning Reviewers to Papers with Uncertain Affinities.” In Algorithmic Game Theory.
Couto, Ho, Kumari, et al. 2024. RelevAI-Reviewer: A Benchmark on AI Reviewers for Survey Paper Relevance.”
Deligkas, and Filos-Ratsikas. 2023. Algorithmic Game Theory: 16th International Symposium, SAGT 2023, Egham, UK, September 4–7, 2023, Proceedings.
Eve, Neylon, O’Donnell, et al. 2021. Reading Peer Review: PLOS ONE and Institutional Change in Academia.” Elements in Publishing and Book Culture.
Faltings, Jurca, and Radanovic. 2017. Peer Truth Serum: Incentives for Crowdsourcing Measurements and Opinions.”
Feliciani, Luo, and Shankar. 2024. Funding Lotteries for Research Grant Allocation: An Extended Taxonomy and Evaluation of Their Fairness.” Research Evaluation.
Fernandes, Siderius, and Singal. 2025. Peer Review Market Design: Effort-Based Matching and Admission Control.” SSRN Scholarly Paper.
Flach, Spiegler, Golénia, et al. 2010. Novel Tools to Streamline the Conference Review Process: Experiences from SIGKDD’09.” SIGKDD Explor. Newsl.
Freund, Iyer, Schapire, et al. 2003. An Efficient Boosting Algorithm for Combining Preferences.” The Journal of Machine Learning Research.
Gasparyan, Gerasimov, Voronov, et al. 2015. Rewarding Peer Reviewers: Maintaining the Integrity of Science Communication.” Journal of Korean Medical Science.
Gneiting, and Raftery. 2007. Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association.
Goldberg, Stelmakh, Cho, et al. 2025. Peer Reviews of Peer Reviews: A Randomized Controlled Trial and Other Experiments.” PLOS ONE.
Goldsmith, and Sloan. 2007. The AI Conference Paper Assignment Problem.” In Proc. AAAI Workshop on Preference Handling for Artificial Intelligence.
Hardt. 2025. The Emerging Science of Machine Learning Benchmarks.
Jecmen, Zhang, Liu, et al. 2020. Mitigating Manipulation in Peer Review via Randomized Reviewer Assignments.” In Advances in Neural Information Processing Systems.
Jecmen, Zhang, Liu, et al. 2022. Near-Optimal Reviewer Splitting in Two-Phase Paper Reviewing and Conference Experiment Design.” In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. AAMAS ’22.
Kim, Lee, and Lee. 2025. Position: The AI Conference Peer Review Crisis Demands Author Feedback and Reviewer Rewards.”
Kuznetsov, Afzal, Dercksen, et al. 2024. What Can Natural Language Processing Do for Peer Review? In CoRR.
Liang, Zhang, Cao, et al. 2024. Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis.” NEJM AI.
Lindsey. 1988. Assessing Precision in the Manuscript Review Process: A Little Better Than a Dice Roll.” Scientometrics.
Littman. 2021. Collusion Rings Threaten the Integrity of Computer Science Research.” Communications of the ACM.
Liu, Mengyao, Choy, Clarke, et al. 2020. The Acceptability of Using a Lottery to Allocate Research Funding: A Survey of Applicants.” Research Integrity and Peer Review.
Liu, Ryan, and Shah. 2023. ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing.”
Liu, Xiang, Suel, and Memon. 2014. A Robust Model for Paper Reviewer Assignment.” In Proceedings of the 8th ACM Conference on Recommender Systems.
Luce. 2001. “Reduction Invariance and Prelec’s Weighting Functions.” Journal of Mathematical Psychology.
Lu, Lu, Lange, et al. 2024. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.”
MacKay, Kenna, Low, et al. 2017. Calibration with Confidence: A Principled Method for Panel Assessment.” Royal Society Open Science.
Marcoci, Vercammen, Bush, et al. 2022. Reimagining Peer Review as an Expert Elicitation Process.” BMC Research Notes.
Merrifield, and Saari. 2009. Telescope Time Without Tears: A Distributed Approach to Peer Review.” Astronomy & Geophysics.
Mimno, and McCallum. 2007. Expertise Modeling for Matching Papers with Reviewers.” In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’07.
Noothigattu, Shah, and Procaccia. 2021. Loss Functions, Axioms, and Peer Review.” Journal of Artificial Intelligence Research.
Nuijten, and Wicherts. 2024. Implementing Statcheck During Peer Review Is Related to a Steep Decline in Statistical-Reporting Inconsistencies.” Advances in Methods and Practices in Psychological Science.
Okike, Hug, Kocher, et al. 2016. Single-Blind Vs Double-Blind Peer Review in the Setting of Author Prestige.” JAMA.
Potts, Hartley, Montgomery, et al. 2016. A Journal Is a Club: A New Economic Model for Scholarly Publishing.” SSRN Scholarly Paper.
Prelec. 2004. A Bayesian Truth Serum for Subjective Data.” Science.
Prelec, Seung, and McCoy. 2017. A Solution to the Single-Question Crowd Wisdom Problem.” Nature.
Radanovic, and Faltings. 2013. A Robust Bayesian Truth Serum for Non-Binary Signals.” In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. AAAI’13.
Radanovic, Faltings, and Jurca. 2016. Incentives for Effort in Crowdsourcing Using the Peer Truth Serum.” ACM Transactions on Intelligent Systems and Technology.
Ragone, Mirylenka, Casati, et al. 2013. On Peer Review in Computer Science: Analysis of Its Effectiveness and Suggestions for Improvement.” Scientometrics.
Rodriguez, and Bollen. 2008. An Algorithm to Determine Peer-Reviewers.” In Proceedings of the 17th ACM Conference on Information and Knowledge Management. CIKM ’08.
Ross-Hellauer, Deppe, and Schmidt. 2017. Survey on Open Peer Review: Attitudes and Experience Amongst Editors, Authors and Reviewers.” PLOS ONE.
Shah, Nihar B. 2022. Challenges, Experiments, and Computational Solutions in Peer Review.” Communications of the ACM.
Shah, Nihar B. 2025. An Overview of Challenges, Experiments, and Computational Solutions in Peer Review (Extended Version).”
Shah, Nihar B, Tabibian, Muandet, et al. 2016. “Design and Analysis of the NIPS 2016 Review Process.”
Smith. 2006. Peer Review: A Flawed Process at the Heart of Science and Journals.” Journal of the Royal Society of Medicine.
Soergel, Saunders, and McCallum. 2013. Open Scholarship and Peer Review: A Time for Experimentation.”
Srinivasan, and Morgenstern. 2023. Auctions and Peer Prediction for Academic Peer Review.”
Stelmakh, Rastogi, Liu, et al. 2023. Cite-Seeing and Reviewing: A Study on Citation Bias in Peer Review.” PLOS ONE.
Stelmakh, Rastogi, Shah, et al. 2023. A Large Scale Randomized Controlled Trial on Herding in Peer-Review Discussions.” PLOS ONE.
Stelmakh, Shah, and Singh. 2021. PeerReview4All: Fair and Accurate Reviewer Assignment in Peer Review.” Journal of Machine Learning Research.
Su. 2021. You Are the Best Reviewer of Your Own Papers: An Owner-Assisted Scoring Mechanism.” In Advances in Neural Information Processing Systems.
Tang, Tang, and Tan. 2010. Expertise Matching via Constraint-Based Optimization.” In 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
Tan, Wu, Bei, et al. 2021. Least Square Calibration for Peer Reviews.” In Proceedings of the 35th International Conference on Neural Information Processing Systems. Nips ’21.
Taylor. 2008. “On the Optimal Assignment of Conference Papers to Reviewers.”
Thurner, and Hanel. 2010. “Peer-Review in a World with Rational Scientists: Toward Selection of the Average.”
Tran, Cabanac, and Hubert. 2017. Expert Suggestion for Conference Program Committees.” In 2017 11th International Conference on Research Challenges in Information Science (RCIS).
van Rooyen, Godlee, Evans, et al. 1999. Effect of Open Peer Review on Quality of Reviews and on Reviewers’ Recommendations: A Randomised Trial.” BMJ : British Medical Journal.
Vijaykumar. 2020. “Potential Organized Fraud in ACM.”
Ward, and Kumar. 2008. Asymptotically Optimal Admission Control of a Queue with Impatient Customers.” Mathematics of Operations Research.
Whitehurst. 1984. Interrater Agreement for Journal Manuscript Reviews.” American Psychologist.
Wu, Xu, Guo, et al. 2025. An Isotonic Mechanism for Overlapping Ownership.”
Xiao, Dörfler, and Schaar. 2014. Incentive Design in Peer Review: Rating and Repeated Endogenous Matching.”
Xiao, Dörfler, and van der Schaar. 2014. Rating and Matching in Peer Review Systems.” In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton).
Xu, Yixuan Even, Jecmen, Song, et al. 2024. “A One-Size-Fits-All Approach to Improving Randomness in Paper Assignment.” In Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS ’23.
Xu, Yichong, Zhao, and Shi. 2017. “Mechanism Design for Paper Review.”
Ye, Pang, Chai, et al. 2024. Are We There yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review.” In CoRR.