Improving peer review
Incentives for truth seeking at the micro scale and how they might be improved
2020-05-16 — 2025-10-13
Wherein the design of peer review is considered; author review‑credit admission control and randomized reviewer assignment are proposed, and AI arbitration for contested points is described.
On designing peer review systems for validating scientific output.
Reputation systems, collective decision making, groupthink management, Bayesian elicitation and other mechanism considerations for trustworthy evaluation of science, a.k.a. our collective knowledge of reality itself.
1 Understanding the Review Process
If you come from “classic” journals and wander into ML conferences, the ground rules really are different. If you are from the world outside research, the whole thing may seem bizarre. Let us unpack the two main variants of peer review that impinge upon my personal world; I make no claim that this is representative of every scientific field, and it probably has little to do with the humanities.
1.1 Classic journals
Most journals run rolling submissions with editor triage (desk-rejects), assignment to volunteer referees (often 2–3), and potentially multiple rounds over months to a year. Blinding varies by field (single- vs double-blind). A growing trend is transparent peer review, where reviewer reports (and sometimes author replies and editorial decisions) are published alongside the article—now standard at Nature for accepted papers, extending earlier opt-in pilots.
1.2 ML conferences
e.g., NeurIPS, ICLR. The benchmark-heavy world of machine learning has a slightly different implied scientific method than your average empirical science field, so the details of peer review may differ (Hardt 2025).
The review process is rather different too. These venues run on a batch, deadline-driven cycle with thousands of submissions. After a call for papers, there’s bidding/assignment, double-blind reviewing (3–6 reviewers), an author response window, and a discussion moderated by an area chair who apparently does not sleep for the duration. Final decisions are aggregated (read: declared) by senior chairs. Timelines are measured in weeks, not months. At ICLR, the whole thing plays out in public on OpenReview (public comments + author revisions during the discussion period), while NeurIPS runs standard double-blind review with AC-led discussions and rebuttals. See also current venue pages for concrete expectations, e.g. NeurIPS reviewer responsibilities for a flavour.
1.3 Hybrids and experiments
- TMLR (Transactions on Machine Learning Research). Rolling submissions with journal-style reviewing aimed at the shorter, conference-shaped ML paper; fast turnarounds, double-blind, and public reviews on OpenReview once accepted. In practice it sits between a conference and a traditional journal in cadence and openness.
- eLife’s “Reviewed Preprint” (since 2023). eLife reviews preprints and publishes public reviews + an editorial assessment; it does not issue accept/reject in the classic sense. This explicitly separates evaluation from gatekeeping.
- F1000Research (and F1000). Publishes first, then reviews; peer review is fully open (reviewer names and reports), and versions can be updated in response. It’s the clearest example of post-publication, transparent review at scale.
2 Theories of Peer Review
These ecosystems imply different incentives and failure modes. Conferences optimize for speed and triage under load (good for timely results; vulnerable to assignment noise, miscalibration, and strategic behaviour). Journals optimize for depth and iteration (good for careful revisions; vulnerable to long delays and opacity, though transparency is improving).
Researchers have long attempted to understand the dynamics of the reviewing process through mathematical models (e.g., Cole, Jr, and Simon (1981); Lindsey (1988); Whitehurst (1984)). These models often explore how chance, bias, and reviewer reliability affect outcomes.
NeurIPS has made a habit of testing its own review process meta-scientifically, trying to optimise it as it goes. This empirical turn has shifted the focus towards evidence-based interventions (see Nihar B. Shah et al. 2016; Ragone et al. 2013) and analyses of the 2014 NIPS experiment (see also this). I am focusing on ML research here, but I think some of the ideas will carry over.
3 Incentive design
A fundamental difficulty in peer review is aligning the incentives of the participants with the goals of the system. Reviewing is often time-consuming and poorly compensated, either financially or professionally. Furthermore, social dynamics and status incentives—such as the potential social cost of rejecting a prominent author’s work—often shape reviewer behaviour more strongly than abstract commitments to quality.
Many attempts to improve peer review focus on process tweaks that fail to address this underlying payoff matrix. However, recent work in mechanism design suggests several structural approaches that might reorient incentives towards more effortful and truthful evaluations. These often involve (A) linking the roles of author and reviewer, (B) introducing randomness to deter manipulation, and (C) using sophisticated methods to elicit and score judgments.
The following sections explore these proposals and others derived from the literature that could be considered for an experimental journal design.
5 The Strategic Use of Randomness
The process of assigning reviewers to papers is crucial, yet vulnerable. If the assignment process is too predictable, it can be manipulated. Authors might attempt to influence assignments to secure favourable reviewers, and reviewers might engage in “bidding” strategies to review specific papers. Furthermore, deterministic assignment processes can make it difficult to conduct clean experiments on the review process itself and may inadvertently compromise anonymity.
Introducing controlled randomness into reviewer assignment and process design can act as a prophylactic against these issues (Y. E. Xu et al. 2024; Jecmen et al. 2022; Stelmakh, Shah, and Singh 2021).
Specific Approaches:
Modern methods for randomized assignment aim to maintain high match quality (based on expertise) while incorporating randomness to improve robustness against manipulation and facilitate A/B testing of policies. NeurIPS piloted such an approach in 2024, reporting similar assignment quality with added benefits (NeurIPS 2024 postmortem).
In multi-phase review designs, randomly splitting the reviewer pool—saving a random subset for later phases—has been shown to be near-optimal for maintaining assignment similarity and enabling clean experiments (Jecmen et al. 2022).
Beyond randomness, the assignment objective itself can be reconsidered. Instead of maximizing total similarity (which might leave some papers with poor matches), one might aim for max-min review quality (“help the worst-off paper”). Algorithms like PeerReview4All attempt to optimize for fairness while maintaining competitive accuracy (Stelmakh, Shah, and Singh 2021).
While randomness might slightly reduce the quality of the single “best match” in some cases, the gains in robustness to collusion and the ability to evaluate the system often outweigh this cost (Nihar B. Shah et al. 2016; Nihar B. Shah 2025).
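To make the probability-cap idea concrete, here is a toy sketch in the spirit of Jecmen et al. (2020): maximise expected reviewer-paper similarity subject to a cap q on the marginal probability of any single (paper, reviewer) pairing. The similarity matrix is random stand-in data, and the final step of sampling a concrete assignment from the fractional solution is omitted.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_papers, n_reviewers, k, load, q = 4, 6, 2, 2, 0.5
S = rng.uniform(size=(n_papers, n_reviewers))  # stand-in similarity scores

n_vars = n_papers * n_reviewers
c = -S.ravel()  # linprog minimizes, so negate similarity to maximize it

# Each paper gets exactly k reviewers in expectation.
A_eq = np.zeros((n_papers, n_vars))
for p in range(n_papers):
    A_eq[p, p * n_reviewers:(p + 1) * n_reviewers] = 1
b_eq = np.full(n_papers, float(k))

# Each reviewer's expected load is at most `load`.
A_ub = np.zeros((n_reviewers, n_vars))
for r in range(n_reviewers):
    A_ub[r, r::n_reviewers] = 1
b_ub = np.full(n_reviewers, float(load))

# The per-pair probability cap q is what blunts targeted bidding and collusion.
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, q)] * n_vars)
marginals = res.x.reshape(n_papers, n_reviewers)
print(np.round(marginals, 2))  # sampling a concrete assignment from these marginals is omitted
```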
6 Eliciting Better Judgments through Forecasting and Scoring
Traditional peer review often relies on simple scoring rubrics, which may not capture the nuance of a reviewer’s judgment or incentivize truthful reporting. An alternative approach is to ask reviewers for probabilistic forecasts about observable future outcomes and evaluate these forecasts using [proper scoring rules](./calibration.qmd) and the other machinery of Bayesian elicitation.
Proper scoring rules (like the log or Brier score) are designed such that a reviewer maximizes their expected score by reporting their true beliefs (Gneiting and Raftery 2007). For example, reviewers could be asked to forecast the probability that an independent audit will replicate the main claim, or the probability that an arbiter will flag a major error.
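To make that concrete, here is a minimal sketch of scoring reviewer forecasts with the Brier and log scores once the audited outcome is known; the forecast target and reviewer names are illustrative.

```python
import numpy as np

def brier_score(p: float, outcome: int) -> float:
    """Squared error between forecast and binary outcome; lower is better, honesty-optimal in expectation."""
    return (p - outcome) ** 2

def log_score(p: float, outcome: int) -> float:
    """Negative log-likelihood of the outcome; also proper, punishes overconfidence harshly."""
    p = float(np.clip(p, 1e-6, 1 - 1e-6))
    return -(outcome * np.log(p) + (1 - outcome) * np.log(1 - p))

# Each reviewer forecasts "probability that the internal audit replicates the main claim".
forecasts = {"rev_a": 0.9, "rev_b": 0.55, "rev_c": 0.2}
audit_replicated = 1  # observed later (illustrative)

for rev, p in forecasts.items():
    print(rev,
          round(brier_score(p, audit_replicated), 3),
          round(log_score(p, audit_replicated), 3))
```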
Over time, as outcomes are observed, this approach allows the system to calibrate reviewers and identify those who are consistently informative. This can be combined with machine learning techniques to aggregate criteria-to-decision mappings, reducing the impact of idiosyncratic weights (Noothigattu, Shah, and Procaccia 2021).
Another interesting mechanism involves eliciting information from authors themselves. The “You are the best reviewer of your own paper” mechanism proposes eliciting an author-provided ranking among their own submissions. When combined with reviewer scores, this information seems to improve decision quality (Su 2022), which surprises me, since naively the incentives seem off.
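As I understand it, the core of the proposal is an isotonic-regression step: project the noisy raw scores onto the ordering the author claims. A toy version with invented numbers (not Su’s exact estimator):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

raw_scores = np.array([5.5, 6.8, 4.9, 6.1])   # mean reviewer scores for papers 0..3
author_ranking = [0, 1, 3, 2]                  # author's claimed order, best first

ranked = raw_scores[author_ranking]            # scores in the author's claimed order
iso = IsotonicRegression(increasing=False)     # claimed-best paper should score highest
adjusted_in_order = iso.fit_transform(np.arange(len(ranked)), ranked)

adjusted = np.empty_like(raw_scores)
adjusted[author_ranking] = adjusted_in_order   # map back to the original paper indices
print(adjusted)  # scores that contradict the claimed order get pooled together
```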
The primary challenge in implementing these methods is defining outcomes that can be observed within a reasonable timeframe (e.g., an internal audit verdict rather than long-term citations).
7 AI-Assisted Arbitration
The dialogue between authors and reviewers, particularly during rebuttals, is, in principle, productive and helpful, but in practice it’s inconsistent. Meta-reviewers may struggle to adjudicate complex technical disputes efficiently.
A recent proposal suggests inserting a bounded-round arbitration step using AI adjudicators (Allen-Zhu and Xu 2025). The argument is that arbitration (analysing an existing dispute) may be cognitively easier than full reviewing (generating a novel critique), and current language models may already possess sufficient competence for this task (L1 or L2 competence) in some domains. Cf. debate protocols and other interesting scalable-oversight-type alignment ideas.
In such a protocol, after authors submit a structured rebuttal to reviewer comments, an AI arbitrator (using a standardized prompt and fixed turns) evaluates specific contested points and issues a finding with a confidence score and citations to the text. Human meta-reviewers then consider this finding alongside the human reviews. Key design features include bounded rounds of interaction, ensuring authors get the last word (to counter the advantages reviewers gain from anonymity), and logging all arbiter chats for auditability (Allen-Zhu and Xu 2025).
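To fix ideas, here is a schematic of the data model such a step might use. All names and fields are hypothetical illustrations rather than the actual protocol of Allen-Zhu and Xu (2025); the point is the bounded rounds, findings with confidence and citations, and the audit log.

```python
from dataclasses import dataclass, field

@dataclass
class ContestedPoint:
    claim_id: str
    reviewer_position: str
    author_rebuttal: str     # authors supply the last word before arbitration

@dataclass
class ArbiterFinding:
    claim_id: str
    verdict: str             # e.g. "reviewer_supported" / "author_supported" / "unresolved"
    confidence: float        # arbiter's self-reported confidence in [0, 1]
    citations: list[str]     # pointers into the manuscript / rebuttal text

@dataclass
class ArbitrationLog:
    max_rounds: int = 2                                    # bounded rounds by design
    transcript: list[str] = field(default_factory=list)    # full exchange kept for audit
    findings: list[ArbiterFinding] = field(default_factory=list)

def arbitrate(points: list[ContestedPoint], arbiter, log: ArbitrationLog) -> ArbitrationLog:
    """`arbiter` wraps the language model and its standardized prompt; it is a stub here."""
    for rnd in range(log.max_rounds):
        for pt in points:
            finding = arbiter(pt, rnd)
            log.transcript.append(f"round {rnd}: {pt.claim_id} -> {finding.verdict}")
            log.findings.append(finding)
    return log
```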
The potential benefits include increased consistency, immediate logic checks, and a searchable audit trail. A neutral arbiter may also lower the social costs (“face costs”) for reviewers and authors to admit errors. However, challenges include model variance, domain gaps (e.g., in mathematics-heavy areas), and the need for careful governance.
8 Addressing Collusion and Identity Fraud
The integrity of the peer review process is threatened by various forms of manipulation. Documented issues include collusion rings (where authors agree to review each other’s papers favourably) and targeted assignment gaming (Littman 2021). Identity theft of reviewer accounts has also become a concern; a recent report uncovered 94 fake reviewer profiles on OpenReview (Nihar B. Shah 2025).
Several design features can help mitigate these risks:
- Randomness in assignments significantly reduces the probability that targeted attempts to influence assignments will succeed (Jecmen et al. 2020; Y. E. Xu et al. 2024).
- Cycle-free assignment constraints prohibit simple reciprocal arrangements (e.g., cycles of length ≥ 2 among author-reviewer pairs) (Boehmer, Bredereck, and Nichterlein 2022); a minimal screening check is sketched after this list.
- Enhanced identity verification for reviewers (e.g., ORCID, verifiable employment, or one-time checks) can help combat fake profiles.
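A toy screen for the simplest reciprocal pattern, assuming we already have a proposed assignment and an author list (enforcing cycle-freeness inside the assignment solver, as Boehmer et al. do, is harder):

```python
import networkx as nx

# Toy data: who reviews which papers, and who wrote them.
assignment = {"alice": ["p2"], "bob": ["p1"], "carol": ["p3"]}
authors = {"p1": ["alice"], "p2": ["bob"], "p3": ["dave"]}

G = nx.DiGraph()
for reviewer, papers in assignment.items():
    for paper in papers:
        for author in authors[paper]:
            G.add_edge(reviewer, author)   # edge: reviewer reviews one of author's papers

print(list(nx.simple_cycles(G)))           # flags the alice <-> bob reciprocal pair
```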
More advanced proposals include pseudonymous persistent reviewer IDs with public reputation scores, linked to admission-control credits, although these are more complex to implement.
9 Anonymity and Bias
The debate over anonymity in peer review (single-blind vs. double-blind vs. open review) is ongoing, with various trade-offs well-known by now:
- Empirical evidence suggests that double-blind review can reduce prestige bias in some settings (Okike et al. 2016).
- Open review (signed) has mixed evidence regarding its impact on review quality and can reduce reviewer willingness to participate (van Rooyen et al. 1999).
- Concerns about herding in discussions (where early comments overly influence later ones) were not supported by a large randomized controlled trial at ICML, which found no evidence that the first discussant’s stance determined the outcome (Stelmakh, Rastogi, Shah, et al. 2023).
- Observational evidence suggests a potential citation bias, where reviewers may respond more favorably when their own work is cited by the authors (Stelmakh, Rastogi, Liu, et al. 2023).
Given these findings, a hybrid approach might work. We could maintain double-blind conditions during initial reviews, then reveal identities after initial scores are submitted. This “middle ground” aims to balance reducing bias with the benefits of contextual reading and COI checks (Nihar B. Shah 2025).
10 Calibration and aggregation of ratings
A persistent challenge in evaluating reviews is that reviewers use scoring rubrics (i.e. scores from 1 to 10 or whatever) differently. Although reviewers all use the same nominal scale when assessing a paper, their scores are often not well calibrated to one another. Some reviewers may be systematically harsher or more lenient. This is a familiar problem in survey methodology, where respondents’ scores are, at best, internally consistent.
Simply averaging the scores can be problematic, as it “bakes in” this miscalibration. Furthermore, as we know from social choice theory, simple averaging may not be the optimal way to aggregate different perspectives. The literature suggests several approaches that might improve the aggregation process:
- Community-learned aggregation: Instead of assuming a fixed relationship between aspect scores (e.g., novelty, rigour) and the overall recommendation, we can learn a mapping (e.g., L(1,1)-style) that reflects the community’s preferences. This helps reduce the impact of individual reviewers’ idiosyncratic weighting schemes (Noothigattu, Shah, and Procaccia 2021).
- Calibration with confidence: Asking reviewers to provide not just a score but also a confidence level can help in calibration. By analyzing confidence across the panel graph, it may be possible to adjust for differences in scale usage and leniency (MacKay et al. 2017).
- Least-squares calibration: This framework offers a flexible approach to correcting for bias and noise beyond simple linear miscalibration (Tan et al. 2021). TODO: revisit this one — it sounds fun.
A simpler approach might involve collecting confidence signals and learning a post-hoc transformation to map each reviewer’s scores to a shared scale.
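Here is a deliberately crude version of that post-hoc transformation: an additive model (paper quality plus reviewer severity) fitted by least squares. The cited frameworks are considerably more sophisticated; this just shows the shape of the correction.

```python
import numpy as np

# (paper index, reviewer index, score): illustrative data in which reviewer 1 is harsh.
reviews = [(0, 0, 7), (0, 1, 5), (1, 1, 4), (1, 2, 6), (2, 0, 8), (2, 2, 8)]
n_papers, n_reviewers = 3, 3

# Design matrix: one column per paper quality, one per reviewer bias.
X = np.zeros((len(reviews), n_papers + n_reviewers))
y = np.zeros(len(reviews))
for i, (p, r, s) in enumerate(reviews):
    X[i, p] = 1
    X[i, n_papers + r] = 1
    y[i] = s

# Pin the mean reviewer bias to zero so the model is identifiable.
X = np.vstack([X, np.concatenate([np.zeros(n_papers), np.ones(n_reviewers)])])
y = np.append(y, 0.0)

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
quality, bias = theta[:n_papers], theta[n_papers:]
print("calibrated paper scores:", np.round(quality, 2))
print("reviewer severity offsets:", np.round(bias, 2))
```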
11 Eliciting Truth without Ground Truth
Rewarding high-quality reviews is difficult when there is no objective “ground truth” against which to measure them. Peer-prediction mechanisms, such as the Bayesian Truth Serum (Prelec 2004) and Peer Truth Serum (Radanovic, Faltings, and Jurca 2016), offer a theoretical way to incentivize honest and effortful reports by comparing reviewers’ reports against each other. These mechanisms have inspired concrete designs for peer review, including proposals for auction-funded reviewer payments scored via peer prediction (Srinivasan and Morgenstern 2023).
However, these mechanisms introduce significant conceptual and UX complexity and can be brittle under conditions of collusion or correlated errors. For these reasons, they may be challenging to implement initially. A lighter alternative might be to combine forecast scoring with post-hoc audits to generate observable outcomes.
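For flavour, here is a stripped-down peer-truth-serum-style payoff: agreement with a randomly chosen peer is rewarded in inverse proportion to how common the answer is, so parroting the obvious consensus pays less than informative agreement. The real mechanisms add normalisations and incentive-compatibility conditions that I am glossing over here.

```python
import random
from collections import Counter

# Categorical judgements from three reviewers (illustrative).
reports = {"rev_a": "major_flaw", "rev_b": "major_flaw", "rev_c": "sound"}

freq = Counter(reports.values())
prior = {k: v / len(reports) for k, v in freq.items()}   # crude empirical answer frequencies

def payoff(reviewer: str, rng: random.Random) -> float:
    """Reward agreement with a random peer, scaled by how rare the shared answer is."""
    peers = [r for r in reports if r != reviewer]
    peer = rng.choice(peers)
    agree = reports[reviewer] == reports[peer]
    return (1.0 / prior[reports[reviewer]]) - 1.0 if agree else -1.0

rng = random.Random(0)
for rev in reports:
    print(rev, round(payoff(rev, rng), 2))
```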
12 Automated Screening and Checks
Automated tools can supplement the human review process by screening for common issues.
- Statistical auditing: Tools like Statcheck can automatically parse manuscripts and flag basic statistical inconsistencies; a minimal version of such a check is sketched after this list. A natural experiment suggests that implementing such checks in the workflow can lead to large reductions in reporting errors (Nuijten and Wicherts 2024).
- Reproducibility checks: For computational work, journals can require authors to submit an “auditability pack” (e.g., environment details, code, tests) and run basic execution checks upon submission.
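The core of such a check fits in a few lines: re-derive the p-value implied by a reported test statistic and flag mismatches. The regex and tolerance below are illustrative, not Statcheck’s actual rules.

```python
import re
from scipy import stats

text = "The effect was significant, t(28) = 2.10, p = .030."

# Pull out degrees of freedom, test statistic, and reported p-value.
match = re.search(r"t\((\d+)\)\s*=\s*(-?\d*\.?\d+),\s*p\s*=\s*(\d*\.?\d+)", text)
df, t_val, p_reported = int(match.group(1)), float(match.group(2)), float(match.group(3))

p_recomputed = 2 * stats.t.sf(abs(t_val), df)   # two-sided p from the t distribution
if abs(p_recomputed - p_reported) > 0.005:      # crude tolerance for rounding
    print(f"flag for editor: reported p={p_reported}, recomputed p={p_recomputed:.3f}")
else:
    print("consistent")
```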
It’s (morally speaking) important to treat the output of these tools as a triage signal for editors and reviewers, rather than a basis for automatic rejection, as false positives are expected. Whether these tools produce fewer false positives than humans is an empirical question.
13 Designing an Experimental Journal
Thought experiment: what would we do if we launched a hypothetical new green-field journal, assuming time and money were no object?
Translating these theoretical proposals and experimental findings into the design of a new journal requires careful consideration of implementation costs and the specific goals of the journal. A pragmatic approach might involve layering several interventions that are compatible with a small editorial team and limited budget, and running incremental experiments to evaluate their impact.
Based on the literature surveyed, several components appear promising for an initial pilot:
- Incentive Alignment via Admission Control: Implementing a review credit system to link author submissions with review contributions. Using stochastic penalties (e.g., heavier scrutiny for non-compliant authors) rather than hard bans is crucial when effort measurement is noisy; a toy credit ledger is sketched after this list.
- Robust and Fair Assignment: Adopting randomized assignment methods with a fairness objective can improve robustness to manipulation and ensure equitable distribution of review quality (Stelmakh, Shah, and Singh 2021; Y. E. Xu et al. 2024). This requires a one-time engineering effort.
- Improved Evaluation Metrics: Moving beyond simple averaging by incorporating calibrated aggregation and forecast questions with proper scoring rules. Collecting aspect scores and confidence levels allows for post-hoc calibration (Noothigattu, Shah, and Procaccia 2021; MacKay et al. 2017). Forecasts can be rewarded with review credits (converted to benefits like fee waivers) rather than cash.
- Hybrid Anonymity: Employing a “middle-ground reveal” —blind through initial reviews, then revealing identities—to balance bias reduction with contextual checks (Nihar B. Shah 2025).
- Supplemental Checks: Integrating automated triage for statistical or reproducibility checks (Nuijten and Wicherts 2024). Potentially introducing a bounded AI arbitration step to assist meta-reviewers in resolving specific disputes (Allen-Zhu and Xu 2025).
- Anti-Collusion Measures: Enforcing cycle-free constraints and randomized assignments (Jecmen et al. 2020; Boehmer, Bredereck, and Nichterlein 2022).
- Cash Payments for Reviews: Direct cash payments or cryptocurrency mechanisms (the “DOGE 2.0” vision of Allen-Zhu and Xu 2025) involve complex accounting and policy issues. Starting with a credits-for-benefits system sounds more practical, but cash is immediately understandable and fungible, so it might be worth piloting in a small setting.
- Lotteries for Acceptance: Pure lotteries for acceptance, while used by some funding agencies with mixed results (Feliciani, Luo, and Shankar 2024; M. Liu et al. 2020), may be controversial for journals. Improving incentives and calibration should probably take precedence, although a lottery within a narrow “tie band” might be worthwhile.
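For the admission-control item above, here is one toy shape a credit ledger with stochastic penalties could take; every threshold and probability is invented for illustration.

```python
import random

CREDITS_PER_REVIEW = 1.0
CREDITS_PER_SUBMISSION = 2.0

def extra_scrutiny_probability(balance: float) -> float:
    """Smoothly raise scrutiny as the credit balance falls below the submission cost."""
    deficit = max(0.0, CREDITS_PER_SUBMISSION - balance)
    return min(0.9, 0.15 + 0.3 * deficit)

def on_submission(balance: float, rng: random.Random) -> tuple[float, bool]:
    """Debit the submission and stochastically flag low-credit authors for extra scrutiny."""
    flagged = rng.random() < extra_scrutiny_probability(balance)
    return balance - CREDITS_PER_SUBMISSION, flagged

rng = random.Random(0)
balance = 2 * CREDITS_PER_REVIEW            # author has completed two reviews
balance, flagged = on_submission(balance, rng)
print(balance, "extra scrutiny" if flagged else "standard track")
```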
14 Incoming
- Matthew Feeney, Markets in fact-checking
- Transparent Peer Review: A New Era for Scientific Publishing | The Scientist
- 2024 Conference – NeurIPS Blog
- CS Paper Reviews — a tool to review papers and increase our chances of acceptance
- Short Pieces on Reviewing — EMNLP 2022
- Saloni Dattani, Real peer review has never been tried
- Matt Clancy, What does peer review know?
- Adam Mastroianni, The rise and fall of peer review
- The Myth of the Expert Reviewer
- Science and the Dumpster Fire | Elements of Evolutionary Anthropology
- F1000Research | Open Access Publishing Platform | Beyond a Research Journal
F1000Research is an Open Research publishing platform for scientists, scholars and clinicians offering rapid publication of articles and other research outputs without editorial bias. All articles benefit from transparent peer review and editorial guidance on making all source data openly available.
- Reviewing is a Contract — Rieck on the social expectations of reviewing and chairing.
- Jocelynn Pearl proposes some fun ideas — including blockchain-y ideas — in Time for a Change: How Scientific Publishing is Changing For The Better.
- The Black Spatula Project — Steve Newman
A 10 page paper caused a panic because of a math error. I was curious if AI would spot the error by just prompting: “carefully check the math in this paper”, especially as the info is not in training data.
o1 gets it in a single shot. Should AI checks be standard in science?
Repository: nick-gibb/black-spatula-project: Verifying scientific papers using LLMs