Garbled highlights from NeurIPS 2025
2025-12-03 — 2025-12-10
Wherein attendance at NeurIPS in San Diego is recorded, US and Chinese paper counts are noted to be nearly equal, a Post‑AGI workshop on economics is summarised, and a bespoke semantic paper‑search tool is described.
I’m at NeurIPS in San Diego this year to present our paper (Davies et al. 2025), and I’ve attended a number of side events and workshops as well. Here are my impressions.
1 Vibes
Previously, I’ve voiced opinions on flagship machine learning and artificial intelligence conferences. To recap where we are — flagship conferences like NeurIPS, ICLR and ICML serve, in my gradually evolving view, several purposes that are increasingly in tension with each other:
- NeurIPS is a high-prestige venue that winnows interesting algorithms from the community, through its tightly engineered review process and—compared to classical science—its unusual focus on benchmarking, and also…
- a massive job fair and recruitment show for the tech industry, and also…
- a ledger of points scored in computer science careers, now baked into promotion and hiring systems at universities worldwide, and…
- a geopolitical pissing contest between nations that want to compete in machine learning
Researchers often say that it would be more fun and more productive to have several smaller, more focused conferences, but the incentives above work against that.
NeurIPS is a great place to take a vibe check of the field of machine learning and artificial intelligence.
Key question to ask at this particular AI event: Where are we on the hype cycle? The usual difficulty of assessing “hype” claims applies, which is to say: if I want to claim something is overhyped, I have to know how high everyone’s expectations should be and what everyone’s expectations actually are. These are both individually hard to assess. As far as I can tell, almost everyone on the planet has expectations by now, often deeply informed ones, and I can’t possibly ask them all. And as for what the expectations should be — that is, of course, naked speculation at the very frontier of what’s knowable, so certainty will be elusive.
Approximately speaking, I agree with Yann LeCun: we are in an LLM bubble, in the sense that just naïvely scaling existing LLMs by pre-training plus RLHF seems to have run out of juice and will not get us to the moon. It looks like we cannot simply take stuff off the shelf, “go bigger”, and get better results.
There’s a broader question about whether we’re in an AI bubble: Do there exist no possible new tricks that will take generative AI all the way to unambiguous superintelligence? Or if such tricks exist, will we not find them soon, and will the current financial markets—hungry for quick wins—slump when massive investments fail to produce massive returns, and then we will fail to find them because we will all be broke and demoralized?
It’s possible we could both be in a bubble (in the sense that people are excited about bad ideas and most frothy AI startups are ultimately doomed) and also not in a bubble (in the sense that some projects might be so vastly underhyped that their eventual wins outweigh all the waste and bullshit). This depends on what you mean by “bubble”, and what you mean by “AI”.
Of course, I can’t know what’s waiting around the corner in the world of innovation. One thing that seems clear is that we are in the middle of a moonshot. If I go to NeurIPS and look around — who’s presenting and who’s not, who seemed promising at previous conferences but has vanished into a large tech firm to labour in secret — I infer a growing mass of research dark matter, an unseen bolus of methods and knowledge we can only detect indirectly. Watch the OpenAI lanyard-bearers perturbing the observable researchers like axions amongst baryons.
An unmatched number of the greatest and most passionate minds of our planetary civilisation have been recruited to a single goal, resourced like never before, and given the mission, scope and creativity to deliver at a scale humanity has never seen. New methods in active learning, reinforcement learning, continual learning and other areas that might help turn today’s LLMs into tomorrow’s AGIs are being pursued, developed and resourced.
We should probably have the same attitude to the ASI moonshot as we should have had to other well-resourced, ambitious, nearly impossible goals, such as that first literal moonshot: I don’t know if we can do it, but if we can, this is how.
This moonshot differs from the original namesake in many ways, obviously: Notably, it’s not a unified national project, at least not in the West, but rather a competitive, secretive, capitalist endeavour, or, really, a collection of factionally competing endeavours. That doesn’t mean it isn’t of comparable scale. The great project of this generation is to throw a javelin at the sun, and it might be a few javelins.
It’s easy to think we will fail because society has grown used to being unable to accomplish big things, in the face of the powerlessness we experience day-to-day. We, in the West, are used to the idea that we cannot coordinate on any big thing — on climate change, on pandemics, on basic infrastructure, on government services, on curing cancer… Working in universities or other large bureaucracies, we might easily acquire a vague feeling of general defeat, a diminished sense of agency, a weakened sense of our own ability to accomplish things as a society. We’re perhaps too accustomed to government systems that have been hamstrung by corporate or lobbyist capture, by pointless proceduralism, austerity, general congestion, and endless squabbles over dwindling reserves of agency. While we in the Western democracies suffer great malaise and plough our capacity into recrimination, tech firms swim upstream. The societal malaise is not a major part of their calculations for now. They have enough lobbyists to keep themselves safe from most political headwinds, and they have enough money to throw at the problem that they can mostly ignore the rest of the world’s dysfunction. The AI ecosystem is enjoying a rare moment of high agency, high alignment, concerted effort and rapid, unconstrained execution. We, humanity, will not fail this task for lack of effective action.
Now, we might fail in this task because it is literally impossible. It boils down to faith, for now, whether we believe synthetic superintelligence is possible. Maybe tricksy evolution or a cunning Creator imbued human meat with a divine spark that cannot be captured by any mathematics that silicon can compute. Maybe no gigawatt data-centre full of straining GPUs can ever match the fatty mass of my brain, no matter how many petaflops it can muster, no matter how subtly they are orchestrated, no matter that it has a vast store of knowledge of every intimate detail of the world and access to the vast libraries of all that has been written, and thousands of person-years of researchers and engineers of great genius, racing towards the greatest prize in history. Maybe we are protected by some variant of the omnipotence paradox, where we are so smart that we cannot make something smarter than ourselves, or, perhaps, so dumb that we cannot make something smarter than ourselves. Let us place bets about it.
I’m not promising that any particular two‑bit huckster’s vanity AI startup is going anywhere — I have met a number of founders and CEOs of AI startups at this conference, and let me tell you, there are some terrible ideas. Nor would I promise we won’t tank the economy before the next step‑change discovery that surpasses human cognition — if such a discovery occurs at all. I am, above all, not promising that it would be good for us to achieve artificial superintelligence — rather, I think that, were it possible, it would be the most dangerous thing we could possibly do.
However, at this conference, overshadowed by the juggernaut tech industry that our field created, I think we all intuit, to some degree, that the ASI moonshot is the main game, and the rest of us are in the wake of that rocket ship, whatever it may crash into. There’s a tacit sense that the computers and their corporations are coming for everything human, and that all our work is sublimating into this final push to build a new god. We are waiting for thunder from the silent zone.
For many of us, that is. Some people are just publishing normal papers. Don’t ask me the ratios.
Within that broad picture there is a lot of interesting texture. The AI moonshot is playing out differently in different geopolitical contexts. We see massive returns to scale in US corporations and Chinese universities.
In the US, big tech firms are driving the effort, with universities playing a secondary role; state-backed, publicly funded research labs are gradually being starved of funding.
In China, open research from universities seems to be playing a much larger role. If we have laboured under the incorrect impression that Tsinghua or Peking University were second-tier research institutions, it’s time to reassess. These universities may in fact be the leading AI research institutions on the planet now, and certainly the biggest universities in this space. If we thought that China was more secretive or inscrutable than the “open” societies of the West, we’ll find that hard to prove by looking at the list of public-good research outputs.
Note also that, overall, China and the US are neck-and-neck in paper counts at NeurIPS this year. The US was way ahead only last year; by any plausible extrapolation, China is likely to be thoroughly ascendant next year. This was probably the final year the US came out on top in that pissing contest. Second place from here on out.
In the rest of the world, the picture is complicated and resists easy slogans. Mid-size countries can punch above or below their weight, but they aren’t going to lead the world. Some places are doing great for their size — Switzerland, Singapore, UAE, Israel, Korea, Canada. The UK and Germany were present, but look a little more marginal given their populations and histories of research excellence. Australia is still hanging in there, which is remarkable given the parlous state of science in Australia.
It’s not all publication records. Venture capital is still around. VCs are sniffing around at lunch. The big tech firms are still gobbling up talent. The startups are still running parties and mixers. Early-stage founders still give me brief demos of technologies that look set to render whole economic sectors irrelevant and reshape the modern capital-labour settlement entirely. Even if no methodological breakthroughs happen, the economic and social transformations wrought by scaling up LLMs and their ilk will keep grinding on and leave us with a world unrecognizable on some timescale, if these firms have their way.
Strap in, dear reader.
1.1 Catering
The coffee is still swill and the food is still over-sweetened, highly processed and totally free from dietary fibre.
1.2 Location
I have not been to many American cities, but I’ve learned to steel myself for traffic jams, filth and constant pharmaceutical advertising.
San Diego is… surprisingly nice. Streets are clean, public transport is affordable and high-frequency, people are generally chill, parks are nice and the homeless folks are pretty friendly.
2 Post-AGI workshop
The Post-AGI Workshop: Economics, Culture and Governance was amazing. Thanks David Duvenaud, Stephen Casper, Maria Kostylew, Raymond Douglas, Jan Kulveit, Olivia DiGiuseppe, and Veronika Nývltová for an incredible workshop.
Mission description:
This workshop is an attempt to make sense of how the world might look after the development of AGI, and what we should do about it now. We make two assumptions — that AI will eventually be transformative, and that advanced AI will be controllable or alignable — and then ask: what next?
I only had time to take notes from some sessions — it was an intense day — but everything I’m mentioning will be worth watching on YouTube when the talks are posted; I rarely make that recommendation.
2.1 Economics of Transformative AI
Anton Korinek gave us a magisterial presentation on economic scenarios after AGI.
2.2 Modern AI Is Optimized for Political Control
Fazl Barez gave a fascinating talk on AI and how effective it can be at controlling populations.
I couldn’t find published work from him in this area, but it’s worth watching Lowe et al. (2025).
2.3 What would UBI actually entail?
Anna Yelizarova made many points, but the one that stuck with me was that UBI seems to require governments to effectively own a stake in major AI firms rather than simply tax them — and that isn’t the model in most places.
2.4 When does competition lead to recognizable values?
Beren Millidge: This talk made me feel incredibly seen. Beren had many abstractions that articulated things I’ve been thinking about recently. I’ll post his slides when they’re available.
2.5 Concrete mechanisms for slow loss of control
Deger Turan from Metaculus tried to tie structured prediction markets to conditional scenario planning — interesting, if it works (pace dynomight). I couldn’t find the methodology with a quick search, but see Figure 3 for an example.
2.6 Supercooperation as an alternative to Superintelligence
Ivan Vendrov: What if benign intelligence were something like civitech on speed? The talk was by a guy whose many credits include founding fractalNYC.
2.7 Cyborg Leviathans and Human Niche Construction
Anders Sandberg: “Ethics is more or less bunk” — but trust is necessary. Interesting connection to empirical moral foundations.
2.8 More panels and talks but I ran out of time to take notes
- Kunal Handa
- Atoosa Kasirzadeh
- William MacAskill (“everyone should be allocated a share of the total solar output”)
- Richard Ngo
- Rif A. Saurous
- Jan Kulveit
- Rishi Bommasani
- Nick Bostrom
- Iason Gabriel
- William Cunningham
3 Finding papers
There are two first-party NeurIPS paper search engines, each of which sucks differently.
Dissatisfied with both, I built a quick semantic search while I was waiting in line at the local pharmacy for my flu vaccination; it’s better than either. We can find it here if we want to install it on our machines: danmackinlay/openreview_finder. It takes about 10 minutes to download and index the NeurIPS 2025 papers. It’s a few hundred megabytes, so I haven’t deployed it on the open internet; that’s left as an exercise for a student.
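For the curious, the core of a tool like this is small. Below is a minimal sketch of the kind of semantic search involved (not the actual internals of openreview_finder), assuming the paper metadata has already been fetched into a hypothetical `papers` list and that sentence-transformers is available; the embedding model named here is just one reasonable choice.

```python
# Minimal semantic-search sketch (illustrative; not openreview_finder's real code).
# Assumes `papers` has already been fetched from OpenReview into a list of dicts
# with "title" and "abstract" keys -- a hypothetical structure for this example.
import numpy as np
from sentence_transformers import SentenceTransformer

papers = [
    {"title": "Example paper A", "abstract": "Reinforcement learning for theorem proving."},
    {"title": "Example paper B", "abstract": "Causal discovery with latent confounders."},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works
corpus = [p["title"] + ". " + p["abstract"] for p in papers]
# Normalised embeddings make cosine similarity a plain dot product.
emb = encoder.encode(corpus, normalize_embeddings=True)

def search(query: str, k: int = 5):
    """Return the k papers most semantically similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = emb @ q
    top = np.argsort(-scores)[:k]
    return [(papers[i]["title"], float(scores[i])) for i in top]

print(search("active causal discovery"))
```

A real version also needs to pull the metadata from the OpenReview API and persist the index to disk, which is presumably where the ten minutes and the few hundred megabytes go.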
Another competitor in the space is Paper Lantern, which did much more handbilling than I did and has a pretty good search. I think it’s about as good as mine in result quality, but its UX is better. OTOH, mine is open source and spyware‑free, and easy to tweak to our taste.
4 Finding people
There’s an app. It’s not a good app. We should just email the authors of papers we like. Corollary: Email me if you like my papers, or generally wish to chat. I like to chat.
5 Main track
My current conference policy is that any panel, talk, tutorial, presentation or keynote is pure deadweight loss. If I’m sitting and someone is talking at me, I should get out before I waste any more time. Exception: It is OK to visit a talk if I want a dark room to nap in.
Otherwise, going to one of those events at the actual conference takes time away from meeting people and talking about science. If I find any such non-interactive event worth attending, it’s also worth watching the recording later, where I can play it back at 1.5× speed in the boring bits and pause when I get confused. As such, I did not attend any main track events apart from poster sessions; I’ll watch the good ones next week online.
Highlights from posters below.
5.1 Bleeding-edge Reinforcement learning
RL for theorem proving and hierarchical RL. There’s a lot of hype and work here, and a conspicuous absence of several big players, so I suspect some big tech firms are making ASI plays.
- Beechey and Şimşek (2025)
- Chemingui et al. (2025)
- Fiskus and Shaham (2025)
- Korkmaz (2025)
- Li, Zrnic, and Candes (2025)
- Walder and Karkhanis (2025) — suspiciously useful work from DeepMind; how did this escape the lab? Anyway, nice to see that Christian still emerges from seclusion.
- J. L. Zhou and Kao (2025)
- Lamont et al. (2025) / sean-lamont/3D-Prover
5.2 Causality
There’s a lot happening in this space, which is still my area.
I’m particularly excited about approaches that do active causal discovery and causal representation learning without assuming a fully observable world. Ideally they don’t need a full causal graph at all and can work with partial, latent, or stochastic causal graphs. There continues to be interesting progress in causal ordering and causal abstraction approaches that do just that.
5.3 Multi-agent systems and games
This area has a lot of action, but less than it should, IMO, given how important multi-agent systems are to real-world intelligence (speaking as a real-world intelligence myself).
5.4 Galaxy Brained
I serendipitously discovered the following very cosmic papers; they contain so much left-field weirdness that they resist quick summary and defy categorization. I think they might be on to something. Or, more precisely, several different things, not least of which is dragging NeurIPS back to its trippy weirdo roots.
6 Workshops
Interesting workshops listed below.
The trick with a good workshop seems to be having a tight theme, a good set of organizers, and, crucially, topic matter which is not irrelevant and yet not so hyped that all the good content has already been hoovered into some large tech firm’s R&D lab. As such, the best workshops are not quite hip, but still interesting, and have a little outsider energy.
Workshops are very high bandwidth, so I won’t have a lot of time for details about each; my brain will be busy asking PhD students curveball questions about their papers.
The workshop descriptions were written with an LLM — beware.
6.1 Saturday, 2025/12/06
Structured Probabilistic Inference & Generative Modeling (SPIGM)
Focused on the intersection of probabilistic methods and large foundation models, this workshop explores how to handle highly structured data across fields like computer vision and the natural sciences. It brings together experts to discuss the challenges of encoding domain knowledge and performing inference in complex, generative settings.
There was an amazing poster session with lots of neat ideas. Check it out.
I don’t even have an LLM summary for this — it was such a dark horse, and I wandered in by accident. Let me quote from the site:
Dynamical systems have played an important role in the analysis and design of algorithms. Ideas ranging from variational methods, differential and symplectic geometry, numerical analysis, and control theory have paved the way for establishing non-asymptotic convergence guarantees in optimization, sampling, and equilibrium computation in games. Yet, the distinct mathematical backbone of these tools often creates barriers to entry for researchers and practitioners in machine learning.
This was amazing. We should definitely read the proceedings to stay ahead; I suspect much of this will recur at future conferences.
Machine Learning and the Physical Sciences (ML4PS)
This long-running workshop serves as a hub for research at the nexus of machine learning and the physical sciences, covering both ML applications to scientific problems and physics-inspired improvements to ML models. This year’s program highlights the interplay between academia and industry in driving fundamental research and “big science” innovations.
A solid, reliable workshop in an area I’d argue is no longer “fringe” but that has arrived, even if it isn’t evenly distributed yet. This year wasn’t mind‑blowing, but I saw strong progress in applied work on important problems. Watch this space; designing new drugs, machines, and industrial processes will make real differences to people’s lives.
CauScien: Uncovering Causality in Science
This workshop aims to bridge the gap between theoretical advances in causal learning and their practical application in scientific domains like ecology, biology, and social science. It focuses on fostering collaborations between ML researchers and domain experts to address how causal inference can effectively accelerate real-world scientific discovery.
I was pumped for this one — it’s my research area. Some papers looked promising, but the overall signal-to-noise ratio wasn’t high enough for the problems I’m trying to solve.
Reliable ML from Unreliable Data
This workshop tackles the problem of building robust machine learning systems when training data is missing, corrupted, or strategically manipulated. It bridges theory and practice to address challenges such as distribution shift, adversarial robustness, and strategic behavior in socio-technical systems.
Once again, great topic, but I don’t see quite enough signal here to match my interests.
Algorithmic Collective Action (ACA)
ACA investigates how participants in socio-technical systems can use collective strategies to shape AI development and outcomes from the “bottom up.” The event invites interdisciplinary research from machine learning, economics, and the humanities to understand how coordinated user efforts can steer algorithms toward the common good.
I’m passionate about this area, but the workshop didn’t quite deliver for me. I’d rather go to a civitech meetup for civic tech content. That said, I’m grateful to have learned about Saiph Savage; their work looks useful, so I’ve bookmarked it.
6.2 Sunday, 2025/12/07
The 3rd Workshop on Regulatable ML
The main goal of this workshop is to bridge the gap between state-of-the-art ML safety/security research and evolving regulatory frameworks.
Another dark-horse, surprisingly good workshop — it dug into the gap between how we benchmark AI systems and run safety evals, and how regulators want (or don’t want) to regulate them.
It turns out human societies are complicated, AI is too, and combining them is no simpler than the sum of their parts.
CogInterp: Interpreting cognition in deep learning models
CogInterp bridges the gap between cognitive science and AI interpretability by using cognitive frameworks to describe the internal processes of deep learning models. The goal is to understand not just what tasks a model can perform, but the specific “mental” algorithms and intermediate representations it uses to achieve them.
Tackling Climate Change with Machine Learning
This workshop highlights machine learning innovations that can reduce greenhouse gas emissions and support societal adaptation to the effects of climate change. It facilitates collaboration between ML researchers and climate experts to solve high-impact problems ranging from energy efficiency to advanced climate modeling.
Frontiers of Probabilistic Inference (FPI)
FPI connects classical statistical sampling with modern machine learning techniques to create scalable and data-efficient inference methods. Discussions will cover theoretical perspectives, connections to generative models, and practical applications across the natural sciences and large language models.
Great workshop — there’s a lot of overlap with SPIGM, so my review applies to both.
Workshop on Learning to Sense (L2S)
The Learning to Sense workshop focuses on the joint optimization of sensor hardware and machine learning models to replace traditional, hand-crafted data acquisition pipelines. It invites research on topics such as learnable sensor layouts and raw-to-task approaches that enhance performance for specific downstream applications.
I didn’t make it — I ran out of time.
Constrained Optimization for Machine Learning
This workshop examines constrained optimization as a principled method for embedding safety, fairness, and robustness requirements directly into machine learning training. It brings together experts to tackle the difficulties of applying these constraints to complex, large-scale, and stochastic deep learning systems.
I didn’t make it.
Symmetry and Geometry in Neural Representations (NeurReps)
NeurReps explores how principles of geometry, topology, and symmetry can unify our understanding of representations in both biological brains and artificial neural networks. The workshop fosters dialogue between mathematicians, neuroscientists, and ML researchers to uncover mathematical structures that lead to robust and interpretable learning.
I didn’t make it to the workshop.
Workshop on Continual and Compatible Foundation Model Updates (CCFM)
This workshop addresses the critical challenge of keeping foundation models up-to-date without incurring the high costs of full retraining. It explores methods for frequent, compatible updates that minimize catastrophic forgetting while maintaining a consistent user experience.
I didn’t make it.
7 Eval Eval workshop
A workshop on AI evals, the 2025 Workshop on Evaluating AI in Practice, was hosted by a coalition of Eval Eval, UK AISI and UC San Diego (UCSD).
I hadn’t realized how the study of AI evals had coalesced into something distinct from benchmarking. Notably, it presents itself as much more like classical science than the benchmarking world does: it treats AI models as observable phenomena that can be studied in their own right and analyzed with RCTs, standard experimental methods, and statistical techniques.
From the coalition’s own description:
Our NeurIPS 2024 “Evaluating Evaluations” workshop kicked off the latest work streams, highlighting tiny paper submissions that provoked and advanced the state of evaluations. Our original framework for establishing categories of social impact is now published as a chapter in the Oxford Handbook on Generative AI, and we hope it will guide standards for Broader Impact analyses. We’ve previously hosted a series of workshops to hone the framework.
Hosted by Hugging Face, University of Edinburgh, and EleutherAI, this cross‑sector and interdisciplinary coalition operates across three working groups: Research, Infrastructure and Organization.
Stella Biderman of EleutherAI gave the keynote, which introduced me to several papers by her and her team (Biderman, Prashanth, et al. 2023; Biderman, Schoelkopf, et al. 2023; Wal et al. 2024) and to Lesci et al. (2024), which served as a rebuttal to Biderman’s own devinterp work.
The family of pre-trained models EleutherAI created is remarkably dense (many sizes, many training checkpoints per model). See EleutherAI/pythia: The hub for EleutherAI’s work on interpretability and learning dynamics.
How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies including novel results in memorization, term frequency effects on few-shot arithmetic performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics.
Cozmin Ududec introduced us to UKGovernmentBEIS/hibayes (Luettgau et al. 2025), a convenience wrapper for Bayesian GLM analysis of behaviour in LLMs.
There’s a lot of convenience here.
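To give a flavour of what that kind of analysis looks like, here is a minimal sketch of a Bayesian hierarchical GLM over per-item eval correctness, written directly in PyMC on toy data. I haven’t checked hibayes’ actual API, so this illustrates the modelling idea rather than the package itself.

```python
# Sketch of a Bayesian GLM over eval results (illustrative; not hibayes' API).
# Toy data: per-item correctness for three hypothetical models on 200 items each.
import arviz as az
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n_models, n_items = 3, 200
model_idx = np.repeat(np.arange(n_models), n_items)
correct = rng.binomial(1, p=np.repeat([0.55, 0.60, 0.70], n_items))

with pm.Model() as glm:
    # Population-level intercept plus partially pooled per-model offsets.
    mu = pm.Normal("mu", 0.0, 1.5)
    sigma = pm.HalfNormal("sigma", 1.0)
    offset = pm.Normal("offset", 0.0, sigma, shape=n_models)
    logit_p = mu + offset[model_idx]
    pm.Bernoulli("obs", logit_p=logit_p, observed=correct)
    idata = pm.sample(1000, tune=1000, chains=2, target_accept=0.9)

# Posterior summaries: accuracy per model with uncertainty, not a single point score.
print(az.summary(idata, var_names=["mu", "offset"]))
```

The payoff relative to a bare leaderboard is that you get posterior uncertainty on each model’s accuracy (and on the differences between models), rather than a single number per model.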
8 Interesting people met
- Gautam Kamath graciously helped me navigate the ups and downs of starting a journal.
9 Interesting tech firms encountered
- Secondmind is building interesting AI for design.
- Harvey showed me an impressive demo of their LLM for legal work.


