Advice to pivot into AI Safety is likely miscalibrated
Aligning our advice about aligning AI
2025-09-28 — 2025-10-25
Wherein the AI‑safety career‑advice ecosystem is described, and its high failure tolerance and lack of mooring to ground truth or optimality are noted.
Assumed audience:
Mid career technical researchers considering moving into AI Safety research, career advisors in the EA/AI Safety space, AI Safety employers and grantmakers
Here’s a napkin model of the AI-safety career-advice economy, or rather, three models of increasing complexity. They sketch how advice can sincerely recommend gambles that mostly fail, and why—without better data—we can’t tell whether that failure rate is healthy (leading to impact at low cost) or wasteful (potentially destroying happiness and even impact). In other words, it’s hard to know whether our altruism is “effective”.
In AI Safety in particular, there’s an extra credibility risk that’s idiosyncratic to this kind of system. AI Safety is, loosely speaking, about managing the risks of badly aligned mechanisms producing perverse outcomes. As such, it’s particularly incumbent on our field to avoid badly aligned mechanisms that produce perverse outcomes; otherwise we aren’t taking our own risk model seriously.
In order to keep things simple, we ignore as many complexities as possible.
- We evaluate decisions in terms of cause impact, which we assume we can price in donation-equivalent dollars. This is a public good.
- Individual private goods are the candidate’s own gains and losses. We assume that their job satisfaction and remuneration (i.e. personal utility) can be costed in dollars.
1 Part A—Private career pivot decision
An uncertain career pivot is a gamble, so we model it the same way we model other gambles.
Alice is a senior software engineer in her mid-30s, making her a mid-career professional. She has been donating roughly 10% of her income to effective charities and now wonders whether to switch lanes entirely to achieve impact via technical AI safety work in one of those AI Safety jobs she’s seen advertised. She has saved six months of runway funds to explore AI-safety roles — research engineering, governance, or technical coordination. Each month out of work costs her foregone income and reduced career prospects. Her question is simple: Is this pivot worth the costs?
To build the model, Alice needs to estimate four things:
1.0.1 The Stakes: What’s the upside?
Annual Surplus (\(\Delta u\)): This is the key number. It’s the difference in Alice’s total annual utility between the new AI safety role (\(u_1\)) and her current baseline (\(u_0\)). This surplus combines the change in her salary and her impact—indirectly via donations and directly by doing some fancy AI safety job.
- \(u = w + \alpha(i+d)\), where \(w\) is wage, \(i\) is impact, \(d\) is donations, and \(\alpha\) is her personal weighting of impact versus consumption.
- \(\Delta u := u_1 - u_0\).
1.0.2 The Costs: What does it cost to try?
- Burn Rate (\(c\)): This is her net opportunity cost per year while on sabbatical (e.g., foregone pay, depleted savings), measured in k$/year.
- Runway (\(\ell\)): The maximum time she’s willing to try, in years.
1.0.3 The Odds: What are her chances?
- Application Rate (\(r\)): The number of distinct job opportunities she can apply for per year.
- Success Probability (\(p\)): Her average probability of getting an offer from a single application. We assume these are independent and identically distributed (i.i.d.).
The i.i.d. assumption (each job is independent) is likely optimistic. In reality, applications are correlated: if Alice is a good fit for one role, she’s likely a good fit for others (and vice-versa). We formalise this in the next section with candidate-quality distributions, which capture the idea that you don’t know your “ranking” in the field, but that, by definition, most people are not at the top of it.
1.0.4 The “Timer”: How to value future gains?
- Discount Rate (\(\rho\)): A continuous rate per year that captures her time preference. A higher \(\rho\) means she values immediate gains more—for example, if she expects short AGI timelines, \(\rho\) might be high.
1.1 Modeling the Sabbatical: The Decision Threshold
With these inputs, we can calculate the total expected value (EV) of her sabbatical gamble. The full derivation is in Appendix A, but here’s the result:
\[ \boxed{ \Delta \mathrm{EV}_\rho(p) =\frac{1-e^{-(r p+\rho)\ell}}{r p+\rho}\left(\frac{\Delta u r p}{\rho}-c\right). } \]
This formula looks complex, but its logic is simple. The entire decision hinges on the sign of the bracketed term: \[ \left(\frac{\Delta u r p}{\rho}-c\right) \] This is a direct comparison between the expected gain rate (the upside \(\Delta u\), multiplied by the success rate \(rp\), and adjusted for discounting \(1/\rho\)) and the burn rate (\(c\)). The prefactor scales that value according to the length of her runway and her discount rate.
The EV is positive if and only if the gain rate beats the burn rate. This means Alice’s decision boils down to a simple question: Is her per-application success probability, \(p\), high enough to make the gamble worthwhile?
We can find the exact break-even probability, \(p^*\), by setting the gain rate equal to the burn rate. This gives a much simpler formula for her decision threshold:
\[ \boxed{\,p^*=\frac{c\,\rho}{r\,\Delta u}\,}. \]
If Alice believes her actual \(p\) is greater than this \(p^*\), the pivot has a positive expected value. If \(p < p^*\), the sabbatical has negative expected value under this model, and she should not take it—at least not on expected-value grounds alone.
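Both formulas are cheap to sanity-check numerically. A minimal sketch in Python, using the worked-example parameters from the next section (\(\Delta u=22\), \(c=50\), \(\rho=1/3\), \(r=24\), \(\ell=0.5\)):

```python
import math

def delta_ev(p, du, r, rho, c, ell):
    """Boxed EV formula: discounted expected value of the sabbatical gamble."""
    lam = r * p + rho
    return (1 - math.exp(-lam * ell)) / lam * (du * r * p / rho - c)

def p_star(du, r, rho, c):
    """Break-even per-application success probability."""
    return c * rho / (r * du)

# Worked-example parameters: du=22 k$/yr surplus, c=50 k$/yr burn,
# rho=1/3 /yr discount, r=24 applications/yr, ell=0.5 yr runway.
threshold = p_star(22, 24, 1/3, 50)   # ≈ 0.0316
```

The EV is zero exactly at `threshold`, positive above it, and negative below it, matching the sign analysis above.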
1.2 What This Model Tells Alice
This simple threshold \(p^*\) gives us a clear way to think about her decision:
- The bar gets higher: The threshold \(p^*\) increases with higher costs (\(c\)) or shorter timelines or higher impatience (\(\rho\)). If her sabbatical is expensive or she’s in a hurry, she needs to be more confident of success.
- The bar gets lower: The threshold \(p^*\) decreases with more opportunities (\(r\)) or a higher upside (\(\Delta u\)). If the job offers a massive impact gain or she can apply to many roles, she can tolerate a lower chance of success on any single one.
- Runway doesn’t change the threshold: Notice that the runway length \(\ell\) isn’t in the \(p^*\) formula. A longer runway gives her more expected value (or loss) if she does take the gamble, but it doesn’t change the break-even probability itself.
- The results are fragile to uncertainty: This model is highly sensitive to her estimates. If she overestimates her potential impact (a high \(\Delta u\)) or underestimates her time preference (a low \(\rho\)), she’ll calculate a \(p^*\) that is artificially low, making the pivot look much safer than it is.1
- The key unknown: Even with a perfectly calculated \(p^*\), Alice still faces the hardest part: estimating her actual success probability, \(p\).
That \(p\) is, essentially, her chance of getting an offer. It depends not only on the number of jobs available but crucially on the number and quality of the other applicants.
All that said, this is a relatively “optimistic” model. If Alice attaches a high value to getting her hands dirty in AI safety work, she might be willing to accept a remarkably low \(p\); we’ll see that in the worked example. Hold that thought, though, because I’ll argue that this personal decision rule can be pretty bad at maximizing total impact.
If you are using these calculations for real, be aware that our heuristics likely overestimate Alice’s chances. Job applications are not i.i.d. The effective number of independent shots is lower than the raw application count, reducing the effective \(r\): if your skills don’t match the first job, they are less likely to match the second, because the jobs resemble each other.
1.3 Worked example
Let’s plug in some plausible representative numbers for Alice. She’s a successful software engineer taking home \(w_0=180\)k$/year, donating \(d_0=18\)k$/year post-tax, with no on-the-job impact \(\mathcal{I}_0=0\) (i.e. no net harm, no net good). A target role offers \(w_1=120\), \(d_1=0\) and \(\mathcal{I}_1=100\). Set \(\alpha=1\), runway \(\ell=0.5\) years, application rate \(r=24\)/year, discount \(\rho=1/3\), burn \(c=50\). Then \(\Delta u = (120+0+100)-(180+18+0)=22\) and \[ p^*=\frac{c\rho}{r\Delta u} = \frac{50\cdot\frac{1}{3}}{24\cdot 22} \approx \boxed{3.16\%}. \] Over 6 months, the chance of at least one success at \(p^*\) is \(q^*=1-e^{-rp^*\ell}\approx \boxed{31.5\%}\). Her expected actual sabbatical length is \(\mathbb{E}[\tau]=\frac{1-e^{-rp^*\ell}}{rp^*}\approx \mathbf{0.416}\ \text{years (≈5.0 months)}\), and, conditional on success, it’s \(\mathbb{E}[\tau\mid \text{success}]\approx \mathbf{0.234}\ \text{years (≈2.8 months)}\). Under these assumptions, the sabbatical breaks even exactly at \(p^*\): the job offers enough upside to compensate for a greater-than-even chance of failure.
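These numbers can be reproduced in a few lines. The conditional expectation uses the mean of an exponential arrival time truncated at \(\ell\), which is what the Poisson-offer model above implies:

```python
import math

r, ell = 24, 0.5
p_star = 50 * (1/3) / (24 * 22)    # break-even probability ≈ 3.16%
lam = r * p_star                   # offer arrival rate at p = p* (per year)

q_star = 1 - math.exp(-lam * ell)                  # ≈ 31.5% chance of an offer
e_tau = (1 - math.exp(-lam * ell)) / lam           # ≈ 0.416 yr expected length
# Truncated-exponential mean: sabbatical length given an offer arrives in time.
e_tau_success = 1/lam - ell * math.exp(-lam * ell) / (1 - math.exp(-lam * ell))
# ≈ 0.234 yr, conditional on success
```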
We can plot Alice’s expected value across a range of success probabilities to visualize the trade-offs for different upsides \(\Delta u\).
If we want to play around with the assumptions, check out the interactive Pivot EV Calculator (source at danmackinlay/career_pivot_calculator).
2 Part B — Field-level model
So far, this has been Alice’s private perspective. Let’s zoom out to the field level and consider: What if everyone followed Alice’s decision rule? Is the resulting number of applicants healthy for the field? What is the optimal number of people who should try to pivot?
2.1 From personal gambles to field strategy
Our goal is to move beyond Alice’s private break-even (\(p^*\)) and calculate the field’s welfare-maximizing applicant pool size (\(K^*\)). This \(K^*\) is how many “Alices” the field can afford to have roll the dice before the costs of failures outweigh the value of successes.
To analyze this, we must shift our model in three ways:
Switch to a Public Ledger: From a field-level perspective, private wages and consumption are just transfers. They drop out of the analysis. What matters is the net production of public goods (i.e., impact).
Distinguish Public vs. Private Costs: The costs are now different.
- Private Cost (Part A): \(c\) included Alice’s full opportunity cost (foregone wages, etc.).
- Public Cost (Part B): We now use \(\gamma\), which captures only the foregone public good during a sabbatical (e.g., \(\gamma = \mathcal {I}_0 + d_0 + \varepsilon\), or baseline impact + baseline donations + externalities).
Move from Dynamic Search to Static Contest: Instead of one person’s dynamic search, we’ll use a static “snapshot” model of the entire field for one year. We assume there are \(N\) open roles and \(K\) total applicants.
In Part A, Alice saw jobs arriving one-by-one (a Poisson process with rate \(r\)). In Part B, we are modeling an annual “contest” with \(K\) applicants competing for \(N\) jobs.
We can bridge these two views by setting \(N \approx r\). This treats the entire year’s worth of job opportunities as a single “batch” to be filled from the pool of \(K\) candidates who are “on the market” that year.
This is a standard simplification. It allows us to stop worrying about the timing of individual applications and focus on the quality of the matches, which is determined by the size of the applicant pool (\(K\)). We can then compare the total Present Value (PV) of the benefits (better hires) against the total PV of the costs (failed sabbaticals).
If \(N\) jobs are available annually (which we’ve already equated to Alice’s application rate \(r\)) and \(K\) total applicants are competing for them, a simple approximation for the per-application success probability is that it’s proportional to the ratio of jobs to applicants.
For the rest of this analysis, we’ll assume a simple mapping: \(p \approx N/K\). This allows us to plot both models on the same chart: as the field becomes more crowded (\(K\) increases), the individual chance of success (\(p\)) for any single application shrinks.
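A back-of-envelope bridge between the two parts, assuming \(\beta=1\) (perfectly efficient screening; my own illustrative calculation, not from the post’s appendices): at what pool size does this mapping push \(p\) down to Alice’s private break-even from Part A?

```python
N = 24                             # annual openings, equated with Alice's rate r
p_star = 50 * (1/3) / (24 * 22)    # Alice's private break-even ≈ 3.16%

def p_of_K(K, beta=1.0):
    """Illustrative mapping from pool size to per-application success."""
    return beta * N / K

K_breakeven = N / p_star           # pool size at which p(K) hits p*; ≈ 760
```

So under this (admittedly crude) mapping, once roughly 760 candidates are chasing 24 roles per year, the marginal naive applicant sits exactly at Alice’s break-even.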
2.2 The Field-Level Model: Assumptions
Here is the minimally complicated version of our new model:
- There are \(K\) total applicants and \(N\) open roles per year.
- Each applicant \(k\) has a true, fixed potential impact \(J^{(k)}\) drawn i.i.d. from the talent distribution \(F\).
- Employers perfectly observe \(J^{(k)}\) and hire the \(N\) best candidates. (This is a strong, optimistic assumption about hiring efficiency).
- Applicants do not know their own \(J^{(k)}\), only the distribution \(F\).
The intuition is that the field benefits from a larger pool \(K\) because it increases the chance of finding high-impact candidates. But the field also pays a price for every failed applicant.
2.3 Benefits vs. Costs on the Public Ledger
Let’s define the two sides of the field’s welfare equation.
The Marginal Benefit (MV) of a Larger Pool
The benefit of a larger pool \(K\) is finding better candidates. We care about the marginal value of adding one more applicant to the pool, which we define as \(\mathrm{MV}_K\). This is the expected annual impact increase from widening the pool from \(K\) to \(K+1\). (Formally, \(\mathrm{MV}_K := \mathbb{E}[S_{N,K+1}] - \mathbb{E}[S_{N,K}]\), where \(S_{N,K}\) is the total impact of the top \(N\) hires from a pool of \(K\)).
The Marginal Cost (MC) of a Larger Pool
The cost is simpler. When \(K > N\), adding one more applicant adds (on average) one more failed pivot, since the number of hires is capped at \(N\). This failed pivot costs the field the foregone public good during the sabbatical. We defined the social burn rate per year as \(\gamma\). To compare this to the annual benefit \(\mathrm{MV}_K\), we need the total present value of this foregone impact. We call this \(L_{\text{fail},\delta}\) (the PV of one failed attempt). (This cost is derived in Appendix B as \(L_{\text{fail},\delta}=\gamma\,\frac{1-e^{-\delta\ell}}{\delta}\)).
We do not model employer congestion from reviewing lots of applicants — on the rationale that it is empirically small because employers stop looking at candidates when they’re overwhelmed (J. Horton and Vasserman 2021).2 Note, however, that we have also claimed employers perfectly observe \(J^{(k)}\), which means we are being optimistic about the field’s ability to sort candidates. Maybe we could model a noisy search process?
2.4 Field-Level trade-offs
We can now find the optimal pool size \(K^*\). The total public welfare \(W (K)\) peaks when the marginal benefit of one more applicant equals the marginal cost.
As derived in Appendix B, the total welfare \(W(K)\) is maximized when the present value of the annual benefit stream from the marginal applicant (\(\mathrm{MV}_K / \delta\)) equals the total present value of their failure cost (\(L_{\text{fail},\delta}\)). \[ \frac{\mathrm{MV}_K}{\delta} = L_{\text{fail},\delta} \] Substituting the expression for \(L_{\text{fail},\delta}\) and cancelling the discount rate \(\delta\), we get a very clean threshold: \[ \boxed{\,\mathrm{MV}_K = \gamma\,(1-e^{-\delta\ell})\,}. \] This equation is the core of the field-level problem. The optimal pool size \(K^*\) is the point where the expected annual marginal benefit (\(\mathrm{MV}_K\)) drops to the level of the total foregone public good from one failed sabbatical attempt.
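As a closed-form sanity check (my own, since Appendix B isn’t reproduced here): for an Exponential talent pool with mean \(\mu\), the order-statistic identity \(\mathrm{MV}_K = \mu N/(K+1)\) holds, consistent with the \(\sim 1/K\) decay discussed in the next subsection. Taking \(\mu=100\) (the post’s illustrative scale), the boxed threshold can then be solved for \(K^*\) directly:

```python
import math

mu, N = 100, 24            # Exponential talent-pool mean (illustrative), openings/yr
gamma, delta, ell = 18, 1/3, 0.5

rhs = gamma * (1 - math.exp(-delta * ell))   # RHS of the boxed threshold ≈ 2.76
# Assumed Exponential form MV_K = mu * N / (K + 1); setting MV_K = rhs:
K_star = mu * N / rhs - 1                    # ≈ 868 applicants
```

This lands at \(K^*\approx 868\), matching the exponential-case optimum quoted later in the post.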
2.5 The Importance of Tail Distributions
How quickly does \(\mathrm{MV}_K\) shrink? Extreme value theory tells us this depends entirely on the tail of the candidate-quality distribution, \(F\). The shape of the tail determines how quickly returns from widening the applicant pool diminish.
We consider two families (the specific formulas are in Appendix B):
- Light tails (e.g., Exponential): In this world, candidates are variable, but the best is not transformatively better than average. Returns diminish quickly: the marginal value \(\mathrm{MV}_K\) shrinks hyperbolically (roughly as \(1/K\)).
- Heavy tails (e.g., Fréchet): This captures the “unicorn” intuition. Returns diminish much more slowly. If the tail is heavy enough, \(\mathrm {MV}_K\) decays extremely slowly, justifying a very wide search.
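To make the contrast concrete, here is a rough Monte Carlo sketch (my own construction, not from Appendix B; the pool sizes, seed, and trial count are arbitrary). It exploits the fact that adding a \((K{+}1)\)-th applicant raises the top-\(N\) sum only by the margin by which they beat the current \(N\)-th best candidate:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
N, mu, trials = 24, 100.0, 20_000

def mv_estimate(sampler, K):
    """Coupled Monte Carlo estimate of MV_K = E[S_{N,K+1}] - E[S_{N,K}]."""
    pool = sampler((trials, K))
    nth_best = np.partition(pool, K - N, axis=1)[:, K - N]  # N-th largest
    newcomer = sampler(trials)
    return np.maximum(0.0, newcomer - nth_best).mean()

exp_sampler = lambda size: rng.exponential(mu, size)        # light tail

alpha = 3.0                                # Fréchet shape: heavier tail
s = mu / math.gamma(1 - 1 / alpha)         # scale chosen so the mean is also mu
fre_sampler = lambda size: s * (-np.log(rng.random(size))) ** (-1 / alpha)

# How much of MV_K survives when the pool quadruples from 100 to 400?
ratios = {name: mv_estimate(f, 400) / mv_estimate(f, 100)
          for name, f in [("exponential", exp_sampler), ("frechet", fre_sampler)]}
```

With these settings the exponential ratio comes out near \(101/401 \approx 0.25\) (the \(1/K\) decay), while the Fréchet ratio is noticeably larger, illustrating the slower decay of \(\mathrm{MV}_K\) under heavy tails.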
2.6 Implications for Optimal Pool Size
This difference in diminishing returns has a huge effect on the optimal pool size \(K^*\). (The full solutions for \(K^*\) are in Appendix B.)
With light tails, there’s a finite pool size after which turning up the hype (growing \(K\)) destroys net welfare. Every extra applicant burns \(L_{\text{fail},\delta}\) in foregone public impact while adding an \(\mathrm{MV}_K\) that shrinks rapidly.
With heavy tails, it’s different. As the tail gets heavier, \(K^*\) explodes. In very heavy-tailed worlds, very wide funnels can still be net positive. We may decide it’s worth, as a society, spending a lot of resources to find the few unicorns.
We set the expected impact per hire per year to \(\mu_{\text{imp}}=100\) (impact dollars/yr) to match Alice’s hypothetical target role; this is just for exposition.
We can, of course, plot this.
Computed values for the plot, with \(\delta=0.333\)/yr and \(L_{\text{fail},\delta}=8.29\) impact-$ (PV): Exponential: \(K^*=800\) (boundary), \(W^*=25{,}869.7\); Fréchet \(\alpha=1.8\): \(K^*=800\) (boundary), \(W^*=50{,}300.4\); Fréchet \(\alpha=2.0\): \(K^*=800\) (boundary), \(W^*=40{,}229.2\); Fréchet \(\alpha=3.0\): \(K^*=800\) (boundary), \(W^*=19{,}116.8\).
- This plot shows total net welfare \(W(K)\) and marks the maximum \(K^*\) for each family, showing where total welfare peaks. The dashed line at \(K=N\) shows where failures begin: when \(K>N\), the \(K-N\) unmatched applicants each impose a public cost of \(L_{\text{fail},\delta}\). The markers show \(K^*=\arg\max W(K)\), the pool size beyond which widening further would reduce total impact.
- Units: \(B(K)\) is in impact dollars per year and is converted to PV by multiplying by \(H_\delta=\frac{1}{\delta}\). The subtraction uses the discounted per-failure cost \(L_{\text{fail},\delta}=\gamma\,\frac{1-e^{-\delta\ell}}{\delta}\).
- Fréchet curves use the large-\(K\) asymptotic \(B(K)\approx s\,K^{1/\alpha}C_N\) (with \(s=\mu_{\text{imp}}/\Gamma(1-1/\alpha)\)). We could work harder to get the exact \(B(K)\) for Fréchet, but the asymptotic is good enough to illustrate the qualitative behaviour.
- We treat all future uncertainties about role duration, turnover, or project lifespan as already captured in the overall discount rate \(\delta\).
We can combine these perspectives to visualize the tension between private incentives and public welfare.
Computed values for the combined plot: \(p^*\) (private) \(=3.16\%\); Exponential: \(K^*=868\), \(p(K^*)\approx 2.76\%\); Fréchet \(\alpha=1.8\): \(K^*=3200\) (boundary), \(p(K^*)\approx 0.75\%\); Fréchet \(\alpha=2.0\): \(K^*=3200\) (boundary), \(p(K^*)\approx 0.75\%\); Fréchet \(\alpha=3.0\): \(K^*=1164\), \(p(K^*)\approx 2.06\%\).
This visualization combines the private and public views by assuming an illustrative mapping from pool size to success probability: \(p\approx \beta N/K\) (where \(\beta\) bundles screening efficiency; here \(\beta=1\)). The black curve (left axis) shows a candidate’s private expected value (EV) versus success probability \(p\). The coloured curves (right axis) show field welfare \(W(K)\). The private break-even point \(p^*\) (black dashed line) can fall far to the left of the field-optimal \(p(K^*)\) (coloured vertical lines). This gap represents the region where individuals may be rationally incentivized to enter even though, at the field level, the pool is already saturated or oversaturated.
3 Part C — Counterfactual Impact and Equilibrium
Part A modeled a “naive” applicant who evaluates their pivot based on the absolute impact (\(\mathcal {I}_1\)) of the role, ignoring pool dynamics. Part B analyzed the field-level optimum (\(K^*\)), showing how the marginal value (\(\mathrm{MV}_K\)) of an applicant decreases as the pool grows.
Now we tie those together. If Alice is sophisticated and understands these dynamics and aims to maximize her counterfactual impact, would she make the same choice?
This changes the game, introducing a feedback loop where individual incentives depend on the crowd size (\(K\)), and the candidate quality distribution \((F)\).
3.1 The Counterfactual Impact Model
If we assume applicants can’t know their quality relative to the pool ex-ante (formally: applicants are exchangeable), the expected counterfactual impact of the decision to apply is exactly \(\mathrm{MV}_K\) per year.
Alice should use this in her EV calculation. However, her initial EV formula (Part A) used the impact conditional on success, not the ex-ante expected impact of her decision in isolation.
Let \(\mathcal{I}_{CF}\) be the expected counterfactual impact conditional on success. Let \(q_K\) be the probability of success given the pool size \(K\). (In the static model of Part B, with the marginal applicant joining \(K\) others to compete for \(N\) slots, exchangeability gives \(q_K = N/(K+1)\).) If the attempt fails, the counterfactual impact is zero.
We can derive the relationship (see Appendix): \[ \mathrm{MV}_K = q_K \cdot \mathcal{I}_{CF}. \] Therefore, the impact, conditional on success, is: \[ \boxed{\mathcal{I}_{CF} = \frac{\mathrm{MV}_K}{q_K}.} \] We recalibrate the private decision by defining the counterfactual private surplus, \(\Delta u_{CF}\), replacing the naive absolute impact \(\mathcal{I}_1\) with the counterfactual estimate \(\mathcal{I}_{CF}\).
This changes the dynamics; previously the gamble’s value depended on the pool size only insofar as it affected the per-application success probability \(p\) and thus the overall success probability \(q_K\). Now the value of the upside also depends on the pool size \(K\). As \(K\) grows, \(\mathrm{MV}_K\) decreases, but \(q_K\) also decreases. The behavior of \(\mathcal{I}_{CF}\) depends on how these balance, which is determined by the tail of the impact distribution.
3.2 The Dynamics of Counterfactual Impact
The behavior of \(\mathcal{I}_{CF}\) leads to different implications depending on whether the recruitment pool is light-tailed or heavy-tailed.
3.2.1 Case 1: Light Tails
In a light-tailed recruitment pool (where applicants are relatively similar), the math shows (See Appendix) that the expected counterfactual impact conditional on success, \(\mathcal{I}_{CF}\), is constant and equal to the population average impact (\(\mu\)), regardless of how crowded the field is (\(K\)). \[ \mathcal{I}_{CF} = \mu \quad \text{(Light Tail)} \] Intuition: While a larger pool increases the quality of the very best hire, it also increases the quality of the person the hire displaces. In the stylized light-tailed model (Exponential distribution), these effects perfectly cancel out. More generally, in light-tailed talent pools, the gap between the hire and the displaced candidate doesn’t grow much as the pool gets larger.
Implication: If the average impact \(\mu\) is modest and the candidate skills are relatively evenly distributed, pivots involving significant pay cuts are likely to have negative EV for the average applicant, regardless of pool size.
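A quick numeric illustration of the cancellation, using the exchangeable success probability \(q_K=N/(K+1)\) from above and the Exponential identity \(\mathrm{MV}_K=\mu N/(K+1)\) (my own closed-form stand-in for the appendix result; the numbers are illustrative):

```python
mu, N = 100.0, 24   # population mean impact (illustrative), annual openings

def icf_exponential(K):
    """I_CF = MV_K / q_K; with MV_K = mu*N/(K+1) and q_K = N/(K+1),
    the pool-size factors cancel exactly, leaving the population mean."""
    mv = mu * N / (K + 1)
    q_k = N / (K + 1)
    return mv / q_k

# icf_exponential(K) == mu for every K: crowding leaves I_CF unchanged.
```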
3.2.2 Case 2: Heavy Tails
In a heavy-tailed model, “unicorns” hide in the recruiting pool. Here, \(\mathcal{I}_{CF}\) increases as the field gets more crowded (\(K\)), and — under certain assumptions — it can increase fast enough to offset the costs of sabbaticals, foregone donations, etc. For the Fréchet distribution with shape \(\alpha\), \(\mathcal{I}_{CF}\) grows proportionally to \(K^{1/\alpha}\). \[ \mathcal{I}_{CF} \propto K^{1/\alpha} \quad \text{(Heavy Tail)} \] Intuition: As \(K\) increases, the expected quality of the top candidates rises much faster than that of the candidates they displace. Success in a large pool is a strong signal that we are likely a high‑impact individual, and the gap between us and the displaced candidate is large.
Implication: In a heavy‑tailed world, the pivot can become highly attractive if the field is sufficiently crowded, even with significant pay cuts.
3.3 Alice Revisited
With light‑tailed assumptions, \(\mathcal{I}_{CF}\) equals the population mean \(\mu\) and is too small to offset Alice’s pay cut and lost donations—her counterfactual surplus is negative regardless of \(K\). Under heavy‑tailed assumptions, \(\mathcal{I}_{CF}\) rises with \(K\); across a broad range of conditions, the pivot can become attractive despite large pay cuts (i.e. if Alice truly might be a unicorn). The sign and size of this effect hinge on the tail parameter and scale, which are currently unmeasured.
3.4 Visualizing Private Incentives vs. Public Welfare
We can now visualize the dynamics of private, public and counterfactual private valuations by assuming an illustrative mapping between pool size and success probability: \(p\approx \beta N/K\). This allows us to see how the incentives change as the field gets more crowded (moving left on the x-axis).
This visualization combines all three perspectives using Alice’s parameters. There are a lot of lines and assumptions wrapped up in this plot. The main takeaway: if we care about solving the problem, for many variants of this model — when trading off pivoting versus donating — we should probably donate. The only exception is if we believe the talent pool is very heavy-tailed (Fréchet with \(\alpha \leq 2\)), in which case, if we are one of those unicorns, we should probably pivot. Otherwise, donating is likely to have higher expected impact.
Left Axis (Private EV):
- A (Black Solid): The naive applicant’s EV (Part A). It crosses zero at the naive break-even \(p^* \approx 3.16\%\).
- C (Colored Dashed): The sophisticated applicant’s EV (Part C), using counterfactual impact \(\Delta u_{CF}(K)\). The point where these curves cross zero defines the equilibrium \(K_{eq}\).
Right Axis (Public Welfare):
- B (Colored Solid): The field’s total welfare \(W(K)\) (Part B). The peak defines the social optimum \(K^*\).
The Information Gap (A vs C): The Naive EV (Black) is significantly higher than the Counterfactual EV (Colored Dashed) across most of the range. Applicants relying on naive valuations of the impact of a career pivot (using personal impact change \(\Delta \mathcal{I}\) instead of counterfactual impact change \(\Delta \mathcal{I}_{CF}\)) will drastically overestimate their counterfactual impact and thus the expected value of the pivot.
The Impact of Costs vs. Tails:
- In light-tailed talent pools (Exponential, Purple; Fréchet \(\alpha=3.0\), Red), the Counterfactual EV is always negative. Alice’s 78k financial loss dominates the expected impact. The equilibrium is minimal (\(K_{eq}=N\)), leading to Under-Entry relative to the optimum (\(K^*\)).
- In heavy-tailed talent pools (Fréchet \(\alpha=2.0\), Green; \(\alpha=1.8\), Orange), the dynamics change dramatically.
Complex Dynamics in Heavy Tails (The “Hump Shape”): For heavy tails (Green, Orange dashed lines), the Counterfactual EV is non-monotonic. It starts positive, increases as \(K\) grows (because \(\mathcal{I}_{CF}(K)\) increases rapidly), and eventually decreases as the success probability \(p(K)\) drops too low.
The Structural Misalignment (B vs C): In heavy-tailed talent pools, the equilibrium \(K_{eq}\) is vastly larger than the optimum \(K^*\). The efficient search process (high \(r\)) means the private cost of trying is low, which incentivizes entry long past the social optimum. This leads to massive over-entry. (For example, in the \(\alpha=2.0\) case, \(K^*\) is around 30k, while \(K_{eq}\) is over 400k).
This visualization confirms the analysis: the system’s calibration is highly sensitive to the tail distribution and private costs. Depending on the parameters, the system can structurally incentivize either severe under-entry or massive over-entry, even when applicants are sophisticated.
3.5 Equilibrium vs. Optimum
This feedback mechanism—where incentives depend on \(K\)—creates a natural equilibrium. Applicants will enter until the EV for the marginal entrant is zero. This defines the equilibrium candidate pool size, \(K_{eq}\).
To analyze this, we must reintegrate the counterfactual surplus \(\Delta u_{CF}(K)\) into the dynamic search model (Part A). We assume the pool size \(K\) determines the surplus \(\Delta u_{CF}(K)\) and the per-application success probability \(p(K)\). The equilibrium \(K_{eq}\) occurs when the expected gain rate equals the burn rate (the bracketed term in the EV formula is zero): \[ \frac{\Delta u_{CF}(K)\,r p(K)}{\rho} = c. \] Does this equilibrium \(K_{eq}\) align with the socially optimal pool size \(K^*\) (Part B)?
Generally, no. Whether they align depends on how private costs (\(c\)), social costs (\(\gamma\)) and the efficiency of the job-search process (\(r\)) compare.
3.6 Search Efficiency
The equilibrium condition depends on the application rate \(r\). We can rewrite the equilibrium condition as: \[ \Delta u_{CF}(K) \cdot p(K) = \frac{c\rho}{r}. \] The left side is the expected counterfactual surplus per application attempt. The right side, \(\frac{c\rho}{r}\), represents the effective private cost hurdle per application attempt (scaled by the discount rate).
If the job search process is highly efficient (high \(r\)), the private cost hurdle is low. This encourages people to apply even when the expected counterfactual impact per application is small, because trying is cheap.
We can compare this private incentive to the social optimum. As derived in Appendix C.3, if the private cost hurdle (\(\frac{c\rho}{r}\)) is significantly lower than the social cost of failure (related to \(\gamma\)), the system structurally leads to Over-Entry (\(K_{eq} > K^*\)).
Let’s check Alice’s numbers: \(c=50, \rho=1/3, r=24\). The private cost hurdle is \(\frac{c\rho}{r} \approx \frac{50/3}{24} \approx 0.69k\). The social cost rate \(\gamma\) (foregone donations) is \(18k\).
Since \(0.69k \ll 18k\), the system strongly favours over-entry. The efficiency of the search process dramatically lowers the private barrier to entry compared to the social costs incurred.
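The arithmetic above in one place (Alice’s illustrative numbers):

```python
c, rho, r = 50.0, 1/3, 24   # private burn (k$/yr), discount (/yr), applications/yr
gamma = 18.0                # social burn rate: her foregone donations (k$/yr)

private_hurdle = c * rho / r      # ≈ 0.69 k$ per application attempt
wedge = gamma / private_hurdle    # ≈ 26x: social cost rate vs private hurdle
```

The roughly 26× wedge between the social cost rate and the per-attempt private hurdle is the structural driver of over-entry in this parameterization.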
For example, in the heavy-tailed case (\(\alpha=2\)), we might find \(K^* \approx 11,600\), while \(K_{eq} \approx 178,000\).
4 Implications and Solutions
The analysis suggests the AI safety field may be oversubscribed. The core problem is misalignment: organizations influencing the funnel size don’t internalize the costs borne by unsuccessful applicants. This incentivizes maximizing application volume (a visible proxy) rather than welfare-maximizing matches—a classic setup for Goodhart’s Law.
A healthy field can rationally accept high individual failure rates if it measures and communicates the odds. If the field doesn’t measure them, the same logic becomes waste. The ethical burden shifts when the system knowingly asks people to take low-probability gambles without making that explicit.
4.1 For Individuals: Knowing the Game
For mid-career individuals, the decision is high-stakes. (For early-career individuals, costs \(c\) are lower, making the gamble more favourable, but the need to estimate \(p\) remains.)
- Calculate your threshold (\(p^*\)): Use the model in Part A (and the linked calculator). Without strong evidence that \(p > p^*\), a pivot involving significant unpaid time is likely EV-negative.
- Seek cheap signals: Seek personalized evidence of fit—such as applying to a few roles before leaving your current job—before committing significant resources.
- Use grants as signals: Organizations like Open Philanthropy offer career transition grants. These serve as information gates. If received, a grant lowers the private cost (\(c\)). If denied, it is a valuable calibration signal. If a major funder declines to underwrite the transition, candidates should update \(p\) downwards. (If you don’t get that Open Phil transition grant, don’t quit your current job.)
4.2 For Organizations: Transparency and Feedback
Employers and advice organizations control the information flow. Unless they provide evidence-based estimates of success probabilities, their generic encouragement should be treated with scepticism.
- Publish stage-wise acceptance rates (Base Rates). Employers must publish historical data (applicants, interviews, offers) by track and seniority. This is the single most impactful intervention for anchoring \(p\).
- Provide informative feedback and rank. Employers should provide standardized feedback or an indication of relative rank (e.g., “top quartile”). This feedback is costly, but this cost must be weighed against the significant systemic waste currently externalized onto applicants and the long-term credibility of the field.
- Track advice calibration. Advice organizations should track and publish their forecast calibration (e.g., Brier scores) regarding candidate success. If an advice organization doesn’t track outcomes, its advice cannot be calibrated except by coincidence.
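Calibration tracking needs very little machinery. Here is a minimal sketch of a Brier score over hypothetical advisor forecasts; the numbers are invented for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.
    0 is perfect; an uninformative constant forecast p scores about p*(1-p)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical advisor forecasts of candidate success, and what happened:
forecasts = [0.8, 0.3, 0.6, 0.1]
outcomes  = [1,   0,   0,   0]   # did the candidate land a role?
print(brier_score(forecasts, outcomes))  # 0.125
```

The point is not the metric itself but the habit: an organization that records forecasts and outcomes can compute this in one line; one that doesn't cannot be calibrated except by coincidence.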
4.3 For the Field: Systemic Calibration
To optimize the funnel size, the field needs to measure costs and impact tails.
- Estimate applicant costs (\(c\ell\)). Advice organizations or funders should survey applicants (successful and unsuccessful) to estimate typical pivot costs.
- Track realized impact proxies. Employers should analyze historical cohorts to determine if widening the funnel is still yielding significantly better hires, or if returns are rapidly diminishing.
- Experiment with mechanism design. In capacity-constrained rounds, implementing soft caps—pausing applications after a certain number—can reduce applicant-side waste without significantly harming match quality (J. J. Horton et al. 2024).
5 Where next?
I’d like feedback from people deeper in the AI safety career ecosystem. I’d love to chat with people from 80,000 Hours, MATS, FHI, CHAI, Redwood Research, Anthropic, etc., about this. What is your model of the candidate impact distribution, the tail behaviour, and the costs? What have I got wrong? What have I missed? I’m open to the possibility that this is well understood and being actively managed behind the scenes, but I haven’t seen it laid out this way anywhere.
6 Further reading
Resources that complement the mechanism-design view of the AI safety career ecosystem:
- Christopher Clay, AI Safety’s Talent Pipeline is Over-optimised for Researchers
- AI Safety Field Growth Analysis 2025
- Why experienced professionals fail to land high-impact roles. Context deficits and transition traps that explain why even strong senior hires often bounce out of the AI safety funnel.
- Levelling Up in AI Safety Research Engineering — EA Forum. A practical upskilling roadmap; complements the “lower \(c\), raise \(V\), raise \(p\)” levers by reducing risk before a pivot.
- SPAR — Supervised Program for Alignment Research. An example of a program that provides structured training and, implicitly, some “negative previews” of the grind of AI safety work.
- MATS retrospectives — LessWrong. Transparency on acceptance rates, alumni experiences, and obstacles faced in this training program.
- Why not just send people to Bluedot on FieldBuilding Substack. A critique of naive funnel-building and the hidden costs of over-sending candidates to “default” programs.
- How Stuart Russell’s IASEAI conference failed to live up to its potential (FBB #8) — EA Forum. A cautionary tale about how even well-intentioned field-building efforts can misfire without mechanism design.
- 80,000 Hours career change guides. Practical content on managing costs, transition grants, and opportunity cost—useful for calibrating \(c\) in the pivot-EV model.
- Forecasting in personal decisions — 80k. Advice on making and updating stage-wise probability forecasts; relevant to candidate calibration.
- AI safety technical research - Career review
- Updates to our research about AI risk and careers - 80,000 Hours
- The case for taking your technical expertise to the field of AI policy - 80,000 Hours
- Center for the Alignment of AI Alignment Centers. A painfully relatable satire that deserves citing here.
- AMA: Ask Career Advisors Anything — EA Forum
7 Appendix A: Private Decision Model Derivations
We model the career pivot attempt as a continuous-time process during a sabbatical of maximum length \(\ell\).
Setup:
- Job opportunities arrive as a Poisson process with rate \(r\).
- The per-application success probability is \(p\) (i.i.d.).
- The success process is a Poisson process with rate \(\lambda = rp\).
- The time to the first success is \(T_1 \sim \mathrm{Exp}(\lambda)\).
- The actual sabbatical duration is the stopping time \(\tau = \min\{T_1, \ell\}\).
- The continuous discount rate is \(\rho>0\).
- The annual utility surplus if the pivot succeeds is \(\Delta u\).
- The burn rate during the sabbatical is \(c\).
7.1 Sabbatical Duration and Success Statistics
The probability of success within the runway is: \[ q = P(T_1 \le \ell) = 1 - e^{-\lambda \ell} = 1 - e^{-r p \ell}. \] We calculate the expected duration \(\mathbb{E}[\tau]\) using the survival function \(P(\tau > t)\). For \(t \in [0, \ell]\), \(\tau > t\) holds if and only if no success has occurred by time \(t\), so \(P(\tau > t) = P(T_1 > t) = e^{-\lambda t}\); for \(t > \ell\), \(P(\tau>t)=0\), since \(\tau \le \ell\) by construction. \[ \mathbb{E}[\tau] = \int_0^\infty P(\tau > t)\,dt = \int_0^\ell e^{-\lambda t}\,dt = \frac{1 - e^{-\lambda \ell}}{\lambda}. \] The expected duration conditional on success, \(\mathbb{E}[\tau\mid \text{success}] = \mathbb{E}[T_1 \mid T_1 \le \ell]\), is given by the truncated exponential distribution. The PDF of \(T_1\) conditional on \(T_1 \le \ell\) is \(f(t\mid T_1\le\ell) = \frac{\lambda e^{-\lambda t}}{1-e^{-\lambda\ell}}\) for \(t\in[0,\ell]\). Using integration by parts, we get: \[ \begin{aligned} \mathbb{E}[\tau\mid \text{success}] &= \frac{1}{1-e^{-\lambda\ell}} \int_0^\ell t \lambda e^{-\lambda t}\,dt \\ &= \frac{1}{1-e^{-\lambda\ell}}\left( \left[-t e^{-\lambda t}\right]_0^\ell + \int_0^\ell e^{-\lambda t}\,dt \right) \\ &= \frac{1}{1-e^{-\lambda\ell}}\left( -\ell e^{-\lambda\ell} + \frac{1-e^{-\lambda\ell}}{\lambda} \right) \\ &= \frac{1}{\lambda} - \frac{\ell e^{-\lambda\ell}}{1-e^{-\lambda\ell}}. \end{aligned} \]
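These closed forms are easy to spot-check by simulation. A quick Monte Carlo sketch, with arbitrary illustrative values for \(\lambda\) and \(\ell\):

```python
import math
import random

def simulate_sabbatical(lam, ell, n=200_000, seed=0):
    """Estimate P(success within runway) and E[tau] by simulation."""
    rng = random.Random(seed)
    successes, tau_total = 0, 0.0
    for _ in range(n):
        t1 = rng.expovariate(lam)   # time of first success, T1 ~ Exp(lam)
        tau = min(t1, ell)          # stopping time
        successes += t1 <= ell
        tau_total += tau
    return successes / n, tau_total / n

lam, ell = 0.6, 0.5
q_hat, tau_hat = simulate_sabbatical(lam, ell)
q = 1 - math.exp(-lam * ell)              # closed-form success probability
e_tau = (1 - math.exp(-lam * ell)) / lam  # closed-form expected duration
print(q_hat, q)        # both ≈ 0.259
print(tau_hat, e_tau)  # both ≈ 0.432
```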
7.2 Derivation of the Expected Present Value (\(\Delta \mathrm{EV}_\rho(p)\))
The expected value of a pivot attempt equals the expected discounted benefit minus the expected discounted cost.
Expected Discounted Benefit (\(\mathbb{E}[B]\)): If the pivot succeeds at time \(T_1=t \le \ell\), the benefit is the present value (PV) of the stream \(\Delta u\) starting at \(t\): \(B(t) = \int_t^\infty \Delta u\,e^{-\rho (s-t)}e^{-\rho t}\,ds = \frac{\Delta u}{\rho}e^{-\rho t}\). We take the expectation over the time of success \(T_1\), up to the runway limit \(\ell\), using the density \(f_{T_1}(t) = \lambda e^{-\lambda t}\): \[ \begin{aligned} \mathbb{E}[B] &= \int_0^\ell B(t) f_{T_1}(t)\,dt = \int_0^\ell \frac{\Delta u}{\rho}e^{-\rho t} \lambda e^{-\lambda t}\,dt \\ &= \frac{\Delta u \lambda}{\rho} \int_0^\ell e^{-(\lambda+\rho)t}\,dt \\ &= \frac{\Delta u \lambda}{\rho(\lambda+\rho)} (1-e^{-(\lambda+\rho)\ell}). \end{aligned} \]
Expected Discounted Cost (\(\mathbb{E}[C]\)): The cost is incurred at rate \(c\) during the sabbatical \([0, \tau]\). We compute \(\mathbb{E}\left[\int_0^\tau c e^{-\rho t} dt\right]\). We swap expectation and integration (by Fubini’s theorem, since the integrand is positive): \[ \mathbb{E}[C] = c \int_0^\infty e^{-\rho t} \mathbb{E}[\mathbb{I}(t < \tau)] dt = c \int_0^\infty e^{-\rho t} P(\tau > t) dt. \] We use the survival function \(P(\tau > t) = e^{-\lambda t}\) for \(t \in [0, \ell]\): \[ \mathbb{E}[C] = c \int_0^\ell e^{-\rho t} e^{-\lambda t} dt = c \int_0^\ell e^{-(\lambda+\rho)t} dt = c \frac{1-e^{-(\lambda+\rho)\ell}}{\lambda+\rho}. \]
Total Expected Value: \[ \Delta \mathrm{EV}_\rho(p) = \mathbb{E}[B] - \mathbb{E}[C]. \] We factor out the common term \(\frac{1-e^{-(\lambda+\rho)\ell}}{\lambda+\rho}\) (the expected discounted duration) and substitute \(\lambda=rp\): \[ \boxed{ \Delta \mathrm{EV}_\rho(p) = \frac{1-e^{-(rp+\rho)\ell}}{rp+\rho} \left(\frac{\Delta u\,rp}{\rho} - c\right). } \]
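The boxed formula can be verified by simulating the discounted cash flows directly. A sketch with illustrative parameters (the same hypothetical calibration used elsewhere in these notes is not implied; the numbers are arbitrary):

```python
import math
import random

def ev_closed(p, r, rho, du, c, ell):
    """The boxed closed form for the expected present value of a pivot."""
    lam = r * p
    return (1 - math.exp(-(lam + rho) * ell)) / (lam + rho) * (du * lam / rho - c)

def ev_monte_carlo(p, r, rho, du, c, ell, n=400_000, seed=1):
    """Simulate discounted benefit minus discounted burn, attempt by attempt."""
    rng = random.Random(seed)
    lam, total = r * p, 0.0
    for _ in range(n):
        t1 = rng.expovariate(lam)
        tau = min(t1, ell)
        cost = c * (1 - math.exp(-rho * tau)) / rho  # PV of burn over [0, tau]
        benefit = (du / rho) * math.exp(-rho * t1) if t1 <= ell else 0.0
        total += benefit - cost
    return total / n

args = dict(p=0.05, r=12, rho=0.05, du=20_000, c=120_000, ell=0.5)
ev_c, ev_mc = ev_closed(**args), ev_monte_carlo(**args)
print(ev_c, ev_mc)  # both ≈ $51k with these parameters
```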
7.3 Break-even Probability (\(p^*\))
The EV is zero whenever the term in brackets vanishes (since the prefactor is strictly positive). \[ \frac{\Delta u\,rp_\rho^*}{\rho} - c = 0 \implies \boxed{p_\rho^* = \frac{c\rho}{r\Delta u}}. \]
8 Appendix B: Field-Level Model Derivations
We analyze the field-level optimum using a public ledger in impact dollars.
Setup:
- \(K\) applicants, \(N\) seats. Impacts are \(J^{(k)} \sim F\) i.i.d.
- Hires are the top \(N\) candidates: \(J_{(K)} \ge J_{(K-1)} \ge \dots\).
- Total annual impact from the top \(N\): \(S_{N,K}:=J_{(K)}+J_{(K-1)}+\dots+J_{(K-N+1)}\).
- Expected annual benefit: \(B(K) = \mathbb{E}[S_{N,K}]\).
- Marginal value: \(\mathrm{MV}_K = B(K+1) - B(K)\).
- Social discount rate: \(\delta\).
- Social burn rate (foregone public impact): \(\gamma := \mathcal{I}_0 + d_0 + \varepsilon\).
We don’t model congestion costs. Generally, employers who’ve filled a given role can ignore excess applications, and there’s a lot of evidence that they do (J. Horton, Kerr, and Stanton 2017; J. Horton and Vasserman 2021; J. J. Horton et al. 2024).
8.1 Welfare Function and Optimality
Present-value horizon: \(H_\delta = \int_0^\infty e^{-\delta t}\,dt = 1/\delta\).
PV of a failed attempt: Assuming a failed attempt uses the full runway \(\ell\) (this simplifies calculating the marginal cost of an additional applicant): \[ L_{\text{fail},\delta} = \int_0^\ell \gamma e^{-\delta t}\,dt = \gamma\frac{1-e^{-\delta\ell}}{\delta}. \]
Total Welfare (\(W(K)\)): The total welfare \(W(K)\) is the present value of the benefits from the \(N\) hires minus the present value of the costs of all \((K-N)\) failures. \[ W(K) = B(K) \cdot H_\delta - \max\{K-N, 0\} \cdot L_{\text{fail},\delta}. \] The welfare-maximizing pool size \(K^*\) (for \(K>N\)) is where the marginal benefit equals the marginal cost. Adding one applicant produces exactly one expected failure in the pool, so the marginal cost is \(L_{\text{fail},\delta}\). \[ \mathrm{MV}_K \cdot H_\delta = L_{\text{fail},\delta}. \] Substituting the expressions and cancelling \(1/\delta\): \[ \boxed{\mathrm{MV}_K = \gamma (1-e^{-\delta\ell}).} \] This is the optimality condition used in the main text.
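The optimality condition can also be solved numerically for any benefit function \(B(K)\). A minimal grid-search sketch with illustrative parameters, checked against the exponential closed form derived in the next subsection (the two can differ by one because \(K\) is an integer):

```python
import math

def K_star_numeric(B, N, gamma, delta, ell, K_max=100_000):
    """Smallest pool size at which the marginal value MV_K = B(K+1) - B(K)
    no longer covers the marginal social cost gamma * (1 - exp(-delta*ell))."""
    cost = gamma * (1 - math.exp(-delta * ell))
    K = N
    while K < K_max and B(K + 1) - B(K) > cost:
        K += 1
    return K

def harmonic(n):
    return sum(1.0 / k for k in range(1, n + 1))

# Exponential-tail benefit function, B(K) = N * mean * (1 + H_K - H_N):
N, mean = 10, 100.0
B = lambda K: N * mean * (1 + harmonic(K) - harmonic(N))
K_num = K_star_numeric(B, N, gamma=100.0, delta=0.05, ell=0.5)
K_closed = N * mean / (100.0 * (1 - math.exp(-0.05 * 0.5))) - 1
print(K_num, round(K_closed))  # within one of each other
```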
8.2 Distribution-Specific Results
We solve for \(K^*\) based on the behaviour of \(\mathrm{MV}_K\) for different distributions \(F\).
8.2.1 Exponential Distribution (Light Tail)
Let \(J \sim \mathrm{Exp}(\lambda)\) have mean \(1/\lambda\). The expected sum of the top \(N\) order statistics out of \(K\) draws has a known closed form, often derived using the Rényi representation of exponential spacings: \[ B(K)=\frac{N}{\lambda}\Bigl(1+H_K-H_N\Bigr), \] Where \(H_K = \sum_{k=1}^K \frac{1}{k}\) is the \(K\)-th harmonic number.
Marginal Value: \[ \mathrm{MV}_K = B(K+1) - B(K) = \frac{N}{\lambda}(H_{K+1} - H_K) = \frac{N}{\lambda(K+1)}. \] Returns diminish hyperbolically (\(O(1/K)\)).
Optimal Pool Size \(K^*\): Set \(\mathrm{MV}_K\) equal to the marginal social cost \(\gamma (1-e^{-\delta\ell})\): \[ \boxed{K^* = \frac{N}{\lambda\gamma(1-e^{-\delta\ell})} - 1.} \]
8.2.2 Fréchet Distribution (Heavy Tail)
Let \(J \sim \text{Fréchet}(\alpha, s)\) have shape \(\alpha>1\) (necessary for a finite mean) and scale \(s\). We use asymptotic results from extreme value theory for large \(K\). The expected sum of the top \(N\) values scales as \(K^{1/\alpha}\): \[ B(K) \approx s\,K^{1/\alpha}\,C_N(\alpha), \] where \(C_N(\alpha)\) is a constant, independent of \(K\) and \(s\): \[ C_N(\alpha) := \sum_{k=1}^{N}\frac{\Gamma\bigl(k-\tfrac{1}{\alpha}\bigr)}{\Gamma(k)}. \]
Marginal Value: We approximate the marginal value by taking the derivative of the asymptotic expression: \[ \mathrm{MV}_K \approx \frac{d}{dK} B(K) = s C_N(\alpha) \frac{1}{\alpha} K^{\frac{1}{\alpha}-1}. \] Returns diminish as a power law (\(O(K^{-(1-1/\alpha)})\)), slower than exponential.
Optimal pool size \(K^*\): We set \(\mathrm{MV}_K\) equal to the marginal social cost \(\gamma (1-e^{-\delta\ell})\) and solve for \(K\): \[ \frac{s C_N(\alpha)}{\alpha} (K^*)^{\frac{1}{\alpha}-1} = \gamma (1-e^{-\delta\ell}). \] \[ \boxed{K^* = \left(\frac{s\,C_N(\alpha)}{\alpha\,\gamma\,(1-e^{-\delta\ell})}\right)^{\frac{\alpha}{\alpha-1}}.} \] As \(\alpha \downarrow 1\) (heavier tails), the exponent \(\frac{\alpha}{\alpha-1} \to \infty\) and \(K^*\) explodes.
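The Fréchet constant and the resulting \(K^*\) are straightforward to evaluate numerically. A sketch with arbitrary illustrative parameters, showing how quickly \(K^*\) grows as the tail thickens:

```python
import math

def C_N(alpha, N):
    """C_N(alpha) = sum_{k=1}^N Gamma(k - 1/alpha) / Gamma(k)."""
    return sum(math.gamma(k - 1 / alpha) / math.gamma(k) for k in range(1, N + 1))

def K_star_frechet(alpha, s, N, gamma, delta, ell):
    """Boxed optimal pool size for a Frechet(alpha, s) impact distribution."""
    cost = gamma * (1 - math.exp(-delta * ell))
    return (s * C_N(alpha, N) / (alpha * cost)) ** (alpha / (alpha - 1))

# Same scale s and seat count N; heavier tails (smaller alpha) explode K*:
for alpha in (3.0, 2.0, 1.5):
    print(alpha, K_star_frechet(alpha, s=50.0, N=10, gamma=100.0, delta=0.05, ell=0.5))
```

As a check on \(C_N(\alpha)\): for \(N=1\) it reduces to \(\Gamma(1-1/\alpha)\), the mean of a unit-scale Fréchet, e.g. \(C_1(2)=\Gamma(1/2)=\sqrt{\pi}\).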
8.3 Plotting Parameters (for \(W(K)\) curves)
To plot the total welfare curves \(W(K)\), we need the total benefit \(B(K)\) and must normalize the distributions so they have the same mean impact, \(\mu_{\text{imp}}\), for a fair comparison. We set \(\mu_{\text{imp}}=100\) impact-dollars/yr in the main text.
- Exponential:
- Mean: \(1/\lambda=\mu_{\text{imp}}\Rightarrow \lambda=1/\mu_{\text{imp}}\).
- Total Benefit (exact): \(B(K)=\mu_{\text{imp}}\,N\,\Big(1+H_K-H_N\Big)\).
- Fréchet (\(\alpha>1\)):
- Mean: \(\mathbb{E}[J]=s\,\Gamma(1-1/\alpha)=\mu_{\text{imp}}\Rightarrow s=\mu_{\text{imp}}/\Gamma(1-1/\alpha)\).
- Total Benefit (asymptotic): \(B(K)\approx s\,C_N(\alpha)\,K^{1/\alpha}\).
- (Where \(C_N(\alpha)\) is defined above).
9 Appendix C: Counterfactual Impact and Equilibrium Derivations
9.1 Derivation of \(\mathcal{I}_{CF}\)
We relate the Marginal Value of entry (\(\mathrm{MV}_K\)) to the expected counterfactual impact conditional on success (\(\mathcal{I}_{CF}\)).
\(\mathrm{MV}_K\) is the expected increase in total field impact when an applicant joins the pool (moving from \(K\) to \(K+1\) applicants). Let \(S\) be the event of success (being hired), and let \(q_K = P(S)\) denote the marginal entrant’s success probability. We assume applicants are exchangeable, so every applicant faces the same \(q_K\).
By the law of total expectation: \[ \mathrm{MV}_K = \mathbb{E}[\text{Impact of Entry} \mid S] P(S) + \mathbb{E}[\text{Impact of Entry} \mid \neg S] P(\neg S). \] If an applicant enters and fails (\(\neg S\)), their counterfactual impact is 0. The expected impact, conditional on success, is \(\mathcal{I}_{CF}\).
Therefore: \[ \mathrm{MV}_K = \mathcal{I}_{CF} \cdot q_K. \] We assume exchangeable applicants are competing for \(N\) slots in a pool of \(K+1\) and \(q_K = N/(K+1)\). \[ \mathcal{I}_{CF} = \frac{\mathrm{MV}_K}{q_K} = \mathrm{MV}_K \cdot \frac{K+1}{N}. \]
9.2 \(\mathcal{I}_{CF}\) for distributions
Exponential (light-tailed): The mean impact is \(\mu\). See Appendix B for \(\mathrm{MV}_K = N\mu/(K+1)\). \[ \mathcal{I}_{CF} = \frac{N\mu/(K+1)}{N/(K+1)} = \mu. \] We find the expected counterfactual impact, conditional on success, is constant.
Fréchet (heavy tail):
Appendix B shows \(\mathrm{MV}_K \propto K^{1/\alpha-1}\) (valid asymptotically for large K). \(q_K \propto 1/K\). \[ \mathcal{I}_{CF} = \frac{\mathrm{MV}_K}{q_K} \propto \frac{K^{1/\alpha-1}}{1/K} = K^{1/\alpha}. \] The expected counterfactual impact conditional on success grows with the pool size \(K\).
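The contrast between the two tails can be checked directly from these ratios. A sketch with illustrative parameter values; `c_n` stands in for a precomputed \(C_N(\alpha)\), and the hypothetical value passed below is arbitrary:

```python
def i_cf_exponential(K, N, mu):
    mv = N * mu / (K + 1)   # MV_K for the exponential tail (Appendix B)
    q = N / (K + 1)         # q_K, the success probability
    return mv / q           # = mu for every K

def i_cf_frechet(K, N, alpha, s, c_n):
    mv = (s * c_n / alpha) * K ** (1 / alpha - 1)  # asymptotic MV_K
    q = N / (K + 1)
    return mv / q

# Exponential: constant in K. Frechet: grows roughly like K**(1/alpha).
print(i_cf_exponential(50, 10, 100.0), i_cf_exponential(500, 10, 100.0))
print(i_cf_frechet(500, 10, 2.0, 50.0, 8.0) / i_cf_frechet(50, 10, 2.0, 50.0, 8.0))
```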
9.3 Equilibrium Condition and Misalignment
The equilibrium \(K_{eq}\) occurs when the private EV — computed using the counterfactual surplus \(\Delta u_{CF}(K)\) — is zero. That happens when the bracketed term in the EV formula (Part A) is zero: \[ \frac{\Delta u_{CF}(K)\,r p(K)}{\rho} - c = 0 \implies \Delta u_{CF}(K) \cdot p(K) = \frac{c\rho}{r}. \] The RHS, \(\frac{c\rho}{r}\), is the effective private cost hurdle per application attempt.
We compare it to the social optimality condition \(K^*\), defined in Part B. For simplicity, we approximate the social cost of failure, \(L_{\text{fail},\delta} \approx \gamma/\delta\), assuming large \(\ell\), and set \(\delta=\rho\). The optimality condition \(\mathrm{MV}_K \cdot H_\delta = L_{\text{fail},\delta}\) becomes: \[ \mathrm{MV}_K = \gamma. \]
Analyzing Over/Under Entry:
To illustrate the misalignment, consider a simplified case where private financial losses (pay cuts) are negligible, so \(\Delta u_{CF}(K) \approx \mathcal{I}_{CF}(K)\). Also, assume that the per-application success rate \(p(K)\) approximates the overall success probability \(q_K\).
In this case, \(\Delta u_{CF}(K) \cdot p(K) \approx \mathcal{I}_{CF}(K) \cdot q_K = \mathrm{MV}_K\).
The private equilibrium condition simplifies to: \(\mathrm{MV}_K = \frac{c\rho}{r}\). The social optimum condition remains: \(\mathrm{MV}_K = \gamma\).
Since \(\mathrm{MV}_K\) is decreasing in \(K\), \(K_{eq} > K^*\) (Over-Entry) occurs if the private threshold is lower than the social threshold: \[ \frac{c\rho}{r} < \gamma. \] This happens when the private cost hurdle per attempt is lower than the social cost rate of failure. As the main text shows using Alice’s parameters (0.69k vs 18k), the inequality often holds strongly, indicating a structural tendency toward over-entry even when agents use sophisticated counterfactual reasoning.
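Numerically, the over-entry comparison is a one-liner. A sketch with clearly hypothetical parameters, not the article's exact calibration of Alice:

```python
def over_entry(c, rho, r, gamma):
    """True when the private hurdle per attempt, c*rho/r, is below the
    social cost rate of failure gamma, i.e. equilibrium over-entry."""
    return c * rho / r < gamma

# Hypothetical: $120k/yr burn, 5%/yr discounting, 12 applications/yr,
# gamma = $18k/yr of foregone public impact during a failed attempt.
private_hurdle = 120_000 * 0.05 / 12
print(private_hurdle, over_entry(120_000, 0.05, 12, 18_000))  # 500.0 True
```

With any remotely similar numbers the inequality holds by more than an order of magnitude, which is the structural tendency toward over-entry described above.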




