Flavours of Bayesian conditioning
Conditional expectation and probability
2020-02-03 — 2026-01-06
Wherein Jeffrey conditioning is presented as rescaling partition weights while preserving within‑cell conditionals, and a noisy‑coin example is used to compare numeric posteriors from two updating routes.
Bayes’ rule tells us how to condition our beliefs on observed data. So far, so uncontroversial. However, there’s a lot more to understand about what it means, and how we can generalize and relax it.
Here are some morsels I should be aware of.
1 Todo
Things I’d like to re-derive for my own entertainment:
Conditioning in the sense of measure-theoretic probability. Kolmogorov formulation: conditioning as a Radon–Nikodym derivative. Clunkiness of the definition due to niceties of Lebesgue integration.
2 Conditional algebra
TBC. See e.g. (Mečíř 2020; Taraldsen 2019).
3 Nonparametric
Conditioning in full measure-theoretic glory for Bayesian nonparametrics. E.g., conditioning of Gaussian Processes is fun.
4 Learned conditioning
5 Generalized conditioning 1: Jeffrey
When I think about belief updating, I usually mean the classic Bayesian story: there is an unknown parameter \(\theta\), I observe some data \(y\), and I update by Bayes’ rule.
A common way to present this is with a latent “true” signal \(x\) and a noisy measurement \(y\): \[ p(\theta \mid y) \propto p(\theta)\,p(y\mid \theta), \qquad p(y\mid \theta)=\int p(y\mid x)\,p(x\mid \theta)\,dx. \] This is bread-and-butter Bayesian inference: pick a noise model \(p(y\mid x)\), marginalize over \(x\), and we get a likelihood for \(\theta\).
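To make that concrete, here is a minimal sketch with a discrete latent \(x\), so the integral becomes a sum. The particular choices (a Bernoulli latent and a symmetric noise channel with made-up accuracy 0.9) are placeholders for illustration, not part of any model above.

```python
import numpy as np

x_vals = np.array([0, 1])                       # latent signal: 0 or 1

def p_x_given_theta(theta):
    # Bernoulli(theta) over the latent x
    return np.array([1 - theta, theta])

def p_y_given_x(y):
    # a symmetric noise channel: y agrees with x with probability 0.9 (made-up number)
    return np.where(x_vals == y, 0.9, 0.1)

def likelihood(y, theta):
    # p(y | theta) = sum_x p(y | x) p(x | theta): marginalize the latent out
    return np.sum(p_y_given_x(y) * p_x_given_theta(theta))

print(likelihood(1, 0.3))   # 0.9 * 0.3 + 0.1 * 0.7 = 0.34
```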
5.1 When the evidence isn’t a likelihood
Sometimes my “evidence” doesn’t arrive as a clean observation \(y=y_0\) or as a model I’m willing to write down.
Example: I glance at a blurry test readout. After looking I come away with:

> I’m 80% sure the result is positive and 20% sure it’s negative.
That’s not a statement like “the test is correct 80% of the time” (a likelihood claim). It’s a statement about my revised marginal beliefs over the partition. \[ \{B_+,B_-\}=\{\text{positive},\text{negative}\}. \] Concretely, I’m asserting new probabilities \(p'(B_+)=0.8\) and \(p'(B_-)=0.2\) without committing to any mechanism that produced my uncertainty.
This is where Jeffrey conditioning (Richard Jeffrey, not Harold Jeffreys) comes in.
6 Jeffrey’s rule
Start with a prior probability \(p(\cdot)\) over a space of possibilities and a partition \(\{B_i\}\). Suppose we decide our updated beliefs must satisfy the marginal weights \(p'(B_i)\), but we do not learn which \(B_i\) is true.
Jeffrey’s proposal is:
- Keep the old conditional beliefs within each cell: \[ p'(A\mid B_i)=p(A\mid B_i)\quad\text{for all }A,i, \]
- Then the updated probability of any event \(A\) is the mixture \[ p'(A)=\sum_i p(A\mid B_i)\,p'(B_i). \]
An equivalent pointwise form is: for \(\omega\in B_i\), \[ p'(\omega)=p(\omega\mid B_i)\,p'(B_i). \] That is, we rescale each partition block and leave its internal shape unchanged.
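Here is a minimal sketch of that rescaling for a finite outcome space. The prior, the partition labels, and the function name `jeffrey_update` are all illustrative.

```python
import numpy as np

def jeffrey_update(prior, cell, new_marginals):
    """Jeffrey's rule: rescale each partition block to its new weight,
    keeping the within-block (conditional) shape unchanged."""
    prior = np.asarray(prior, dtype=float)
    cell = np.asarray(cell)
    posterior = np.empty_like(prior)
    for i, m in enumerate(new_marginals):
        mask = (cell == i)
        block = prior[mask]
        posterior[mask] = block / block.sum() * m
    return posterior

# Sanity check: p'(B_0) = 1 recovers ordinary conditioning on B_0.
prior = np.array([0.1, 0.2, 0.3, 0.4])
cell = np.array([0, 0, 1, 1])
print(jeffrey_update(prior, cell, [1.0, 0.0]))   # p(. | B_0) on the first block, zeros elsewhere
```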
Two quick sanity checks:
- Ordinary conditioning is a special case. If we become certain that \(B_j\) happened (so \(p'(B_j)=1\)), then \[ p'(A)=p(A\mid B_j). \]
- Minimal-change interpretation. Jeffrey updating is the KL projection (the “I-projection”): among all distributions \(q\) satisfying \(q(B_i)=p'(B_i)\), it chooses the \(q\) minimizing \(D_{\mathrm{KL}}(q\|p)\). Intuitively, we enforce the new marginals with the smallest departure from the prior.
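A rough numerical check of the I-projection claim, assuming `scipy` is available; the four-point prior and the imposed marginals \((0.8, 0.2)\) are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

p = np.array([0.1, 0.2, 0.3, 0.4])          # prior over four outcomes
cell = np.array([0, 0, 1, 1])               # partition {B_0, B_1}
target = np.array([0.8, 0.2])               # imposed new marginals p'(B_i)

def kl(q):
    # D_KL(q || p)
    return np.sum(q * np.log(q / p))

cons = [
    {"type": "eq", "fun": lambda q: q.sum() - 1.0},
    {"type": "eq", "fun": lambda q: q[cell == 0].sum() - target[0]},
]
res = minimize(kl, x0=np.full(4, 0.25), bounds=[(1e-9, 1)] * 4, constraints=cons)

# Jeffrey's rule applied directly:
jeffrey = np.concatenate([p[cell == i] / p[cell == i].sum() * t
                          for i, t in enumerate(target)])
print(np.round(res.x, 4), np.round(jeffrey, 4))   # should agree up to optimizer tolerance
```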
The key assumption—sometimes called probability kinematics—is the rigidity condition \(p'(A\mid B_i)=p(A\mid B_i)\). If the evidence would also change how we reason within a cell, Jeffrey updating is the wrong choice.
7 Contrast with hierarchical Bayes
This genuinely differs from the noisy-observation route.
Hierarchical Bayes (mechanism-based). We write down how observations are generated, \(p(y\mid x)\), combine that with \(p(x\mid \theta)\), and update. The posterior typically does change “inside” the partition cells because the likelihood interacts with \(\theta\) and \(x\).
Jeffrey (constraint-based). We do not model the data-generating mechanism. We directly impose new marginals \(p'(B_i)\) and otherwise stay as close as possible to the prior, which forces the within-cell conditionals to remain as they were.
They coincide only in special cases where the mechanism-driven update happens to leave the relevant conditionals \(p(A\mid B_i)\) intact while producing exactly the same new marginal weights.
8 A toy example: a noisy coin flip
Imagine a coin that is either fair (\(\theta=0.5\)) or heavily biased toward heads (\(\theta=0.9\)). Prior: \[ p(\theta=0.5)=p(\theta=0.9)=0.5. \]
Let \(x\in\{\mathrm{H},\mathrm{T}\}\) be the true flip. Suppose we have a sensor reading \(y\) that reports the flip correctly with probability \(0.8\) and incorrectly with probability \(0.2\).
8.1 Route 1: Hierarchical Bayes with a noise model
Model:

- \(p(x=\mathrm{H}\mid \theta)=\theta\) (Bernoulli\((\theta)\))
- \(p(y=x\mid x)=0.8\), \(p(y\neq x\mid x)=0.2\)
If the sensor reports \(y=\mathrm{H}\), then \[ p(y=\mathrm{H}\mid \theta)=0.8\,p(x=\mathrm{H}\mid \theta)+0.2\,p(x=\mathrm{T}\mid \theta) =0.8\theta+0.2(1-\theta). \] So:

- \(\theta=0.5:\; p(y=\mathrm{H}\mid \theta)=0.5\)
- \(\theta=0.9:\; p(y=\mathrm{H}\mid \theta)=0.74\)
Posterior weights \(\propto\) prior \(\times\) likelihood:

- \(\theta=0.5:\; 0.5\times 0.5=0.25\)
- \(\theta=0.9:\; 0.5\times 0.74=0.37\)
Normalize: \[ p(\theta=0.5\mid y=\mathrm{H})\approx 0.403,\qquad p(\theta=0.9\mid y=\mathrm{H})\approx 0.597. \]
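For what it’s worth, a few lines of NumPy reproduce those numbers:

```python
import numpy as np

thetas = np.array([0.5, 0.9])
prior = np.array([0.5, 0.5])

# p(y = H | theta) = 0.8 * theta + 0.2 * (1 - theta)
lik_yH = 0.8 * thetas + 0.2 * (1 - thetas)   # [0.5, 0.74]

post = prior * lik_yH                        # unnormalized posterior weights
post /= post.sum()
print(np.round(post, 3))                     # [0.403, 0.597]
```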
8.2 Route 2: Jeffrey conditioning
Now suppose we don’t trust or specify a sensor model. All we’re willing to assert is a revised marginal distribution over the partition \(\{x=\mathrm{H},x=\mathrm{T}\}\): \[ p'(x=\mathrm{H})=0.8,\qquad p'(x=\mathrm{T})=0.2. \] NB: This is a constraint on our beliefs about \(x\), not a statement that “the sensor is 80% accurate”. Accuracy is about \(p(y\mid x)\); \(p'(x=\mathrm{H})\) is about our posterior marginal of \(x\). Conflating these two is a category error.
Jeffrey’s rule updates \(\theta\) by mixing the old posteriors given each cell: \[ p'(\theta)=\sum_{x\in\{\mathrm{H},\mathrm{T}\}} p(\theta\mid x)\,p'(x). \]
Compute \(p(\theta\mid x)\propto p(\theta)p(x\mid \theta)\):
- If \(x=\mathrm{H}\), the weights \((0.5\times 0.5,\; 0.5\times 0.9)=(0.25,0.45)\) normalize to \((0.357,0.643)\).
- If \(x=\mathrm{T}\), the weights \((0.5\times 0.5,\; 0.5\times 0.1)=(0.25,0.05)\) normalize to \((0.833,0.167)\).
Now mix with \((0.8,0.2)\): \[ \begin{aligned} p'(\theta=0.5)&=0.8(0.357)+0.2(0.833)\approx 0.452,\\ p'(\theta=0.9)&=0.8(0.643)+0.2(0.167)\approx 0.548. \end{aligned} \]
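And the same check for the Jeffrey route, reusing the prior over \(\theta\) from above:

```python
import numpy as np

thetas = np.array([0.5, 0.9])
prior = np.array([0.5, 0.5])

# Old conditionals p(theta | x), from p(theta) p(x | theta):
post_given_H = prior * thetas
post_given_H /= post_given_H.sum()           # [0.357, 0.643]
post_given_T = prior * (1 - thetas)
post_given_T /= post_given_T.sum()           # [0.833, 0.167]

# Mix with the imposed marginals p'(x = H) = 0.8, p'(x = T) = 0.2:
jeffrey_post = 0.8 * post_given_H + 0.2 * post_given_T
print(np.round(jeffrey_post, 3))             # [0.452, 0.548]
```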
8.3 Comparing the two
- Hierarchical Bayes (with an explicit sensor model) gave about \((0.403,\,0.597)\) for \((\theta=0.5,\theta=0.9)\).
- Jeffrey (imposing only new marginals on \(x\)) gave about \((0.452,\,0.548)\).
Both updates move towards the biased coin, but Jeffrey is less decisive because it doesn’t “reach inside” the partition cells: it only adjusts the mixture weights \(p'(x=\mathrm{H}),p'(x=\mathrm{T})\) and keeps \(p(\theta\mid x)\) fixed.
That’s the trade-off:
- If we’re willing to model the mechanism generating our evidence, we can extract sharper information.
- If all we trust is a set of revised marginal constraints, Jeffrey conditioning is the coherent minimal-change update.
9 Disintegration
10 BLUE in Gaussian conditioning
E.g. Wilson et al. (2021):
> Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space and denote by \((\boldsymbol{a}, \boldsymbol{b})\) a pair of square integrable, centred random variables on \(\mathbb{R}^{n_{a}} \times \mathbb{R}^{n_{b}}\). The conditional expectation is the unique random variable that solves the optimization problem \[ \mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})=\underset{\hat{\boldsymbol{a}}=f(\boldsymbol{b})}{\arg \min }\, \mathbb{E}\|\hat{\boldsymbol{a}}-\boldsymbol{a}\|^{2}. \] In words then, \(\mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})\) is the measurable function of \(\boldsymbol{b}\) that best predicts \(\boldsymbol{a}\) in the sense of minimizing the mean squared error above.
>
> Uncorrelated, jointly Gaussian random variables are independent. Consequently, when \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are jointly Gaussian, the optimal predictor \(\mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})\) manifests as the best unbiased linear estimator \(\hat{\boldsymbol{a}}=\mathbf{S} \boldsymbol{b}\) of \(\boldsymbol{a}\).
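As a sanity check of that claim, here is a small Monte Carlo sketch. It uses the standard expression \(\mathbf{S}=\operatorname{Cov}(\boldsymbol{a},\boldsymbol{b})\operatorname{Cov}(\boldsymbol{b},\boldsymbol{b})^{-1}\) for centred jointly Gaussian variables (not spelled out in the quote above) and a randomly generated joint covariance, so treat it as illustrative rather than as the paper’s construction.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.standard_normal((5, 5))
Sigma = L @ L.T                                   # joint covariance for (a, b): a = first 2, b = last 3
Saa, Sab, Sbb = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

S = Sab @ np.linalg.inv(Sbb)                      # BLUE coefficient matrix, S = Cov(a, b) Cov(b, b)^{-1}

# Monte Carlo check: the predictor S b should beat a perturbed linear predictor in mean squared error.
z = rng.multivariate_normal(np.zeros(5), Sigma, size=100_000)
a, b = z[:, :2], z[:, 2:]
mse_blue = np.mean(np.sum((a - b @ S.T) ** 2, axis=1))
S_perturbed = S + 0.1 * rng.standard_normal(S.shape)
mse_other = np.mean(np.sum((a - b @ S_perturbed.T) ** 2, axis=1))
print(mse_blue < mse_other)                       # True, with overwhelming probability
```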
11 Incoming
H.H. Rugh’s answer is nice.
