Flavours of Bayesian conditioning
Conditional expectation and probability
2020-02-03 — 2026-01-06
Wherein Jeffrey conditioning is presented as rescaling partition weights while preserving within‑cell conditionals, and a noisy‑coin example is used to compare numeric posteriors from two updating routes.
Bayes’ rule tells us how to condition our beliefs on observed data. So far, so uncontroversial. However, there’s a lot more to understand about what it means, and how we can generalize and relax it.
Here are some morsels I should be aware of.
1 Todo
Things I’d like to re-derive for my own entertainment:
Conditioning in the sense of measure-theoretic probability. Kolmogorov formulation: conditioning as a Radon–Nikodym derivative. Clunkiness of the definition due to niceties of Lebesgue integration.
2 Conditional algebra
TBC. See e.g. (Mečíř 2020; Taraldsen 2019).
3 Nonparametric
Conditioning in full measure-theoretic glory for Bayesian nonparametrics. E.g., conditioning of Gaussian Processes is fun.
4 Learned conditioning
5 Generalized conditioning 1: Jeffrey
When I think about belief updating, I usually mean the classic Bayesian story: there is an unknown parameter \(\theta\), I observe some data \(y\), and I update by Bayes’ rule.
A common way to present this is with a latent “true” signal \(x\) and a noisy measurement \(y\): \[ p(\theta \mid y) \propto p(\theta)\,p(y\mid \theta), \qquad p(y\mid \theta)=\int p(y\mid x)\,p(x\mid \theta)\,dx. \] This is bread-and-butter Bayesian inference: pick a noise model \(p(y\mid x)\), marginalize over \(x\), and we get a likelihood for \(\theta\).
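To make that concrete, here is a minimal sketch with a discrete latent \(x\), so the integral becomes a sum. The particular choices (a Bernoulli latent and a symmetric noise channel with made-up accuracy 0.9) are placeholders for illustration, not part of any model above.

```python
import numpy as np

x_vals = np.array([0, 1])                       # latent signal: 0 or 1

def p_x_given_theta(theta):
    # Bernoulli(theta) over the latent x
    return np.array([1 - theta, theta])

def p_y_given_x(y):
    # a symmetric noise channel: y agrees with x with probability 0.9 (made-up number)
    return np.where(x_vals == y, 0.9, 0.1)

def likelihood(y, theta):
    # p(y | theta) = sum_x p(y | x) p(x | theta): marginalize the latent out
    return np.sum(p_y_given_x(y) * p_x_given_theta(theta))

print(likelihood(1, 0.3))   # 0.9 * 0.3 + 0.1 * 0.7 = 0.34
```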
5.1 When the evidence isn’t a likelihood
Sometimes my “evidence” doesn’t arrive as a clean observation \(y=y_0\) or as a model I’m willing to write down.
Example: I glance at a blurry test readout. After looking I come away with:

> I’m 80% sure the result is positive and 20% sure it’s negative.
That’s not a statement like “the test is correct 80% of the time” (a likelihood claim). It’s a statement about my revised marginal beliefs over the partition. \[ \{B_+,B_-\}=\{\text{positive},\text{negative}\}. \] Concretely, I’m asserting new probabilities \(p'(B_+)=0.8\) and \(p'(B_-)=0.2\) without committing to any mechanism that produced my uncertainty.
This is where Jeffrey conditioning (Richard Jeffrey, not Harold Jeffreys) comes in.
6 Jeffrey’s rule
Start with a prior probability \(p(\cdot)\) over a space of possibilities and a partition \(\{B_i\}\). Suppose we decide our updated beliefs must satisfy the marginal weights \(p'(B_i)\), but we do not learn which \(B_i\) is true.
Jeffrey’s proposal is:
- Keep the old conditional beliefs within each cell: \[ p'(A\mid B_i)=p(A\mid B_i)\quad\text{for all }A,i, \]
- Then the updated probability of any event \(A\) is the mixture \[ p'(A)=\sum_i p(A\mid B_i)\,p'(B_i). \]
An equivalent pointwise form is: for \(\omega\in B_i\), \[ p'(\omega)=p(\omega\mid B_i)\,p'(B_i). \] That is, we rescale each partition block and leave its internal shape unchanged.
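Here is a minimal sketch of that rescaling for a finite outcome space. The prior, the partition labels, and the function name `jeffrey_update` are all illustrative.

```python
import numpy as np

def jeffrey_update(prior, cell, new_marginals):
    """Jeffrey's rule: rescale each partition block to its new weight,
    keeping the within-block (conditional) shape unchanged."""
    prior = np.asarray(prior, dtype=float)
    cell = np.asarray(cell)
    posterior = np.empty_like(prior)
    for i, m in enumerate(new_marginals):
        mask = (cell == i)
        block = prior[mask]
        posterior[mask] = block / block.sum() * m
    return posterior

# Sanity check: p'(B_0) = 1 recovers ordinary conditioning on B_0.
prior = np.array([0.1, 0.2, 0.3, 0.4])
cell = np.array([0, 0, 1, 1])
print(jeffrey_update(prior, cell, [1.0, 0.0]))   # p(. | B_0) on the first block, zeros elsewhere
```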
Two quick sanity checks:
- Ordinary conditioning is a special case. If we become certain that \(B_j\) happened (so \(p'(B_j)=1\)), then \[ p'(A)=p(A\mid B_j). \]
- Minimal-change interpretation. Jeffrey updating is the KL projection (the “I-projection”): among all distributions \(q\) satisfying \(q(B_i)=p'(B_i)\), it chooses the \(q\) minimizing \(D_{\mathrm{KL}}(q\|p)\). Intuitively, we enforce the new marginals with the smallest departure from the prior.
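A rough numerical check of the I-projection claim, assuming `scipy` is available; the four-point prior and the imposed marginals \((0.8, 0.2)\) are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

p = np.array([0.1, 0.2, 0.3, 0.4])          # prior over four outcomes
cell = np.array([0, 0, 1, 1])               # partition {B_0, B_1}
target = np.array([0.8, 0.2])               # imposed new marginals p'(B_i)

def kl(q):
    # D_KL(q || p)
    return np.sum(q * np.log(q / p))

cons = [
    {"type": "eq", "fun": lambda q: q.sum() - 1.0},
    {"type": "eq", "fun": lambda q: q[cell == 0].sum() - target[0]},
]
res = minimize(kl, x0=np.full(4, 0.25), bounds=[(1e-9, 1)] * 4, constraints=cons)

# Jeffrey's rule applied directly:
jeffrey = np.concatenate([p[cell == i] / p[cell == i].sum() * t
                          for i, t in enumerate(target)])
print(np.round(res.x, 4), np.round(jeffrey, 4))   # should agree up to optimizer tolerance
```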
The key assumption—sometimes called probability kinematics—is the rigidity condition \(p'(A\mid B_i)=p(A\mid B_i)\). If the evidence would also change how we reason within a cell, Jeffrey updating is the wrong choice.
7 Contrast with hierarchical Bayes
This genuinely differs from the noisy-observation route.
Hierarchical Bayes (mechanism-based). We write down how observations are generated, \(p(y\mid x)\), combine that with \(p(x\mid \theta)\), and update. The posterior typically does change “inside” the partition cells because the likelihood interacts with \(\theta\) and \(x\).
Jeffrey (constraint-based). We do not model the data-generating mechanism. We directly impose new marginals \(p'(B_i)\) and otherwise stay as close as possible to the prior, which forces the within-cell conditionals to remain as they were.
They coincide only in special cases where the mechanism-driven update happens to leave the relevant conditionals \(p(A\mid B_i)\) intact while producing exactly the same new marginal weights.
8 A toy example: a noisy coin flip
Imagine a coin that is either fair (\(\theta=0.5\)) or heavily biased toward heads (\(\theta=0.9\)). Prior: \[ p(\theta=0.5)=p(\theta=0.9)=0.5. \]
Let \(x\in\{\mathrm{H},\mathrm{T}\}\) be the true flip. Suppose we have a sensor reading \(y\) that reports the flip correctly with probability \(0.8\) and incorrectly with probability \(0.2\).
8.1 Route 1: Hierarchical Bayes with a noise model
Model:

- \(p(x=\mathrm{H}\mid \theta)=\theta\) (Bernoulli\((\theta)\))
- \(p(y=x\mid x)=0.8\), \(p(y\neq x\mid x)=0.2\)
If the sensor reports \(y=\mathrm{H}\), then \[ p(y=\mathrm{H}\mid \theta)=0.8\,p(x=\mathrm{H}\mid \theta)+0.2\,p(x=\mathrm{T}\mid \theta) =0.8\theta+0.2(1-\theta). \] So:

- \(\theta=0.5:\; p(y=\mathrm{H}\mid \theta)=0.5\)
- \(\theta=0.9:\; p(y=\mathrm{H}\mid \theta)=0.74\)
Posterior weights \(\propto\) prior \(\times\) likelihood:

- \(\theta=0.5:\; 0.5\times 0.5=0.25\)
- \(\theta=0.9:\; 0.5\times 0.74=0.37\)
Normalize: \[ p(\theta=0.5\mid y=\mathrm{H})\approx 0.403,\qquad p(\theta=0.9\mid y=\mathrm{H})\approx 0.597. \]
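For what it’s worth, a few lines of NumPy reproduce those numbers:

```python
import numpy as np

thetas = np.array([0.5, 0.9])
prior = np.array([0.5, 0.5])

# p(y = H | theta) = 0.8 * theta + 0.2 * (1 - theta)
lik_yH = 0.8 * thetas + 0.2 * (1 - thetas)   # [0.5, 0.74]

post = prior * lik_yH                        # unnormalized posterior weights
post /= post.sum()
print(np.round(post, 3))                     # [0.403, 0.597]
```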
8.2 Route 2: Jeffrey conditioning
Now suppose we don’t trust or specify a sensor model. All we’re willing to assert is a revised marginal distribution over the partition \(\{x=\mathrm{H},x=\mathrm{T}\}\): \[ p'(x=\mathrm{H})=0.8,\qquad p'(x=\mathrm{T})=0.2. \] NB: This is a constraint on our beliefs about \(x\), not a statement that “the sensor is 80% accurate”. Accuracy is about \(p(y\mid x)\); \(p'(x=\mathrm{H})\) is about our posterior marginal of \(x\). Conflating these two is a category error.
Jeffrey’s rule updates \(\theta\) by mixing the old posteriors given each cell: \[ p'(\theta)=\sum_{x\in\{\mathrm{H},\mathrm{T}\}} p(\theta\mid x)\,p'(x). \]
Compute \(p(\theta\mid x)\propto p(\theta)p(x\mid \theta)\):
- If \(x=\mathrm{H}\), the weights \((0.5\times 0.5,\; 0.5\times 0.9)=(0.25,0.45)\) normalize to \((0.357,0.643)\).
- If \(x=\mathrm{T}\), the weights \((0.5\times 0.5,\; 0.5\times 0.1)=(0.25,0.05)\) normalize to \((0.833,0.167)\).
Now mix with \((0.8,0.2)\): \[ \begin{aligned} p'(\theta=0.5)&=0.8(0.357)+0.2(0.833)\approx 0.452,\\ p'(\theta=0.9)&=0.8(0.643)+0.2(0.167)\approx 0.548. \end{aligned} \]
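And the same check for the Jeffrey route, reusing the prior over \(\theta\) from above:

```python
import numpy as np

thetas = np.array([0.5, 0.9])
prior = np.array([0.5, 0.5])

# Old conditionals p(theta | x), from p(theta) p(x | theta):
post_given_H = prior * thetas
post_given_H /= post_given_H.sum()           # [0.357, 0.643]
post_given_T = prior * (1 - thetas)
post_given_T /= post_given_T.sum()           # [0.833, 0.167]

# Mix with the imposed marginals p'(x = H) = 0.8, p'(x = T) = 0.2:
jeffrey_post = 0.8 * post_given_H + 0.2 * post_given_T
print(np.round(jeffrey_post, 3))             # [0.452, 0.548]
```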
8.3 Comparing the two
- Hierarchical Bayes (with an explicit sensor model) gave about \((0.403,\,0.597)\) for \((\theta=0.5,\theta=0.9)\).
- Jeffrey (imposing only new marginals on \(x\)) gave about \((0.452,\,0.548)\).
Both updates move towards the biased coin, but Jeffrey is less decisive because it doesn’t “reach inside” the partition cells: it only adjusts the mixture weights \(p'(x=\mathrm{H}),p'(x=\mathrm{T})\) and keeps \(p(\theta\mid x)\) fixed.
That’s the trade-off:
- If we’re willing to model the mechanism generating our evidence, we can extract sharper information.
- If all we trust is a set of revised marginal constraints, Jeffrey conditioning is the coherent minimal-change update.
9 Disintegration
10 BLUE in Gaussian conditioning
E.g. Wilson et al. (2021):
> Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space and denote by \((\boldsymbol{a}, \boldsymbol{b})\) a pair of square integrable, centred random variables on \(\mathbb{R}^{n_{a}} \times \mathbb{R}^{n_{b}}\). The conditional expectation is the unique random variable that solves the optimization problem \[ \mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})=\underset{\hat{\boldsymbol{a}}=f(\boldsymbol{b})}{\arg \min }\, \mathbb{E}\|\hat{\boldsymbol{a}}-\boldsymbol{a}\|^{2}. \] In words then, \(\mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})\) is the measurable function of \(\boldsymbol{b}\) that best predicts \(\boldsymbol{a}\) in the sense of minimizing the mean squared error above.
>
> Uncorrelated, jointly Gaussian random variables are independent. Consequently, when \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are jointly Gaussian, the optimal predictor \(\mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})\) manifests as the best unbiased linear estimator \(\hat{\boldsymbol{a}}=\mathbf{S} \boldsymbol{b}\) of \(\boldsymbol{a}\).
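As a sanity check of that claim, here is a small Monte Carlo sketch. It uses the standard expression \(\mathbf{S}=\operatorname{Cov}(\boldsymbol{a},\boldsymbol{b})\operatorname{Cov}(\boldsymbol{b},\boldsymbol{b})^{-1}\) for centred jointly Gaussian variables (not spelled out in the quote above) and a randomly generated joint covariance, so treat it as illustrative rather than as the paper’s construction.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.standard_normal((5, 5))
Sigma = L @ L.T                                   # joint covariance for (a, b): a = first 2, b = last 3
Saa, Sab, Sbb = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

S = Sab @ np.linalg.inv(Sbb)                      # BLUE coefficient matrix, S = Cov(a, b) Cov(b, b)^{-1}

# Monte Carlo check: the predictor S b should beat a perturbed linear predictor in mean squared error.
z = rng.multivariate_normal(np.zeros(5), Sigma, size=100_000)
a, b = z[:, :2], z[:, 2:]
mse_blue = np.mean(np.sum((a - b @ S.T) ** 2, axis=1))
S_perturbed = S + 0.1 * rng.standard_normal(S.shape)
mse_other = np.mean(np.sum((a - b @ S_perturbed.T) ** 2, axis=1))
print(mse_blue < mse_other)                       # True, with overwhelming probability
```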
11 Incoming
H.H. Rugh’s answer is nice.
