Bayes’ rule tells us how to condition our beliefs on observed data. So far, so uncontroversial. However, there is a lot more to understand about what this means, and about how we can generalize and relax the idea.

I don’t have time to summarize everything, but here are some morsels I should be aware of.


1 Todo

Things I would like to re-derive for my own entertainment:

Conditioning in the sense of measure-theoretic probability. Kolmogorov formulation. Conditioning as Radon-Nikodym derivative. Clunkiness of definition due to niceties of Lebesgue integration.

2 Conditional algebra

TBC

3 Nonparametric

Conditioning in full measure-theoretic glory for Bayesian nonparametrics. E.g. conditioning of Gaussian Processes is fun.

4 Generalized conditioning 1: Jeffrey

When I think about belief updating, the classic Bayesian story goes like this: I have a parameter \(\theta\), I see some noisy data \(y\), and I update via Bayes’ rule.

In hierarchical form we often write

\[ p(\theta \mid y) \propto p(\theta)\int p(y \mid x)\,p(x \mid \theta)\,dx, \]

where \(x\) is some latent “true” signal and \(y\) is its noisy measurement. This is the bread-and-butter of Bayesian inference: specify a generative model for the noise, marginalize it out, and we get our posterior.
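A minimal numerical sketch of that recipe, under a toy Gaussian model I’m assuming purely for illustration (\(x \mid \theta \sim \mathcal{N}(\theta, 1)\), \(y \mid x \sim \mathcal{N}(x, 0.5^2)\)): marginalize the latent \(x\) out by quadrature and form a grid posterior over \(\theta\).

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Toy generative model (assumed for illustration):
#   x | theta ~ N(theta, 1),   y | x ~ N(x, 0.5^2)
def marginal_lik(y, theta):
    """p(y | theta) = ∫ p(y | x) p(x | theta) dx, by numerical quadrature."""
    integrand = lambda x: stats.norm.pdf(y, loc=x, scale=0.5) * stats.norm.pdf(x, loc=theta, scale=1.0)
    return quad(integrand, -10.0, 10.0)[0]

thetas = np.linspace(-3, 3, 121)           # grid over the parameter
prior = stats.norm.pdf(thetas, 0.0, 1.0)
prior /= prior.sum()

y_obs = 1.2
lik = np.array([marginal_lik(y_obs, t) for t in thetas])
post = prior * lik
post /= post.sum()                         # grid approximation to p(theta | y)

# Sanity check: in this toy model the exact marginal is N(theta, 1 + 0.25).
assert np.isclose(marginal_lik(y_obs, 0.3),
                  stats.norm.pdf(y_obs, 0.3, np.sqrt(1.25)), atol=1e-6)
```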

So far, so familiar.

Sometimes, though, I don’t want to assume such a mechanism. Suppose all I know is something like: “I’m now 80% confident the test is positive, 20% that it’s negative.”

Notice the difference: I’m not asserting a likelihood of observations given a latent truth. I’m just asserting a new distribution over a partition \(\{B_1, B_2\} = \{\text{positive}, \text{negative}\}\).

This is where Jeffrey’s conditioning (or Jeffrey’s rule) comes in. Jeffrey’s idea was simple: if we revise the probabilities of a partition \(\{B_i\}\) to new values \(p'(B_i)\), then we should update any event \(A\) by

\[ p'(A) = \sum_i p(A \mid B_i)\, p'(B_i). \]

Key assumption: the conditional beliefs inside each cell of the partition, \(p(A\mid B_i)\), stay exactly as they were before. That assumption is called rigidity, and the resulting update scheme is what Jeffrey called probability kinematics.

So Jeffrey updating is like saying: “shift the mixture weights on the partition, but keep the conditional structure inside each partition block intact.”
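On a finite outcome space the rule is just a block-wise rescaling of the prior; here’s a minimal sketch (the function name and the toy partition are mine, purely illustrative):

```python
import numpy as np

def jeffrey_update(p, blocks, new_marginals):
    """Jeffrey's rule on a finite space.

    p             : prior probabilities over outcomes, shape (n,)
    blocks        : list of index arrays partitioning {0, ..., n-1}
    new_marginals : revised probabilities p'(B_i), one per block

    Within each block the conditionals p(. | B_i) are left untouched;
    only the block masses are rescaled to the new marginals.
    """
    q = np.asarray(p, dtype=float).copy()
    for B, m in zip(blocks, new_marginals):
        q[B] *= m / q[B].sum()
    return q

# Tiny example: four outcomes, partition {0,1} vs {2,3}, revise the marginals to (0.8, 0.2).
p = np.array([0.1, 0.3, 0.4, 0.2])
q = jeffrey_update(p, blocks=[np.array([0, 1]), np.array([2, 3])], new_marginals=[0.8, 0.2])
print(q)                         # [0.2, 0.6, 0.133..., 0.066...]
print(q[:2].sum(), q[2:].sum())  # 0.8, 0.2 as imposed
```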

5 Contrast with hierarchical Bayes

That’s quite different from the hierarchical noisy-observation approach:

  • In the hierarchical Bayes route, we specify a likelihood \(p(y\mid x)\), combine it with our prior, and update. This often changes the conditional structure \(p(A\mid B_i)\), because the likelihood reaches inside the partition and reshuffles things.

  • In the Jeffrey route, we don’t model the noise mechanism at all. Instead, we impose the new marginals \(p'(B_i)\) directly, and adjust the prior by the smallest possible change: in fact, the Jeffrey update is the KL projection of the prior onto the set of distributions with those marginals (see the sketch below).

So: hierarchical Bayes is mechanism-based, Jeffrey is constraint-based.
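A throwaway numerical check of that KL claim, assuming a finite space and using scipy’s default SLSQP solver: projecting the prior onto the set of distributions with the imposed block marginals, in the sense of minimizing \(\mathrm{KL}(q \,\|\, p)\), recovers the closed-form Jeffrey update.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))                        # arbitrary prior on 6 outcomes
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]  # a two-block partition
target = np.array([0.8, 0.2])                        # revised marginals p'(B_i)

# Closed-form Jeffrey update: rescale each block to its new marginal.
q_jeffrey = p.copy()
for B, t in zip(blocks, target):
    q_jeffrey[B] *= t / p[B].sum()

# Numerical KL projection of p onto {q : q(B_i) = p'(B_i)}.
def kl(q):
    q = np.clip(q, 1e-12, None)
    return float(np.sum(q * np.log(q / p)))

constraints = [{"type": "eq", "fun": lambda q: q.sum() - 1.0}]
for B, t in zip(blocks, target):
    constraints.append({"type": "eq", "fun": lambda q, B=B, t=t: q[B].sum() - t})

res = minimize(kl, x0=p, bounds=[(1e-9, 1.0)] * 6, constraints=constraints)
print(np.allclose(res.x, q_jeffrey, atol=1e-4))      # True: the KL projection is the Jeffrey update
```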

There are rare cases where the two coincide: namely when the likelihood you chose implies exactly the same new marginals and, crucially, leaves the conditionals \(p(A\mid B_i)\) intact.

In general, the two updates diverge. If I feed “80% positive” evidence into Jeffrey, I’ll get a linear mixture of prior conditionals. If I encode the same statement as a noisy likelihood and marginalize, I’ll usually get a posterior that’s nonlinear and different.

6 A toy example: the noisy coin flip

Let’s make it concrete.

Imagine I have a coin that might be fair (\(\theta=0.5\)) or biased to heads (\(\theta=0.9\)). I put equal prior weight on the two hypotheses: \(p(\theta=0.5)=p(\theta=0.9)=0.5\).

Now suppose I “see” a coin flip, but only through a noisy channel. Maybe a dodgy sensor reports the outcome, and it gets it right 80% of the time and wrong 20% of the time.

6.1 Route 1: Hierarchical Bayes with noise model

Let \(x\) be the true flip, \(y\) the sensor reading. The generative model is:

  • \(p(x \mid \theta)\) is Bernoulli(\(\theta\)).
  • \(p(y \mid x)\) is \(0.8\) if \(y=x\), \(0.2\) otherwise.

Suppose the sensor reports \(y=\text{heads}\).

Then the likelihood for \(\theta\) is:

\[ p(y=\text{heads} \mid \theta) = 0.8 \cdot p(x=\text{heads}\mid \theta) + 0.2 \cdot p(x=\text{tails}\mid \theta). \]

So for \(\theta=0.5\), that’s \(0.8(0.5) + 0.2(0.5) = 0.5\). For \(\theta=0.9\), that’s \(0.8(0.9) + 0.2(0.1) = 0.74\).

Bayes’ rule gives posterior weights proportional to prior \(\times\) likelihood:

  • \(\theta=0.5\): \(0.5 \times 0.5 = 0.25\)
  • \(\theta=0.9\): \(0.5 \times 0.74 = 0.37\)

Normalize: posterior is about \((0.40, 0.60)\) in favor of the biased coin.
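The same arithmetic as a few lines of throwaway numpy:

```python
import numpy as np

thetas = np.array([0.5, 0.9])   # fair vs heads-biased
prior = np.array([0.5, 0.5])

# Sensor model: reports the true flip correctly with probability 0.8.
# Marginal likelihood of the report y = heads under each theta:
lik = 0.8 * thetas + 0.2 * (1 - thetas)   # [0.5, 0.74]

post = prior * lik
post /= post.sum()
print(post)                     # ~[0.40, 0.60]
```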

6.2 Route 2: Jeffrey conditioning

Now let’s suppose I don’t model the sensor. Instead I just say: “Based on what I’ve seen, I’m now 80% confident the coin flip was heads, 20% tails.”

That is: I treat \(\{x=\text{heads}, x=\text{tails}\}\) as my partition, and update to \(p'(x=\text{heads})=0.8, \; p'(x=\text{tails})=0.2\).

Jeffrey’s rule says for each \(\theta\):

\[ p'(\theta) = \sum_{x \in \{\text{H},\text{T}\}} p(\theta \mid x)\, p'(x). \]

But \(p(\theta \mid x) \propto p(\theta) p(x \mid \theta)\). Do the maths:

  • If \(x=\text{heads}\), posterior weights are proportional to \((0.5 \times 0.5, \; 0.5 \times 0.9) = (0.25, 0.45)\), which normalize to \((0.36, 0.64)\).
  • If \(x=\text{tails}\), weights are proportional to \((0.5 \times 0.5, \; 0.5 \times 0.1) = (0.25, 0.05)\), which normalize to \((0.83, 0.17)\).

Now average them with the Jeffrey weights \((0.8, 0.2)\):

\[
\begin{aligned}
p'(\theta=0.5) &= 0.8(0.36) + 0.2(0.83) \approx 0.45, \\
p'(\theta=0.9) &= 0.8(0.64) + 0.2(0.17) \approx 0.55.
\end{aligned}
\]

So the Jeffrey posterior is about \((0.45, 0.55)\): much closer to the prior balance, and less decisive than the hierarchical Bayes update.
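And the Jeffrey route, again as a quick numerical check:

```python
import numpy as np

thetas = np.array([0.5, 0.9])
prior = np.array([0.5, 0.5])

# Conditionals p(theta | x) for a *noiselessly observed* flip x:
post_heads = prior * thetas
post_heads /= post_heads.sum()          # ~[0.36, 0.64]
post_tails = prior * (1 - thetas)
post_tails /= post_tails.sum()          # ~[0.83, 0.17]

# Jeffrey: mix the conditionals with the revised marginals p'(heads) = 0.8, p'(tails) = 0.2.
jeffrey_post = 0.8 * post_heads + 0.2 * post_tails
print(jeffrey_post)                     # ~[0.45, 0.55]
```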

6.3 Comparing the two

  • Hierarchical Bayes gave \((0.40, 0.60)\) for (fair, biased).
  • Jeffrey gave \((0.45, 0.55)\).

They both shift toward the biased coin, but Jeffrey’s update is weaker. Why? Because Jeffrey respects the rigidity assumption: \(p(A\mid B)\) is frozen, and only the mixture weights shift. The likelihood-driven hierarchical update, by contrast, reaches inside and reshuffles the conditional structure, which in this case pulls harder toward \(\theta=0.9\).

This tiny coin example highlights the difference:

  • If you trust a noise model, you get sharper inferences.
  • If you only trust revised marginals, you get Jeffrey’s softer, minimal-change update.

And that’s the essence of the distinction.

7 Disintegration

Chang and Pollard (1997), Kallenberg (2002).

8 BLUE in Gaussian conditioning

e.g. Wilson et al. (2021):

Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space and denote by \((\boldsymbol{a}, \boldsymbol{b})\) a pair of square-integrable, centred random variables on \(\mathbb{R}^{n_{a}} \times \mathbb{R}^{n_{b}}\). The conditional expectation is the (almost surely) unique random variable that solves the optimization problem

\[ \mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})=\underset{\hat{\boldsymbol{a}}=f(\boldsymbol{b})}{\arg \min }\; \mathbb{E}\,\|\hat{\boldsymbol{a}}-\boldsymbol{a}\|^{2}. \]

In words, \(\mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})\) is the measurable function of \(\boldsymbol{b}\) that best predicts \(\boldsymbol{a}\) in the sense of minimizing the mean squared error above.

Uncorrelated, jointly Gaussian random variables are independent. Consequently, when \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are jointly Gaussian, the optimal predictor \(\mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})\) manifests as the best unbiased linear estimator \(\hat{\boldsymbol{a}}=\mathbf{S} \boldsymbol{b}\) of \(\boldsymbol{a}\), with \(\mathbf{S}=\operatorname{Cov}(\boldsymbol{a}, \boldsymbol{b}) \operatorname{Cov}(\boldsymbol{b})^{-1}\).
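A quick Monte Carlo sanity check of that statement (my own sketch, not code from Wilson et al.): with \(\mathbf{S}=\operatorname{Cov}(\boldsymbol{a},\boldsymbol{b})\operatorname{Cov}(\boldsymbol{b})^{-1}\), the empirical mean squared error of the linear predictor \(\mathbf{S}\boldsymbol{b}\) matches the theoretical minimum \(\operatorname{tr}\!\big(\operatorname{Cov}(\boldsymbol{a}) - \mathbf{S}\operatorname{Cov}(\boldsymbol{b},\boldsymbol{a})\big)\).

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b = 2, 3

# A random positive-definite joint covariance for the centred pair (a, b).
M = rng.normal(size=(n_a + n_b, n_a + n_b))
K = M @ M.T + np.eye(n_a + n_b)
K_aa, K_ab, K_bb = K[:n_a, :n_a], K[:n_a, n_a:], K[n_a:, n_a:]

# Conditional-expectation matrix in the Gaussian case: E(a | b) = S b, S = K_ab K_bb^{-1}.
S = np.linalg.solve(K_bb, K_ab.T).T

# Monte Carlo check that S b attains the minimal mean squared error,
# which for jointly Gaussian variables equals trace(K_aa - K_ab K_bb^{-1} K_ba).
z = rng.multivariate_normal(np.zeros(n_a + n_b), K, size=200_000)
a, b = z[:, :n_a], z[:, n_a:]
mse_hat = np.mean(np.sum((a - b @ S.T) ** 2, axis=1))
mse_exact = np.trace(K_aa - S @ K_ab.T)
print(mse_hat, mse_exact)   # should agree to Monte Carlo accuracy
```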

9 Incoming

H.H. Rugh’s answer is nice.

10 References

Alquier, and Gerber. 2024. “Universal Robust Regression via Maximum Mean Discrepancy.” Biometrika.
Chang, and Pollard. 1997. “Conditioning as Disintegration.” Statistica Neerlandica.
Cherief-Abdellatif, and Alquier. 2020. “MMD-Bayes: Robust Bayesian Estimation via Maximum Mean Discrepancy.” In Proceedings of The 2nd Symposium on Advances in Approximate Bayesian Inference.
Cuzzolin. 2021. “A Geometric Approach to Conditioning Belief Functions.”
Kallenberg. 2002. Foundations of Modern Probability. Probability and Its Applications.
Liao, and Wu. 2015. “Reverse Arithmetic-Harmonic Mean and Mixed Mean Operator Inequalities.” Journal of Inequalities and Applications.
Matthies, Zander, Rosić, et al. 2016. “Parameter Estimation via Conditional Expectation: A Bayesian Inversion.” Advanced Modeling and Simulation in Engineering Sciences.
Mond, and Pec̆arić. 1996. “A Mixed Arithmetic-Mean-Harmonic-Mean Matrix Inequality.” Linear Algebra and Its Applications, Linear Algebra and Statistics: In Celebration of C. R. Rao’s 75th Birthday (September 10, 1995).
Schervish. 2012. Theory of Statistics. Springer Series in Statistics.
Sharma. 2008. “Some More Inequalities for Arithmetic Mean, Harmonic Mean and Variance.” Journal of Mathematical Inequalities.
Smets. 2013. “Jeffrey’s Rule of Conditioning Generalized to Belief Functions.”
Weisberg. 2015. “Updating, Undermining, and Independence.” British Journal for the Philosophy of Science.
Wilson, Borovitskiy, Terenin, et al. 2021. “Pathwise Conditioning of Gaussian Processes.” Journal of Machine Learning Research.
Wroński. 2016. “Belief Update Methods and Rules—Some Comparisons.” Ergo, an Open Access Journal of Philosophy.