Conditioning
Conditional expectation and probability
2020-02-03 — 2025-08-31
Bayes’ rule tells us how to condition our beliefs upon observed data. So far, so uncontroversial. However, there is a lot more to understand about what this means, and about how we can generalize and relax the idea.
I don’t have time to summarize it all, but here are some morsels I should be aware of.
1 Todo
Things I would like to re-derive for my own entertainment:
Conditioning in the sense of measure-theoretic probability. Kolmogorov formulation. Conditioning as Radon-Nikodym derivative. Clunkiness of definition due to niceties of Lebesgue integration.
2 Conditional algebra
TBC
3 Nonparametric
Conditioning in full measure-theoretic glory for Bayesian nonparametrics. E.g., conditioning of Gaussian processes is fun.
4 Generalized conditioning 1: Jeffrey
When I think about belief updating, the classic Bayesian story goes like this: I have a parameter \(\theta\), I see some noisy data \(y\), and I update via Bayes’ rule.
In hierarchical form we often write
\[ p(\theta \mid y) \propto p(\theta)\int p(y \mid x)\,p(x \mid \theta)\,dx, \]
where \(x\) is some latent “true” signal and \(y\) is its noisy measurement. This is the bread-and-butter of Bayesian inference: specify a generative model for the noise, marginalize it out, and we get our posterior.
So far, so familiar.
Sometimes, though, I don’t want to assume such a mechanism. Suppose all I know is something like: “I’m now 80% confident the test is positive, 20% that it’s negative.”
Notice the difference: I’m not asserting a likelihood of observations given a latent truth. I’m just asserting a new distribution over a partition \(\{B_1, B_2\} = \{\text{positive}, \text{negative}\}\).
This is where Jeffrey’s conditioning (or Jeffrey’s rule) comes in. Jeffrey’s idea was simple: if we revise the probabilities of a partition \(\{B_i\}\) to new values \(p'(B_i)\), then we should update any event \(A\) by
\[ p'(A) = \sum_i p(A \mid B_i)\, p'(B_i). \]
Key assumption: the conditional beliefs inside each cell of the partition, \(p(A\mid B_i)\), stay exactly as they were before. That is the rigidity assumption, and the whole scheme is sometimes called probability kinematics for some reason.
So Jeffrey updating is like saying: “shift the mixture weights on the partition, but keep the conditional structure inside each partition block intact.”
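For concreteness, here is a minimal sketch of Jeffrey updating on a finite space. The function name and example numbers are my own illustration, nothing standard:

```python
import numpy as np

def jeffrey_update(p, blocks, new_marginals):
    """Jeffrey's rule on a finite space: rescale p inside each
    partition block so the blocks carry the revised masses, leaving
    the conditional distribution within each block untouched."""
    q = np.asarray(p, dtype=float).copy()
    for B, m in zip(blocks, new_marginals):
        q[B] *= m / q[B].sum()
    return q

# Two-cell partition {B1, B2}, revised from (0.4, 0.6) to (0.8, 0.2).
p = np.array([0.1, 0.3, 0.4, 0.2])
q = jeffrey_update(p, blocks=[[0, 1], [2, 3]], new_marginals=[0.8, 0.2])
print(q)            # [0.2, 0.6, 0.1333..., 0.0667...]
print(q[:2].sum())  # 0.8, as imposed; within-block conditionals unchanged
```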
5 Contrast with hierarchical Bayes
That’s quite different from the hierarchical noisy-observation approach:
In the hierarchical Bayes route, we specify a likelihood \(p(y\mid x)\), combine it with our prior, and update. This often changes the conditional structure \(p(A\mid B_i)\), because the likelihood reaches inside the partition and reshuffles things.
In the Jeffrey route, we don’t model the noise mechanism at all. Instead, we impose the new marginals \(p'(B_i)\) directly, and adjust the prior by the smallest possible change (in fact, Jeffrey’s update is the KL projection of the prior onto the set of distributions satisfying those marginals).
So: hierarchical Bayes is mechanism-based, Jeffrey is constraint-based.
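That minimal-change claim is easy to check numerically: minimize \(\operatorname{KL}(q\,\|\,p)\) subject to the marginal constraints and compare with the closed-form Jeffrey update. A sketch on a made-up six-point space, assuming scipy is available:

```python
import numpy as np
from scipy.optimize import minimize

# A small discrete space with a two-cell partition.
p = np.array([0.10, 0.20, 0.10, 0.25, 0.20, 0.15])
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]
new_marginals = [0.8, 0.2]  # revised beliefs p'(B_i)

# Jeffrey's rule in closed form: rescale within each block.
q_jeffrey = p.copy()
for B, m in zip(blocks, new_marginals):
    q_jeffrey[B] *= m / p[B].sum()

# Numerically minimize KL(q || p) subject to the marginal constraints.
def kl(q):
    return float(np.sum(q * np.log(q / p)))

constraints = [
    {"type": "eq", "fun": lambda q, B=B, m=m: q[B].sum() - m}
    for B, m in zip(blocks, new_marginals)
]
result = minimize(kl, x0=np.full(6, 1 / 6),
                  bounds=[(1e-9, 1)] * 6, constraints=constraints)

print(np.round(result.x, 4))
print(np.round(q_jeffrey, 4))  # the two should agree
```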
There are rare cases where the two coincide: namely when the likelihood you chose implies exactly the same new marginals and, crucially, leaves the conditionals \(p(A\mid B_i)\) intact.
In general, the two updates diverge. If I feed “80% positive” evidence into Jeffrey, I’ll get a linear mixture of prior conditionals. If I encode the same statement as a noisy likelihood and marginalize, I’ll usually get a posterior that’s nonlinear and different.
6 A toy example: the noisy coin flip
Let’s make it concrete.
Imagine I have a coin that might be fair (\(\theta=0.5\)) or biased to heads (\(\theta=0.9\)). I put equal prior weight on the two hypotheses: \(p(\theta=0.5)=p(\theta=0.9)=0.5\).
Now suppose I “see” a coin flip, but only through a noisy channel. Maybe a dodgy sensor reports the outcome, and it gets it right 80% of the time and wrong 20% of the time.
6.1 Route 1: Hierarchical Bayes with noise model
Let \(x\) be the true flip, \(y\) the sensor reading. The generative model is:
- \(p(x \mid \theta)\) is Bernoulli(\(\theta\)).
- \(p(y \mid x)\) is \(0.8\) if \(y=x\), \(0.2\) otherwise.
Suppose the sensor reports \(y=\text{heads}\).
Then the likelihood for \(\theta\) is:
\[ p(y=\text{heads} \mid \theta) = 0.8 \cdot p(x=\text{heads}\mid \theta) + 0.2 \cdot p(x=\text{tails}\mid \theta). \]
So for \(\theta=0.5\), that’s \(0.8(0.5) + 0.2(0.5) = 0.5\). For \(\theta=0.9\), that’s \(0.8(0.9) + 0.2(0.1) = 0.74\).
Bayes’ rule gives posterior weights proportional to prior \(\times\) likelihood:
- \(\theta=0.5\): \(0.5 \times 0.5 = 0.25\)
- \(\theta=0.9\): \(0.5 \times 0.74 = 0.37\)
Normalize: posterior is about \((0.40, 0.60)\) in favor of the biased coin.
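This route is a couple of lines of numpy, using the numbers above:

```python
import numpy as np

prior = np.array([0.5, 0.5])   # p(theta = 0.5), p(theta = 0.9)
theta = np.array([0.5, 0.9])   # p(x = heads | theta)

# Marginalize out the true flip x: p(y = heads | theta).
lik = 0.8 * theta + 0.2 * (1 - theta)   # [0.5, 0.74]

posterior = prior * lik
posterior /= posterior.sum()
print(posterior)  # ~ [0.403, 0.597]
```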
6.2 Route 2: Jeffrey conditioning
Now let’s suppose I don’t model the sensor. Instead I just say: “Based on what I’ve seen, I’m now 80% confident the coin flip was heads, 20% tails.”
That is: I treat \(\{x=\text{heads}, x=\text{tails}\}\) as my partition, and update to \(p'(x=\text{heads})=0.8, \; p'(x=\text{tails})=0.2\).
Jeffrey’s rule says for each \(\theta\):
\[ p'(\theta) = \sum_{x \in \{\text{H},\text{T}\}} p(\theta \mid x)\, p'(x). \]
But \(p(\theta \mid x) \propto p(\theta) p(x \mid \theta)\). Do the maths:
- If \(x=\text{heads}\), posterior weights are proportional to \((0.5 \times 0.5, \; 0.5 \times 0.9) = (0.25, 0.45)\), which normalize to \((0.36, 0.64)\).
- If \(x=\text{tails}\), weights are proportional to \((0.5 \times 0.5, \; 0.5 \times 0.1) = (0.25, 0.05)\), which normalize to \((0.83, 0.17)\).
Now average them with the Jeffrey weights \((0.8, 0.2)\):
\[ p'(\theta=0.5) = 0.8(0.36) + 0.2(0.83) \approx 0.45, \\ p'(\theta=0.9) = 0.8(0.64) + 0.2(0.17) \approx 0.55. \]
So the Jeffrey posterior is about \((0.45, 0.55)\): much closer to the prior balance, and less decisive than the hierarchical Bayes update.
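The same computation in numpy, reusing the prior and hypotheses from Route 1:

```python
import numpy as np

prior = np.array([0.5, 0.5])   # p(theta = 0.5), p(theta = 0.9)
theta = np.array([0.5, 0.9])   # p(x = heads | theta)

# Ordinary Bayes posteriors within each cell of the partition.
post_H = prior * theta;       post_H /= post_H.sum()   # given x = heads
post_T = prior * (1 - theta); post_T /= post_T.sum()   # given x = tails

# Jeffrey's rule: mix with the revised marginals p'(H) = 0.8, p'(T) = 0.2.
jeffrey = 0.8 * post_H + 0.2 * post_T
print(jeffrey)  # ~ [0.452, 0.548]
```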
6.3 Comparing the two
- Hierarchical Bayes gave \((0.40, 0.60)\) for (fair, biased).
- Jeffrey gave \((0.45, 0.55)\).
They both shift toward the biased coin, but Jeffrey’s update is weaker. Why? Because Jeffrey respects the rigidity assumption: \(p(A\mid B)\) is frozen, and only the mixture weights shift. The likelihood-driven hierarchical update, by contrast, reaches inside and reshuffles the conditional structure, which in this case pulls harder toward \(\theta=0.9\).
This tiny coin example highlights the difference:
- If you trust a noise model, you get sharper inferences.
- If you only trust revised marginals, you get Jeffrey’s softer, minimal-change update.
And that’s the essence of the distinction.
7 Disintegration
8 BLUE in Gaussian conditioning
e.g. Wilson et al. (2021):
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space and denote by \((\boldsymbol{a}, \boldsymbol{b})\) a pair of square integrable, centred random variables on \(\mathbb{R}^{n_{a}} \times \mathbb{R}^{n_{b}}\). The conditional expectation is the unique random variable that solves the optimization problem \[ \mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})=\underset{\hat{\boldsymbol{a}}=f(\boldsymbol{b})}{\arg \min } \mathbb{E}\|\hat{\boldsymbol{a}}-\boldsymbol{a}\|^{2}. \] In words then, \(\mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})\) is the measurable function of \(\boldsymbol{b}\) that best predicts \(\boldsymbol{a}\) in the sense of minimizing the mean square error above.
Uncorrelated, jointly Gaussian random variables are independent. Consequently, when \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are jointly Gaussian, the optimal predictor \(\mathbb{E}(\boldsymbol{a} \mid \boldsymbol{b})\) manifests as the best linear unbiased estimator \(\hat{\boldsymbol{a}}=\mathbf{S} \boldsymbol{b}\) of \(\boldsymbol{a}\), with \(\mathbf{S}=\operatorname{Cov}(\boldsymbol{a}, \boldsymbol{b})\operatorname{Cov}(\boldsymbol{b})^{-1}\).
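A quick Monte Carlo sanity check of that claim; the mixing matrix \(M\) and noise scale are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b, n = 2, 3, 200_000

# Centred, jointly Gaussian pair: a = M b + independent Gaussian noise,
# so the population answer is E(a | b) = M b.
M = rng.normal(size=(n_a, n_b))
b = rng.normal(size=(n, n_b))
a = b @ M.T + 0.5 * rng.normal(size=(n, n_a))

# Sample covariance of the stacked vector (a, b).
C = np.cov(np.hstack([a, b]).T)
C_ab = C[:n_a, n_a:]
C_bb = C[n_a:, n_a:]

# BLUE coefficient: S = Cov(a, b) Cov(b)^{-1}.
S = C_ab @ np.linalg.inv(C_bb)
print(np.round(S - M, 2))  # ~ zeros, up to Monte Carlo error
```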
9 Incoming
H.H. Rugh’s answer is nice.