Markov decision problems

2014-11-27 — 2026-05-15

Wherein a Beer Warehouse Inventory Problem Is Traced From a Simple MDP to a Multi-Agent Decentralised POMDP, With Belief States and Bellman Recursion Introduced Along the Way

bandit problems
control
dynamical systems
linear algebra
optimization
probability
signal processing
statistics
stochastic processes
stringology
time series
Figure 1

Partially Observable Markov Decision Processes (POMDPs) are a meeting-point for many different methods and frameworks for sequential decision-making under uncertainty. They also give us a nice clean formalism for acting in the world, about which it is easy to build intuition, before leaping into the extensions we need in practice (approximations, non-stationary versions, multi-agent versions, online-learning versions, etc).

See also optimal control, reinforcement learning

1 The fully observable case — MDP

A Markov decision process (MDP) is a discrete-time stochastic control problem. We define one by specifying a tuple \((\mathcal{S}, \mathcal{A}, T, R, \gamma)\):

\(\mathcal{S}\)
State space.
\(\mathcal{A}\)
Action space.
\(T(s’ \mid s, a) = P(\mathsf{s}_{t+1} = s’ \mid \mathsf{s}_t = s, \mathsf{a}_t = a)\)
Transition dynamics — (typically, the ‘physics’).
\(R(s, a)\)
Reward, or running cost. A function of the current state and action.
\(\gamma \in [0, 1)\)
Discount factor.

Each time step: the agent sees state \(s_t\), picks action \(\mathsf{a}_t \sim \pi(\cdot \mid s_t)\), pays or earns \(R(s_t, a_t)\), and the world advances to \(\mathsf{s}_{t+1} \sim T(\cdot \mid s_t, a_t)\).

1.1 Running example: beer warehouse

I repeatedly refer to a concrete inventory problem throughout. This is a self-contained inventory problem that we work towards Sterman’s Beer Game (1989) over the course of the chapter — the full multi-player, partially-observable supply chain lands in Section 2. We bring in the complications of the real game only as we need them.

A bottle shop manages stock of a single product: crates of beer. Each week, the state \(s_t \in \{0, 1, \ldots, 20\}\) is crates on hand. The action \(a_t \in \{0, 1, \ldots, 10\}\) is how many crates to order from the distributor, delivered instantly.1 Customer demand is i.i.d. across weeks. We use sans-serif for random variables and italic for their realizations, so \(\mathsf{d}_t \sim \operatorname{Poisson}(4)\) is the random demand at week \(t\) and \(d_t \sim \mathsf{d}_t\) is a specific drawn value. On average about 4 crates per week, but with a fat right tail: a run on beer is always possible.

Transition dynamics:

\[ \mathsf{s}_{t+1} = \operatorname{clamp}_{0}^{20}(s_t + a_t - \mathsf{d}_t). \tag{1}\]

Clamping at zero means excess demand is lost (no backorders); clamping at 20 means the warehouse is full. This defines \(T(s’ \mid s, a)\) implicitly through the demand distribution.

Now, the business side. Each crate sells for $15 at retail. Each crate costs $10 to order from the distributor. Holding a crate in the warehouse for a week costs $2 (refrigeration, opportunity cost of capital, floor space). Stocking out costs $10 per crate of unmet demand in reputational damage — telling a customer “no beer” hurts more than just this week’s lost sale.2

The per-step reward is then:

\[ R(s_t, a_t) = \underbrace{15 \cdot \mathbb{E}[\min(\mathsf{d}_t,\, s_t + a_t)]}_{\text{revenue}} - \underbrace{10\, a_t}_{\text{ordering}} - \underbrace{2 (s_t + a_t)}_{\text{holding}} - \underbrace{10 \cdot \mathbb{E}[\max(\mathsf{d}_t - s_t - a_t,\; 0)]}_{\text{reputational}}. \tag{2}\]

\(R\) can be positive (made a profit this week) or negative (made a loss). Bigger profits are presumably better.

Suppose we have \(s_t = 3\) crates on hand and order \(a_t = 5\).

  • Ordering costs \(10 \times 5 = \$50\).
  • We hold \(3 + 5 = 8\) crates, costing \(2 \times 8 = \$16\).
  • With 8 crates available and \(\mathsf{d}_t \sim \operatorname{Poisson}(4)\), the probability of a stockout \(\mathsf{d}_t > 8\) is about 0.02 — small but not zero — so expected unmet demand is \(\mathbb{E}[\max(\mathsf{d}_t - 8, 0)] \approx 0.024\) crates, costing \(10 \times 0.024 \approx \$0.24\) in reputational damage.
  • Expected sales are \(\mathbb{E}[\min(\mathsf{d}_t, 8)] = \mathbb{E}[\mathsf{d}_t] - \mathbb{E}[\max(\mathsf{d}_t - 8, 0)] \approx 3.98\) crates, generating expected revenue \(15 \times 3.98 \approx \$59.64\).
  • Total: \(R(3, 5) \approx 59.64 - 50 - 16 - 0.24 \approx -\$6.60\).

1.2 The objective and discounting

We want to maximise total discounted reward (the return):

\[ G = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t). \tag{3}\]

Notably, this is an additive objective — per-step rewards just accumulate, which is the natural accounting for a warehouse that pays its bills every week. The expectation \(\mathbb{E}[G]\) is also a linear functional of the joint distribution of \((R_0, R_1, \ldots)\), which is what makes Bellman recursion go through: linearity of expectation lets us swap \(\mathbb{E}[\cdot]\) and \(\sum_t\), and the per-step decomposition follows. Other objectives we might care about — variance of \(G\) (quadratic in the distribution), CVaR (depends on the lower tail), worst-case (an infimum) — are nonlinear functionals of the return distribution, so the swap fails and Bellman recursion does not apply directly. These are sometimes lumped together as risk-sensitive objectives, but the technical obstacle is non-linearity, not risk per se.

We set \(\gamma = 0.95\), which means a dollar of reward next week is equivalent to 95 cents this week. The discount factor makes the infinite sum converge, and gives us a preference for collecting rewards sooner (and deferring costs). For the warehouse it could reflect the time value of money, or just be a mathematical convenience.

1.3 The arrow of time

Notice the \(\mathbb{E}\) in Equation 2: the revenue and stockout terms both depend on the random \(\mathsf{d}_t\), which has not yet been drawn at decision time. So \(R(3, 5) \approx -\$6.60\) is the expected profit for the coming week, not what we will actually realise when that week arrives. The agent has to pick \(a_t\) this week — before seeing \(\mathsf{d}_t\) — so the objective must be a function of what is known then, \((s_t, a_t)\), and randomness over the demand draw gets marginalised into the expectation. Some weeks we make a profit, some weeks we stock out and lose more, etc.; \(R(s_t, a_t)\) is the mean of those outcomes.

Do we ever get rewards based on what actually happened, rather than what we expect to happen? We could instead write the reward as a function of the realised state of the world, \(r(s_t, a_t, d_t)\) now, and average over \(\mathsf{d}_t\) later (inside the Bellman recursion) instead of now. That the two formulations give the same return in expectation is not obvious, at least not to me.

But they are equivalent, at least in expectation. To see this we proceed in two stages: first per-step (a single marginalisation), then telescoped over time.

We define realised reward \(r(s, a, s’)\) that depends on the transition outcome. We can always marginalise to recover \(R(s, a)\):

\[ \begin{aligned} R(s, a) &= \mathbb{E}_{\mathsf{s}' \sim T(\cdot \mid s, a)}\left[ r(s, a, \mathsf{s}')\right]\\ &= \sum_{s'} T(s' \mid s, a)\, r(s, a, s'), \end{aligned} \tag{4}\]

absorbing the “what actually happened?” question into the expectation up front, so the per-step reward \(R(s, a)\) depends only on what is known at decision time. In the warehouse, we could define \(r(s, a, s’)\) to penalise the realized stockout for a specific demand draw rather than its expectation; marginalising over demand recovers Equation 2.

This is a per-step equality of conditional expectations: \(R(s, a) = \mathbb{E}[r(s, a, \mathsf{s}’) \mid s, a]\), by construction of \(R\) from \(r\). Now, when we sum over time, do the expectations of the two returns match?

\[ G' = \sum_t \gamma^t r(s_t, a_t, s_{t+1}) \quad \text{vs.} \quad G = \sum_t \gamma^t R(s_t, a_t). \tag{5}\]

Yes. The marginalisation definition of \(R\) at Equation 4 gives, at every step,

\[ \mathbb{E}[r(s_t, a_t, \mathsf{s}_{t+1}) \mid s_t, a_t] = R(s_t, a_t), \]

where the (inner) expectation is over the random next state \(\mathsf{s}_{t+1}\) given \(s_t, a_t\). Linearity of expectation and the tower property then give

\[ \begin{aligned} \mathbb{E}[G'] &= \sum_t \gamma^t\, \mathbb{E}[r(s_t, a_t, \mathsf{s}_{t+1})] & \text{ by linearity of expectation} \\ &= \sum_t \gamma^t\, \mathbb{E}\!\left[\mathbb{E}[r(s_t, a_t, \mathsf{s}_{t+1}) \mid s_t, a_t]\right] & \text{ by the tower property} \\ &= \sum_t \gamma^t\, \mathbb{E}[R(s_t, a_t)] = \mathbb{E}[G] & \text{ by the definition of $R$}. \end{aligned} \tag{6}\]

Both bookkeeping conventions yield the same expected return, and any agent maximising \(\mathbb{E}[\cdot]\) picks the same actions.

Note: \(G’\) and \(G\) are not the same random variable — \(G’\) has higher variance since within-step randomness has not been averaged out — so we are asking only about their expectations.

The moral ground for reward-is-expectation is a measurability argument: we want the agent’s reward at time \(t\) to be a function of what it knows at that time \(t\). Past randomness is already absorbed into \(s_t\); future randomness has not happened; between-step randomness — the demand draw, in our case — is the bit we marginalise into \(R\) to get something measurable at decision time. The realised-reward formulation \(r(s_t, a_t, \mathsf{s}_{t+1})\) defers that marginalisation to the Bellman expectation; the \(R(s_t, a_t)\) form does it up-front.

As flagged before the derivation, the equality is between expectations — the random variables \(G’\) and \(G\) themselves are different (different variances, different distributions). For ordinary MDP optimisation we only care about \(\mathbb{E}[G]\), so this is fine. It would matter for objectives that aren’t linear functionals of \(G\) — e.g. the risk-sensitive objectives like variance, CVaR, worst-case — see Section 1.2.

1.4 The Markov property

We have mostly suppressed the complexities of measurability/filtrations etc. here, but let us take a moment to make this covert aspect explicit. What independencies do we need to make this go? The Markov property of the transition dynamics is the main one.

\[ P(\mathsf{s}_{t+1} \mid \mathsf{s}_t, \mathsf{a}_t, \mathsf{s}_{t-1}, \mathsf{a}_{t-1}, \ldots) = P(\mathsf{s}_{t+1} \mid \mathsf{s}_t, \mathsf{a}_t) = T(\mathsf{s}_{t+1} \mid \mathsf{s}_t, \mathsf{a}_t). \tag{7}\]

i.e. The future is conditionally independent of the past given the present state. For the warehouse problem: if I know there are 3 crates on the shelf and I’m ordering 5 more, the history of past orders and past demand tells me nothing additional about next week’s stock level.

For our simple warehouse, the Markov property holds because we set it up that way — demand is i.i.d., delivery is instant, and the transition (Equation 1) depends only on current stock plus current order minus current demand. But now suppose the demand varies seasonally (going up around sporting events or holidays). Suddenly we need to know which week of the year it is, and that is not captured by \(\mathcal{S}\). Suddenly the same stock level of 3 crates has different implications depending on whether it’s semester break or not, and the history does contain information (what date is it?) that the state alone doesn’t capture. The Markov property with respect to the stock level alone would in that setting have failed because our state is too small. The “fix” in that case is to expand \(\mathcal{S}\) to include the date, so that the state fully captures the relevant information about the past. Technically what we want is for the state to be a sufficient statistic for the history — it must capture enough of the past that, given the current state, the past adds no further predictive power about the future.

Should we augment the state like this? We return to this in the partially observable case.

Aside on modelling adequacy. Markovity is a structural property of the formal model; it is on us to determine whether the model is a good approximation of reality. A model can be Markov but a poor approximation (we left out features of the world we ‘should’ have included), or a faithful approximation but non-Markov (we included them but the dynamics depend on history we did not encode). Whether \(\mathcal{S}\) adequately captures the world is a separate design judgement, which, in accordance with tradition, we totally ignore for now.

When we break full observability in Section 2, we’ll see this same failure-to-be-Markov formalised: an agent that conditions on the wrong information is implicitly using the wrong state.

1.5 Value functions, and the Bellman recursion

Let’s define more stuff.

A policy \(\pi\) is the agent’s strategy for choosing actions given states. In general it is stochastic: \(\pi(a \mid s)\) is the probability of picking action \(a\) in state \(s\), with \(\pi(\cdot \mid s) \in \Delta(\mathcal{A})\) a distribution over the action space.

To find or improve a policy, it helps to define some concepts that naturally quantify “how good is this state?” and “how good is this action?”. The two standard measures are the value function \(V^\pi(s)\) (“how good is state \(s\) under \(\pi\)?”) and the action-value function \(Q^\pi(s, a)\) (“how good is taking action \(a\) at state \(s\) and then following \(\pi\)?”). A Bellman equation is ‘recursive’ in the sense that the unknown function appears on both sides; for finite MDPs that turns into a linear system in \(|\mathcal{S}|\) unknowns, which we solve to evaluate policies in Section 1.6 and, in a maximised form, to characterise the optimal policy directly in Section 1.7.

We notate deterministic policies with a hat: \(\hat{\pi}\) is a policy whose distribution at every state concentrates on a single action, and we write \(\hat{\pi}(s) \in \mathcal{A}\) for that action (equivalently, \(\hat{\pi}(\cdot \mid s) = \delta_{\hat{\pi}(s)}\)). We index the iterates of policy iteration as \(\hat{\pi}_0, \hat{\pi}_1, \ldots\), with the optimum (if it exists) written \(\hat{\pi}^*\). Both forms appear: stochastic \(\pi(a \mid s)\) inside the general Bellman expectations of this section, and deterministic \(\hat{\pi}\) from Section 1.6 onwards — because greedy improvement of a stochastic policy produces a deterministic policy, and the optimum is deterministic for finite-state-action MDPs.

Ok, now given a policy \(\pi\), the value function is the expected return from state \(s\) under that policy:

\[ V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R(\mathsf{s}_{t+k}, \mathsf{a}_{t+k}) \;\middle|\; \mathsf{s}_t = s\right]. \tag{8}\]

We can split this into the immediate reward plus the discounted future by peeling off the first term,

\[ V^\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V^\pi(s') \right]. \tag{9}\]

This is the Bellman expectation equation. The value of being in state \(s\) equals the immediate reward plus \(\gamma\) times the expected value of wherever we land, averaged over our action choice (policy) and the stochastic transition (dynamics).

For the warehouse with \(\gamma = 0.95\), the value of having 3 crates on hand, under some deterministic ordering policy \(\hat{\pi}\), is this week’s expected reward \(R(3, \hat{\pi}(3))\) — about \(-\$7\) if we order 5 — plus \(0.95\) times the expected value of whatever stock level we find ourselves at next week. That second term averages over demand: if \(d_t = 2\) we’ll have 6 crates left, if \(d_t = 7\) we’ll have 1, and the value function \(V^\pi\) tells us how valuable each of those situations is going forward.

\(V^\pi\) tells us how good each state is under policy \(\pi\), but it doesn’t let us compare actions — it has already averaged over whatever \(\pi\) does. To ask “what if I deviated and ordered 5 instead of 4 this week, then went back to following \(\pi\)?” we need the Q function a.k.a. action-value function, that we previewed above

\[ Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V^\pi(s'), \tag{10}\]

so that \[V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a). \]

\(Q^\pi(3, 5)\) is the expected total discounted reward if we have 3 crates, order 5 this week, and follow \(\pi\) from next week onward. \(Q^\pi(3, 4)\) is the same but ordering 4. Comparing them tells us which order quantity is better at this state, under this policy.

1.6 Optimal policies via policy improvement

Start: we suppose we know the tuple \((\mathcal{S}, \mathcal{A}, T, R, \gamma)\) but not the optimal policy. How do we compute it?

One option is brute force. We have the model in closed form, so given any deterministic policy \(\hat{\pi}\) we can compute its value function \(V^{\hat{\pi}}\) exactly by solving the linear system Equation 9 (21 unknowns, 21 equations for the warehouse), and read off \(Q^{\hat{\pi}}\) via Equation 10. So one route to \(\hat{\pi}^*\) is brute force: enumerate every deterministic policy (\(|\mathcal{A}|^{|\mathcal{S}|}\) of them — for the warehouse, \(11^{21} \approx 10^{22}\)), evaluate each, pick the best. Not practical though.

There is a smarter method which solves the Bellman optimality equation (Equation 13, derived in the next section) indirectly. As written, it involves a \(\max\) and so isn’t linear, but it can be reformulated as a linear program (d’Epenoux 1963; Manne 1960) — whose unique optimal solution is \(V^*\), from which \(\hat{\pi}^*\) falls out. That doesn’t feel very pedagogic or useful to me, and I have no intuition for it, so let’s ignore it.

Policy iteration is, I think, the default. It is due to Howard (1960) (though I probably learned it from a sheaf of poorly attributed photocopied notes, and then forgot it until now, re-learning it before your eyes).

This method is kind of intuitive, IMO. In policy iteration we iteratively improve policies starting from some guess. For my ML peeps, let me clarify that the “iteration” here is over policies, not over time nor over any samples from some dataset. There is no learning to do; this is a computational slog. Where the model is unknown, we’d be doing reinforcement learning instead, which has its own bunch of algorithms. Here we know everything about the system; we want to show that it is also possible to compute how to take the best action.

OK, so we proceed by building a sequence of candidate deterministic policies \(\hat{\pi}_0, \hat{\pi}_1, \hat{\pi}_2, \ldots\), each weakly better than the last, by alternating:

  1. Evaluate \(\hat{\pi}_n\). Solve Equation 9 for \(V^{\hat{\pi}_n}(s)\) at every state \(s\), hence \(Q^{\hat{\pi}_n}(s, a)\) at every state-action pair.

  2. Improve to \(\hat{\pi}_{n+1}\). At each state \(s\), replace \(\hat{\pi}_n\)’s action by the one that maximises \(Q^{\hat{\pi}_n}\):

    \[ \hat{\pi}_{n+1}(s) = \arg\max_a Q^{\hat{\pi}_n}(s, a). \tag{11}\]

The condition for \(\hat{\pi}_n\) to be strictly improvable is that some state has a better action than the one \(\hat{\pi}_n\) uses there:

\[ \max_a Q^{\hat{\pi}_n}(s, a) > V^{\hat{\pi}_n}(s) = Q^{\hat{\pi}_n}\!\bigl(s,\, \hat{\pi}_n(s)\bigr). \tag{12}\]

That is, the best Q-value at \(s\) strictly beats the Q-value of the action \(\hat{\pi}_n\) actually takes there — so \(\hat{\pi}_n\) is committing to a suboptimal action and we have slack to exploit. The improvement step (Equation 11) is called greedy with respect to \(Q^{\hat{\pi}_n}\).

The “greedy” here means local in time: at one state \(s\) we swap to the best-looking action, and immediately revert to \(\hat{\pi}_n\) for every step afterwards. We do not search multiple steps deep, we do not commit to deviating again at any later time, and we do not jointly optimise a sequence of future actions — just one one-step swap. The only lookahead in the choice is the one already baked into \(Q^{\hat{\pi}_n}(s, a) = R(s, a) + \gamma \sum_{s’} T(s’ \mid s, a)\, V^{\hat{\pi}_n}(s’)\), i.e. one step of dynamics under \(\hat{\pi}_n\).

There is a policy improvement theorem that guarantees \(V^{\hat{\pi}_{n+1}}(s) \geq V^{\hat{\pi}_n}(s)\) for every state \(s\), with strict inequality at any state where \(\hat{\pi}_n\) was suboptimal.3

The loop terminates when \(\hat{\pi}_{n+1} = \hat{\pi}_n\) everywhere — i.e. when \(\hat{\pi}_n\) is already greedy with respect to its own Q-function, no single-state improvement remains, and we call this fixed point \(\hat{\pi}^*\). Since there are finitely many deterministic policies (at most \(|\mathcal{A}|^{|\mathcal{S}|}\)) and each iteration strictly improves the value until we hit the fixed point, termination is in finitely many steps.

For the warehouse: start with some ordering rule — say, \(\hat{\pi}_0(s) = 4\) for all stock levels. Evaluate: solve the 21 linear equations of Equation 9 for \(V^{\hat{\pi}_0}(s)\). At each state \(s\), check all 11 possible order quantities and switch to whichever maximises \(Q^{\hat{\pi}_0}(s, a)\), producing \(\hat{\pi}_1\). Maybe at \(s = 1\) we find \(\hat{\pi}_1(1) = 6\) beats \(\hat{\pi}_0(1) = 4\); at \(s = 10\), \(\hat{\pi}_1(10) = 0\). We re-evaluate \(\hat{\pi}_1\), improve from to \(\hat{\pi}_2\), and so on.

1.7 Optimality via value functions

There is a more compact, less (to me) intuitive route to the same \(\hat{\pi}^*\). At the fixed point of policy iteration, the policy is greedy with respect to its own value function — the greedy update is once again the same policy. This means \(V^*\) satisfies

\[ V^*(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V^*(s') \right] = \max_a Q^*(s, a), \tag{13}\]

the Bellman optimality equation, and the optimal policy is

\[ \hat{\pi}^*(s) = \arg\max_a Q^*(s, a). \tag{14}\]

We can also solve for \(V^*\) directly, without ever representing a policy, via value iteration. Start with any \(V_0\) and repeatedly apply

\[ V_{n+1}(s) \leftarrow \max_{a} \left[ R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V_n(s') \right]. \tag{15}\]

This converges because the Bellman operator is a contraction in the sup norm with factor \(\gamma\) — the fixed point exists, is unique, and any starting guess gets there.4

For the warehouse: set \(V_0(s) = 0\) for all \(s\), i.e. pretend there are no future costs. The first iteration, \(V_1(s)\), picks the action that maximises this week’s expected reward at each stock level — a one-week myopic optimum. The second iteration, \(V_2(s)\), optimises this week knowing \(V_1\) handles next week — a two-week horizon. Each pass folds in one more week of lookahead, and because \(\gamma = 0.95\) the far future gets discounted away. After enough iterations, \(V_n\) has converged and we read off \(\hat{\pi}^*\).

Both routes — policy iteration and value iteration — converge to the same \(\hat{\pi}^*\). Policy iteration is often faster in practice (fewer iterations, though each is more expensive because it solves a linear system). Value iteration is simpler and generalises more naturally to approximate settings where we can’t solve the linear system exactly — which is to say, most settings of interest.

When is \(\hat{\pi}^*\) guaranteed to be deterministic? The result we have just relied on — that the argmax of \(Q^*\) gives an optimal stationary policy — is classical for finite-state-action, infinite-horizon discounted MDPs. It breaks in several settings we will or could care about:

  • Constrained MDPs (e.g. expected cost must stay below a budget): genuinely stochastic policies can be necessary, because mixing two deterministic policies lets us hit the constraint precisely without over- or under-shooting it.
  • Nonlinear-functional objectives (variance, CVaR, worst-case — often labelled risk-sensitive; see Section 1.2): the per-step argmax argument requires the linearity of \(\mathbb{E}[\cdot]\) that these objectives lack, and stochastic policies can strictly dominate.
  • Multi-agent / Markov games (the full Beer Game we set up in Section 2.6): Nash equilibria typically require mixed strategies — being unpredictable is itself strategic.
  • POMDPs viewed in observation space: the optimal policy on the belief MDP is deterministic — \(\hat{\pi}^*(b)\) — but viewed as a function of raw observation histories it is effectively stochastic and history-dependent.
  • Function approximation / continuous control: even when a deterministic optimum exists in principle, we parameterise stochastic policies (Gaussian, softmax) because they are easier to optimise via gradients.

For the warehouse MDP none of these apply, so \(\hat{\pi}^*\) is the right object. When we promote the example to the multi-agent Beer Game later, it will not be.

For the general theory — continuous-time limits, Hamilton-Jacobi-Bellman, policy gradient methods — see W. Powell (2007). I also wrote a notebook optimal control which leans more on the street-fightin’ control theory that I’ve been closer to in my prior research.

2 Partial observability: POMDP

Figure 2: Figure: Partial observation of Mrs Brown’s

Everything in the MDP development above rests upon an implausible luxury which most problems I have faced have not furnished me with: we know the state of the world \(s_t\). Every equation — Bellman expectation, Q-function, policy improvement, value iteration — takes \(s_t\) as given.

In the beer model this is not crazy— We know there are 3 crates on the shelf, so we look up \(\hat{\pi}^*(3)\) and order accordingly. That toy world is very simple and very easy to observe. But in most real problems, we don’t have that.

Relaxing the assumption of noiseless observation of the state of the world gives us a partially observable Markov decision process (POMDP). The world still has a state, it still evolves via \(T\), rewards still accrue via \(R\) — the dynamics are still Markov, and a hypothetical fully-informed observer would face the same MDP as before. What changed is on our side: our information about \(s_t\) is now incomplete.

Three things follow in the partially-observed setting:

  1. What we want hasn’t changed: maximise cumulative discounted reward.
  2. What we can condition on has changed. Instead of \(s_t\) we have a history of actions and observations. A policy is now \(\pi(a_t \mid h_t)\), where \(h_t = (a_0, o_1, a_1, o_2, \ldots, o_t)\) grows every step.
  3. We can compress that history into a belief state — a probability distribution over where we think we are — and recover the MDP machinery on a new, larger state space.

Spoiler: it looks amazingly elegant, and turns out to be amazingly less practical.

2.1 Warehouse with delivery delays

In Section 1.1 we gave ourselves instant delivery — order crates, they appear on the shelf the same week. Suppose they don’t. Suppose orders take 1–3 weeks to arrive. We place an order, it goes to the distributor, and some time later a truck shows up and loads up the store room with the crates. But while we are out the front we, for some reason, do not have time to go into the store room and count the crates. And also it is using one of those budget shipping agents where the tracking info is unreliable,5 so we can’t check the order status online.

Let’s ignore that it is slightly contrived for this example. The true state of the system at the end of the week is now \((s_t^{\text{shelf}},\, s_t^{\text{pipeline}})\): crates on the shelf and crates in transit. But we can only see the shelf — we can’t see the truck, and we don’t know exactly when it will arrive.

Concretely: each order \(a_t\) enters a pipeline and arrives after a random delay \(\ell \sim \operatorname{Uniform}\{1, 2, 3\}\) weeks. Each week, we observe \(o_t = s_t^{\text{shelf}}\) — we can count what’s on the shelf — plus how many crates arrived that week. We don’t observe how many crates are still in transit, or when they’ll arrive.

This is a minimal POMDP constructed to isolate one piece of what makes the full Beer Game (Sterman 1989) hard: partial observability about a pipeline of in-transit stock. The real game adds further complications we are not engaging with — backlog (unmet demand carries forward as obligation), multiple players etc. We bring those in at Section 2.6, when we need them. For now, the slightly contrived “pipeline is hidden” setup is just enough to give the POMDP machinery something to work with in a single-player setting.

2.2 POMDP tuples

Formally, a POMDP extends the MDP tuple to \((\mathcal{S}, \mathcal{A}, \mathcal{O}, T, O, R, \gamma)\), adding:

\(\mathcal{O}\)
Observation space — what the agent actually sees each step. For the warehouse: the pair (crates on shelf, crates that arrived this week).
\(O(o \mid s’, a) = P(\mathsf{o}_t = o \mid \mathsf{s}_t = s’, \mathsf{a}_{t-1} = a)\)
Observation model — the probabilistic link between the true state and what we see. For the warehouse: the probability of seeing shelf count \(o^{\text{shelf}}\) and delivery count \(o^{\text{arrived}}\) given the true full state \(s’ = (s’^{\text{shelf}}, s’^{\text{pipeline}})\).

The state is still Markov — the world hasn’t changed. The agent’s information about it is incomplete.

2.3 Belief states

Since we can’t condition on the full state \(s_t\), what can we condition on? The full history \(h_t = (a_0, o_1, a_1, o_2, \ldots, o_t)\) would work — it contains every shelf count and delivery we’ve ever seen, plus every order we’ve placed — but it grows without bound, so a policy \(\pi(a \mid h_t)\) is unwieldy.6

The compression is the belief state \(b_t \in \Delta(\mathcal{S})\), a probability distribution over states:

\[ b_t(s) = P(\mathsf{s}_t = s \mid h_t). \tag{16}\]

For the warehouse: \(b_t\) is a distribution over all possible (shelf, pipeline) pairs. We know the shelf exactly (we can see it), so the uncertainty is entirely about the pipeline — how many crates are in transit and when they’ll arrive. If we ordered 5 crates two weeks ago and they haven’t shown up yet, our belief assigns probability mass to “they’ll come this week” and “they’ll come next week”, with weights depending on the delay distribution and what we’ve observed so far.

Our ordering decision should depend on this whole distribution, not just the shelf count. If we’re fairly sure 5 crates are arriving this week, ordering another 5 would be wasteful. If we think the delivery might be delayed, we’d better reorder. Note also that outcomes are asymmetric — a stockout costs us $15 in foregone revenue plus $10 in reputational damage per crate, vs. $2/crate-week to hold extra — so we care about the shape of the belief, not just its mean.

Given a prior belief \(b_t\), an action \(a_t\), and a new observation \(o_{t+1}\), the belief update follows from Bayes’ rule:

\[ b_{t+1}(s') \propto O(o_{t+1} \mid s', a_t) \sum_{s \in \mathcal{S}} T(s' \mid s, a_t)\, b_t(s). \tag{17}\]

This is a predict-then-correct cycle that should look familiar from state filtering. The sum over \(s\) is the prediction step: propagate each possible current state forward through the transition dynamics, weighted by how likely we think that state is. For the warehouse: for each possible pipeline configuration, advance the clock by one week — some in-transit crates might arrive, demand draws down the shelf, our new order enters the pipeline.

The multiplication by \(O\) is the correction step: given what we actually observed, upweight states consistent with that observation and downweight the rest. For the warehouse: if no delivery arrived this week, we downweight pipeline states where a delivery was due and upweight states where everything is still in transit.

Written as sums, Equation 17 is the finite, enumerable-state form — the toy luxury we lean on to build intuition. For continuous \(\mathcal{S}\) the prediction sum is an integral, and even for discrete-but-factored state, evaluating it exactly is the hard problem we cost in Section 2.5.

2.4 A belief-space MDP

The belief state is itself a sufficient statistic: \(b_t\) captures everything in \(h_t\) that matters for predicting the future and choosing actions. That means \(b_t\) is Markov — and the POMDP is an MDP on belief space \(\Delta(\mathcal{S})\). This sufficient statistic is not the kind we meet first. The sufficient statistics of introductory statistics are finite real vectors — a sample mean, a sample variance. Here the statistic is the whole posterior \(b_t\), an entire distribution over states — the same object that recurs across statistics under many names — and we pay for that in Section 2.5.

We can write the Bellman equation for this belief-space MDP. The reward at belief \(b\), taking action \(a\), is the expected reward over states we might be in:

\[ \rho(b, a) = \sum_{s} b(s)\, R(s, a). \tag{18}\]

For the warehouse: if we believe there’s a 0.6 chance the pipeline is empty and a 0.4 chance 5 crates are arriving this week, our expected cost of ordering \(a\) more crates is a weighted average over those scenarios — overstocking in one, possibly stocking out in the other.

The value function on belief space is

\[ V^*(b) = \max_a \left[ \rho(b, a) + \gamma \sum_{o} P(o \mid b, a)\, V^*\!\bigl(\tau(b, a, o)\bigr) \right], \tag{19}\]

where \(\tau(b, a, o)\) is the updated belief after taking action \(a\) and observing \(o\) (Equation 17), and \(P(o \mid b, a) = \sum_{s’} O(o \mid s’, a) \sum_s T(s’ \mid s, a)\, b(s)\) is the marginal probability of observation \(o\).

This is structurally the same as Equation 13, but the type of the objects has changed. We have replaced state \(s\) with belief-about-state \(b\), reward \(R\) with expected reward \(\rho\) under that belief, and the sum over next states \(s’\) with a sum over observations \(o\) (each of which “deterministically” maps to a new belief via Bayes’ rule). The scare quotes are there because the map \(\tau\) is deterministic while the belief it returns is still a whole distribution over states.

So — in principle, we’ve reduced the POMDP back to an MDP and all the machinery from Section 1.5 onward applies.

2.5 What beliefs look like

The reduction in Section 2.4 looks like a free lunch: do some Bayesian bookkeeping and the POMDP becomes an MDP we already know how to solve. It is not free. It hides three costs, and they multiply.

The warehouse belief happens to be compact. We see the shelf, so all the uncertainty sits in the pipeline; and since we placed the orders ourselves, each outstanding order contributes only a distribution over its remaining delay, conditioned on not having arrived yet. So the belief factorises into a handful of small pieces. That is luck, not law — it is a feature of this toy. In general the posterior over states neither factorises nor admits any finite parameterisation, which is the first cost.

In the MDP we also plan through uncertain futures — demand is stochastic, so next week’s stock is unknown when we order. But that uncertainty resolves for free: next week we look at the shelf and read off \(s_{t+1}\). We never track a belief; nature hands us the state and we move on. In the POMDP it does not.

The state is measure-valued, not a vector. A belief \(b \in \Delta(\mathcal{S})\) is a distribution over states. When \(\mathcal{S}\) is finite that distribution is a point in the \((|\mathcal{S}|-1)\)-simplex — the warehouse’s 21 states give a 20-dimensional simplex — so \(V^*\) is a function on that simplex, not a 21-entry lookup table. When the world state is continuous it is worse: \(b_t\) is an element of an infinite-dimensional space of probability measures, and we cannot in general write it down at all. The sufficient statistic we met first — a sample mean, a few real numbers — is nothing like this object, and that is the generic situation, not the exception. We get a closed form only in the exponential-family / conjugate cases: the Kalman filter’s Gaussian belief is the famous one (see state filtering). Outside them the belief is a measure with no finite description, and calling its space “continuous” undersells the problem — a continuous parameter would be a mercy.

Updating the belief costs compute, every step. Each step folds the new observation in via Equation 17, which is a predict-then-correct pair. The prediction sum \(\sum_s T(s’ \mid s, a_t)\, b_t(s)\) is the \(|\mathcal{S}| \times |\mathcal{S}|\) transition matrix \(T_{a_t}\) applied to the belief vector \(b_t\) — an \(O(|\mathcal{S}|^2)\) matrix–vector product — and the correction (reweight each entry by \(O(o_{t+1} \mid s’, a_t)\) and renormalise) is a cheaper \(O(|\mathcal{S}|)\) pass. The MDP never paid this; \(s_t\) was handed to us, so we acted without ever running an inference step. That \(O(|\mathcal{S}|^2)\) is a luxury of this toy, not a general rate: we get it only by flattening the whole world into 21 enumerated states, so the belief is a length-21 vector and \(T_{a_t}\) an explicit matrix. The moment the state factorises — and any realistic one does, including our own (shelf, pipeline) split — the joint is too large to enumerate, and the exact belief update is just inference in a graphical model, which is #P-hard in general. So we approximate (particle filters, variational methods), with no guarantee that a given POMDP admits a cheap exact update at all (state filtering for the approximations; the complexity of Bayesian inference for why the exact problem gets so costly).

Planning branches over observations. In the MDP Bellman equation we sum over possible next states \(s’\) to take the expectation over the transition, but whichever \(s’\) occurs is observed in time for the next decision, so the agent simply reacts to it. In Equation 19 we sum over observations \(o\), and each \(o\) sends us to a different successor belief \(\tau(b, a, o)\) — hence different future actions, different future observations, and so on. With \(|\mathcal{O}|\) possible observations per step over a horizon \(H\), the tree of reachable beliefs has \(O(|\mathcal{O}|^{H})\) nodes, exponential in the horizon. That \(|\mathcal{O}|^{H}\) is, again, a luxury of this toy: it presumes a finite, enumerable \(\mathcal{O}\), true of our shelf-count-plus-deliveries observation but not of the world at large. Under a continuous observation model the sum over \(o\) is an integral and the branching set is uncountable; \(|\mathcal{O}|^{H}\) stops meaning anything, and the tree is, if anything, worse. The MDP agent never branches: it observes \(s_{t+1}\) for free and iterates a table of \(|\mathcal{S}|\) numbers.

The three costs multiply rather than add. One sweep of POMDP value iteration evaluates \(V^*\) on a measure-valued domain (no finite table), at each visited belief runs an \(O(|\mathcal{S}|^2)\) update (or an intractable one), across an \(O(|\mathcal{O}|^{H})\) tree of observation-contingent futures (an integral, when observations are continuous). The MDP agent asks “what do I do in each of \(|\mathcal{S}|\) states?” The POMDP agent asks “what do I do for each reachable distribution over those states?” — and there is no table for that.

There is some structure to exploit. In the finite case \(V^*(b)\) is piecewise linear and convex in \(b\), which exact solvers lean on — though the number of linear pieces can itself grow exponentially in the horizon. We pick that up when we get to solvers.

2.6 The Beer Game as a Dec-POMDP

Stop here, human reader. the following bit is an LLM-generated sketch of the next section, which I haven’t written properly yet.

The single-echelon warehouse from Section 2.1 had partial observability because of delivery delays — one agent who can’t see the pipeline. The full Beer Game (1989) is worse: partial observability arises from the multi-agent structure itself.

The supply chain has four players, each managing their own inventory:

\[ \text{Customer} \to \text{Retailer} \to \text{Wholesaler} \to \text{Distributor} \to \text{Factory} \]

Each player \(i\) sees only their own shelf \(s_t^{(i)}\) and the orders arriving from their downstream neighbour. They do not see:

  • how much stock anyone else has,
  • what orders other players have placed upstream,
  • how much is in transit between any pair of players,
  • what the end-customer demand actually was (only the retailer sees this, and even then only up to their available stock).

The global state of the system is the vector of all shelf levels and all pipeline contents across the four echelons — a large but well-defined MDP state. A god’s-eye planner who saw all of it would face an ordinary MDP, just a big one, and could compute an optimal joint ordering policy by the machinery of Section 1. The difficulty here is not in the dynamics; it is purely informational.

But no player sees the global state. Each player has a local observation — their own shelf and incoming orders — and must choose how much to order from their upstream supplier. From the retailer’s perspective, the wholesaler’s stock level, the distributor’s backlog, and the factory’s production schedule are all hidden state. The retailer faces a POMDP: the transition of their local state depends on the wholesaler’s actions, which depend on the distributor’s actions, and so on.

This creates the bullwhip effect. Customers want more beer; the retailer sees the spike and orders more from the wholesaler. The wholesaler sees only that enlarged order, not the end-customer demand behind it, reads it as evidence of even higher demand, and amplifies again to the distributor — who does the same to the factory. Each agent’s action is the next agent’s only observation, so the demand signal both degrades and amplifies as it propagates upstream.

Formally, each player \(i\) has:

  • State: their own shelf \(s_t^{(i)}\) and pipeline \(s_t^{(i,\text{pipe})}\) — but the pipeline depends on actions taken by player \(i+1\), which are unobserved.
  • Observation: \((s_t^{(i,\text{shelf})},\, \text{orders received from player } i-1,\, \text{deliveries received from player } i+1)\).
  • Action: how many units to order from player \(i+1\).

Each player faces a POMDP where the hidden state includes the other players’ inventories and decisions. But it’s worse than four separate POMDPs, because the agents interact: player \(i\)’s action (their order) becomes part of player \(i+1\)’s observation (incoming orders), and player \(i+1\)’s action (their shipment) becomes part of player \(i\)’s observation (incoming deliveries). Each agent’s optimal policy depends on what the other agents do, which depends on what they believe, which depends on their observations, which depend on the first agent’s actions.

This is a decentralised POMDP (Dec-POMDP). A Dec-POMDP is defined by a tuple \((\mathcal{I}, \mathcal{S}, \{\mathcal{A}^{(i)}\}, \{\mathcal{O}^{(i)}\}, T, \{O^{(i)}\}, R, \gamma)\) where \(\mathcal{I} = \{1, \ldots, n\}\) is the set of agents, each with their own action space \(\mathcal{A}^{(i)}\) and observation space \(\mathcal{O}^{(i)}\). The transition \(T(s’ \mid s, a^{(1)}, \ldots, a^{(n)})\) depends on the joint action of all agents, but each agent \(i\) chooses \(a_t^{(i)}\) based only on their own observation history \(h_t^{(i)}\). The reward \(R\) is shared — all four players in the supply chain are penalised by total system cost — but they can’t coordinate because they don’t share observations.

The single-agent POMDP was already hard because of the three costs of Section 2.5 — the measure-valued state, the per-step update, and the \(O(|\mathcal{O}|^{H})\) branching over observations. In a Dec-POMDP each agent must also reason about what the other agents believe and will do, without observing their beliefs or actions. Agent \(i\)’s optimal policy depends on agent \(j\)’s policy, which depends on agent \(j\)’s beliefs about agent \(i\)’s policy, and so on. That extra recursion is not a constant factor on the single-agent cost; it changes the complexity class. Finite-horizon single-agent POMDPs are PSPACE-complete; finite-horizon Dec-POMDPs are NEXP-complete (Bernstein et al. 2002).

Sterman’s (1989) experimental result is that human players behave as if they’re solving a much simpler problem — an MDP with \(s_t^{(i,\text{shelf})}\) as the only state. They order to fill the gap between desired and observed stock, ignoring the pipeline. This is a rational policy for the wrong model — the same too-small-state failure of the Markov property we saw in Section 1.4, now with a human committing it. The optimal policy would maintain a belief over the pipeline and the upstream chain and order against it; but that belief is the measure-valued object whose costs we counted in Section 2.5, and the multi-agent coupling only makes it worse.

2.7 POMDP while learning forward propagator

Keen to investigate Collins and Kurniawati (2021), which I believe is also summarised in Kurniawati (2022).

3 References

Ayed, and de Bézenac. 2019. “Learning Dynamical Systems from Partial Observations.” In Advances In Neural Information Processing Systems.
Bernstein, Givan, Immerman, et al. 2002. The Complexity of Decentralized Control of Markov Decision Processes.” Mathematics of Operations Research.
Collins, and Kurniawati. 2019. Partially Observable Planning and Learning for Systems with Non-Uniform Dynamics.”
———. 2021. Locally-Connected Interrelated Network: A Forward Propagation Primitive.” In Algorithmic Foundations of Robotics XIV. Springer Proceedings in Advanced Robotics.
d’Epenoux. 1963. A Probabilistic Production and Inventory Problem.” Management Science.
Howard. 1960. Dynamic Programming and Markov Processes. Dynamic Programming and Markov Processes.
Jaakkola, Singh, and Jordan. 1995. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems.” In Advances in Neural Information Processing Systems.
Kurniawati. 2022. Partially Observable Markov Decision Processes and Robotics.” Annual Review of Control, Robotics, and Autonomous Systems.
Manne. 1960. Linear Programming and Sequential Decisions.” Management Science.
Ohsawa. 2021. Unbiased Self-Play.” arXiv:2106.03007 [Cs, Econ, Stat].
Parisotto, and Salakhutdinov. 2017. Neural Map: Structured Memory for Deep Reinforcement Learning.” arXiv:1702.08360 [Cs].
Powell, Warren. 2007. Introduction to Markov Decision Processes.” In Approximate Dynamic Programming.
Powell, Warren B. 2020. On State Variables, Bandit Problems and POMDPs.”
———. 2021. From Reinforcement Learning to Optimal Control: A Unified Framework for Sequential Decisions.” In Handbook of Reinforcement Learning and Control.
Roy, Gordon, and Thrun. 2005. Finding Approximate POMDP Solutions Through Belief Compression.” Journal of Artificial Intelligence Research.
Sterman. 1989. Modeling Managerial Behavior: Misperceptions of Feedback in a Dynamic Decision Making Experiment.” Management Science.
Thrun, Langford, and Fox. 1999. Monte Carlo Hidden Markov Models: Learning Non-Parametric Models of Partially Observable Stochastic Processes.” In Proceedings of the International Conference on Machine Learning.

Footnotes

  1. Instant delivery is unrealistic — orders-in-transit are exactly the hidden state that causes the bullwhip effect in the full Beer Game. We add delivery delays when we need them.↩︎

  2. The standard inventory-OR convention works in “forgone profit relative to maximal” rather than profit/loss, which is IMO less intuitive. That is equivalent up to an additive constant.↩︎

  3. The proof is a one-step-deviation argument: switching to a better action for one step and then following \(\hat{\pi}_n\) can only help, and \(\hat{\pi}_{n+1}\) does at least as well as that.↩︎

  4. Why specifically sup norm? I got Claude to clue me in: Because \(T(\cdot \mid s, a)\) is a probability distribution, so the operator’s output at each state is a convex combination of next-state values, and convex combinations preserve sup-norm bounds (\(|\sum_{s’} p_{s’} x_{s’}| \leq \max_{s’} |x_{s’}|\) when the \(p_{s’}\) are non-negative and sum to one). Together with \(|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|\) for the action-max, the discount \(\gamma\) then deflates the sup-norm distance between successive iterates by exactly \(\gamma\). \(L^2\) and other norms don’t give a contraction in general — sup norm is the natural partner for an operator built from \(\max\) and a probability kernel.↩︎

  5. from hell’s heart I stab at thee, Aramex↩︎

  6. Although see full history RL↩︎