Markov decision problems
2014-11-27 — 2026-04-16
Wherein a beer warehouse is employed as the running example, the belief state is established as a sufficient statistic for partial observations, and the multi-agent extension is noted.
Partially Observable Markov Decision Processes (POMDPs) are a meeting-point for many different methods and frameworks for sequential decision-making under uncertainty. They also give us a nice clean formalism for acting in the world, about which it is easy to build intuition, before leaping into the extensions we need in practice (approximations, non-stationary versions, multi-agent versions, online-learning versions, etc).
See also optimal control, reinforcement learning
1 MDP: the fully observable case
A Markov decision process (MDP) is a discrete-time stochastic control problem. We define one by specifying a tuple \((\mathcal{S}, \mathcal{A}, T, R, \gamma)\):
- \(\mathcal{S}\)
- State space.
- \(\mathcal{A}\)
- Action space.
- \(T(s' \mid s, a) = P(s_{t+1} = s' \mid s_t = s, a_t = a)\)
- Transition dynamics — the physics.
- \(R(s, a)\)
- Reward, or running cost. A function of the current state and action.
- \(\gamma \in [0, 1)\)
- Discount factor.
Each time step: the agent sees state \(s_t\), picks action \(a_t \sim \pi(\cdot \mid s_t)\), pays or earns \(R(s_t, a_t)\), and the world advances to \(s_{t+1} \sim T(\cdot \mid s_t, a_t)\).
1.1 Running example: the beer warehouse
I’ll return repeatedly to a concrete inventory problem throughout. This is a single-player, single-echelon version of Sterman’s Beer Game (1989) — we generalise to the multi-player, partially-observable supply chain in Section 2.
A bottle shop manages stock of a single product: crates of beer. Each week, the state \(s_t \in \{0, 1, \ldots, 20\}\) is crates on hand. The action \(a_t \in \{0, 1, \ldots, 10\}\) is how many crates to order from the distributor, delivered instantly.1 Customer demand \(d_t\) is i.i.d. with \(d_t \sim \operatorname{Poisson}(4)\), so about 4 crates per week on average, but with a fat right tail — a run on beer is always possible.
Transition dynamics:
\[ s_{t+1} = \min\!\bigl(\max(s_t + a_t - d_t,\; 0),\; 20\bigr). \tag{1}\]
Clamping at zero means excess demand is lost (no backorders); clamping at 20 means the warehouse is full. This defines \(T(s' \mid s, a)\) implicitly through the demand distribution.
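As a concrete sketch (not something spelled out in the text), the kernel can be tabulated directly from the Poisson demand; the helper names and the tail-truncation choice below are mine.

```python
import numpy as np
from scipy.stats import poisson

S = np.arange(21)    # stock on hand: 0..20 crates
A = np.arange(11)    # order quantity: 0..10 crates

def demand_pmf(max_d=60):
    """Poisson(4) weekly demand, truncated far out in the tail."""
    d = np.arange(max_d + 1)
    p = poisson(4).pmf(d)
    p[-1] += 1.0 - p.sum()          # park the leftover tail mass in the last bin
    return d, p

def transition_matrix():
    """T[a, s, s2] = P(s_{t+1} = s2 | s_t = s, a_t = a), per Equation 1."""
    d, p = demand_pmf()
    T = np.zeros((len(A), len(S), len(S)))
    for a in A:
        for s in S:
            s_next = np.clip(s + a - d, 0, 20)   # lost sales below 0, full warehouse above 20
            np.add.at(T[a, s], s_next, p)        # accumulate demand probabilities
    return T

T = transition_matrix()
assert np.allclose(T.sum(axis=2), 1.0)           # each (a, s) row is a distribution
```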
Now for the costs. Each crate costs $10 to order. Holding a crate in the warehouse for a week costs $2 (refrigeration, opportunity cost of capital, floor space). Running out of stock costs $25 per crate of unmet demand — lost sales plus the reputational damage of telling a customer we have no beer.
The per-step reward is then:
\[ R(s, a) = -\bigl[\underbrace{10 a}_{\text{ordering}} + \underbrace{2(s + a)}_{\text{holding}} + \underbrace{25 \cdot \mathbb{E}[\max(d_t - s - a,\; 0)]}_{\text{stockout}}\bigr]. \tag{2}\]
(Everything is a cost, so \(R\) is always negative; less negative is better.)
Suppose we have \(s_t = 3\) crates on hand and order \(a_t = 5\). Ordering costs \(10 \times 5 = \$50\). We hold \(3 + 5 = 8\) crates, costing \(2 \times 8 = \$16\). With 8 crates available and \(d_t \sim \operatorname{Poisson}(4)\), the probability of a stockout (\(d_t > 8\)) is about 0.02 — small but not zero — so the expected stockout penalty is \(25 \times \mathbb{E}[\max(d_t - 8, 0)] \approx \$0.84\). Total: \(R(3, 5) \approx -\$66.84\).
We can compute \(R(3, 5)\) without knowing what \(s_{t+1}\) turns out to be. Holding cost is for what’s on the shelf now; ordering cost is for what we ordered now; the stockout penalty depends on the demand distribution and current available stock, not on next week.
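A matching sketch of Equation 2, continuing in the same namespace as the transition sketch above; the cost constants are the ones quoted in the text and `expected_reward` is a hypothetical helper name.

```python
ORDER_COST, HOLD_COST, STOCKOUT_COST = 10.0, 2.0, 25.0

def expected_reward():
    """R[s, a] per Equation 2: ordering + holding + expected stockout penalty."""
    d, p = demand_pmf()
    R = np.zeros((len(S), len(A)))
    for s in S:
        for a in A:
            available = s + a
            shortfall = np.sum(p * np.maximum(d - available, 0))   # E[max(d - s - a, 0)]
            R[s, a] = -(ORDER_COST * a + HOLD_COST * available + STOCKOUT_COST * shortfall)
    return R

R = expected_reward()
print(round(R[3, 5], 2))   # about -66.8, matching the worked example
```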
1.2 The causal picture
The warehouse illustrates a general pattern. An MDP describes a system in some configuration \(s\), subject to control input \(a\), incurring a running cost that depends on the configuration and the control, not on where the system lands next. The transition kernel \(T(s' \mid s, a)\) is the dynamics — how the configuration evolves — and is separate from the accounting in \(R\).
I find this confusing; why is the cost not determined by the world (\(R(s)\)) rather than the agent’s actions (\(R(s, a)\))? Why is it not determined by the outcome of the action (\(R(s, a, s’)\))?
The answer is that the \(R(s, a)\) version is minimal and general.
Some formulations, for example, write \(r(s, a, s')\), making the reward depend on the transition outcome. This is equivalent: we can always marginalise,
\[ R(s, a) = \sum_{s'} T(s' \mid s, a)\, r(s, a, s'), \tag{3}\]
absorbing the “where did I land?” accounting into an expectation taken now; the realised outcome still enters through the next state’s value in the Bellman recursion. In the warehouse, we could define \(r(s, a, s')\) to penalise the realised stockout for a specific demand draw rather than its expectation; marginalising over demand recovers Equation 2.
1.3 The objective and discounting
We want to maximise total discounted reward (the return):
\[ G = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t). \tag{4}\]
This is an additive objective — per-step costs just accumulate, which is the natural accounting for a warehouse that pays its bills every week. That said, additivity does rule out some things we might care about: minimising worst-case single-period cost, or minimising the variance of costs, would give non-additive objectives where Bellman’s recursion doesn’t apply directly.2
We set \(\gamma = 0.95\), which means a dollar of cost next week is equivalent to 95 cents this week. The discount factor makes the infinite sum converge, and gives us a mild preference for deferring costs. For the warehouse it could reflect the time value of money, or just be a mathematical convenience.
1.4 The Markov property and the burden on \(\mathcal{S}\)
The Markov property is what makes all this tractable:
\[ P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) = T(s_{t+1} \mid s_t, a_t). \tag{5}\]
The future is conditionally independent of the past given the present state. For the warehouse: if I know there are 3 crates on the shelf and I’m ordering 5 more, the history of past orders and past demand tells me nothing additional about next week’s stock level.
For our simple warehouse, the Markov property holds because we set it up that way — demand is i.i.d., delivery is instant, and the transition (Equation 1) depends only on current stock plus current order minus current demand. But now suppose demand has a weekly cycle (more beer on Fridays) and we haven’t put the day-of-week in \(\mathcal{S}\). Suddenly the same stock level of 3 crates has different implications depending on whether it’s Tuesday or Friday, and the history does contain information (what day is it?) that the state alone doesn’t capture. The Markov property has failed — not because the physics changed, but because our state is too small.
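One standard repair, sketched here rather than spelled out in the text, is to enlarge the state until the history is redundant again:

\[ \tilde{\mathcal{S}} = \mathcal{S} \times \{\text{Mon}, \ldots, \text{Sun}\}, \qquad \tilde{s}_t = (s_t, \text{day}_t), \]

so that \(P(\tilde{s}_{t+1} \mid \tilde{s}_t, a_t)\) is Markov once more: the day advances deterministically, and the demand distribution is a known function of it.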
The real warehouse has whatever dynamics it has — customers show up when they show up, beer spoils at whatever rate it spoils. Markovity is a claim about whether our model’s state variable captures enough of that reality for the history to be redundant. The same warehouse can be modelled with a Markov state or a non-Markov one, depending on what we include in \(\mathcal{S}\). The state, with respect to our model, must be a sufficient statistic — it must absorb enough of the world’s actual structure that, conditional on \(s_t\), the past adds no further predictive power. If disgruntled customers defect after a stockout, “customer goodwill” had better be in \(\mathcal{S}\). If there’s a seasonal trend, the season had better be in there too.
When we break full observability in Section 2, we’ll see this same failure mode formalised: an agent that conditions on the wrong information is implicitly using the wrong state.
1.5 Bellman recursion
Given a policy \(\pi(a \mid s)\), the value function is the expected return from state \(s\):
\[ V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}) \;\middle|\; s_t = s\right]. \tag{6}\]
We can split this into the immediate reward plus the discounted future. The trick is to peel off the first term:
\[ V^\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V^\pi(s') \right]. \tag{7}\]
This is the Bellman expectation equation. The value of being in state \(s\) equals the immediate reward plus \(\gamma\) times the expected value of wherever we land, averaged over our action choice (policy) and the stochastic transition (dynamics).
For the warehouse with \(\gamma = 0.95\): the value of having 3 crates on hand, under some ordering policy \(\pi\), is this week’s expected cost \(R(3, \pi(3))\) — about \(-\$67\) if we order 5 — plus \(0.95\) times the expected value of whatever stock level we find ourselves at next Monday. That second term averages over demand: if \(d_t = 2\) we’ll have 6 crates left, if \(d_t = 7\) we’ll have 1, and the value function \(V^\pi\) tells us how costly each of those situations is going forward.
\(V^\pi\) tells us how good each state is under policy \(\pi\), but it doesn’t let us compare actions — it has already averaged over whatever \(\pi\) does. To ask “what if I deviated and ordered 5 instead of 4 this week, then went back to following \(\pi\)?” we need the action-value function (Q-function):
\[ Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V^\pi(s'), \tag{8}\]
so that \(V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)\). \(Q^\pi(3, 5)\) is the expected total discounted cost if we have 3 crates, order 5 this week, and follow \(\pi\) from next week onward. \(Q^\pi(3, 4)\) is the same but ordering 4. Comparing them tells us which order quantity is better at this state, under this policy — and that comparison is exactly the mechanism we need to improve a policy.
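As a sketch of that comparison in code, continuing from the `transition_matrix` and `expected_reward` sketches above and using a made-up baseline policy of always ordering 4:

```python
GAMMA = 0.95

def evaluate_policy(policy, T, R, gamma=GAMMA):
    """Solve the linear system in Equation 7 for a deterministic policy."""
    n = len(S)
    T_pi = np.array([T[policy[s], s] for s in S])   # P(s' | s, policy(s))
    R_pi = np.array([R[s, policy[s]] for s in S])   # R(s, policy(s))
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

def q_values(V, T, R, gamma=GAMMA):
    """Q[s, a] = R(s, a) + gamma * sum_{s'} T(s' | s, a) V(s'), per Equation 8."""
    return R + gamma * np.einsum("asx,x->sa", T, V)

always_4 = np.full(len(S), 4)            # baseline policy: order 4 crates regardless
V_pi = evaluate_policy(always_4, T, R)
Q_pi = q_values(V_pi, T, R)
print(Q_pi[3, 5], Q_pi[3, 4])            # which order quantity is better at s = 3?
```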
1.6 Optimal policies via policy improvement
The direct route to an optimal policy is policy iteration, which I believe originates with Howard (1960), though I learned it from a sheaf of poorly attributed photocopied notes.
Given some policy \(\pi\), the Q-function lets us ask at each state: is there an action that does better than what \(\pi\) prescribes? That is, we check whether
\[ \max_a Q^\pi(s, a) > V^\pi(s) \tag{9}\]
at any state \(s\). If so, we build an improved policy \(\pi'\) that acts greedily with respect to \(Q^\pi\):
\[ \pi'(s) = \arg\max_a Q^\pi(s, a). \tag{10}\]
The policy improvement theorem guarantees that \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s\), with strict inequality at any state where \(\pi\) was suboptimal.3
Policy iteration alternates two phases:
- Evaluate: compute \(V^\pi\) (solve the linear system Equation 7).
- Improve: set \(\pi'(s) = \arg\max_a Q^\pi(s, a)\) for all \(s\).
Repeat until the policy stops changing. Since there are finitely many deterministic policies and each step improves, this terminates — at a policy \(\pi^*\) where no single-state improvement is possible.
For the warehouse: start with some ordering rule — say, always order 4 crates regardless of stock. Evaluate it: solve for \(V^\pi(s)\) at each stock level (this is a system of 21 linear equations). Then at each state, check all 11 possible order quantities and switch to whichever gives the best \(Q^\pi(s, a)\). Maybe at \(s = 1\) we find that ordering 6 is better than ordering 4; at \(s = 10\) we find ordering 0 is better. Re-evaluate the new policy, improve again, and so on. In practice this converges in a handful of iterations.
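A minimal policy-iteration loop, reusing `evaluate_policy` and `q_values` from the sketch above; the starting policy and the stopping rule are the ones described in this paragraph.

```python
def policy_iteration(T, R, gamma=GAMMA):
    policy = np.full(len(S), 4)                            # start: always order 4
    while True:
        V = evaluate_policy(policy, T, R, gamma)           # evaluate (Equation 7)
        greedy = q_values(V, T, R, gamma).argmax(axis=1)   # improve (Equation 10)
        if np.array_equal(greedy, policy):                 # no state wants to deviate
            return policy, V
        policy = greedy

pi_star, V_star = policy_iteration(T, R)
print(dict(zip(S.tolist(), pi_star.tolist())))   # optimal order quantity at each stock level
```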
1.7 Optimality via value functions
There is an equivalent, more compact route. At the fixed point of policy iteration, the policy is greedy with respect to its own value function — no improvement is possible. This means \(V^*\) satisfies
\[ V^*(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V^*(s') \right] = \max_a Q^*(s, a), \tag{11}\]
the Bellman optimality equation, and the optimal policy is
\[ \pi^*(s) = \arg\max_a Q^*(s, a). \tag{12}\]
We can also solve for \(V^*\) directly, without ever representing a policy, via value iteration: start with any \(V_0\) and repeatedly apply
\[ V_{n+1}(s) \leftarrow \max_{a} \left[ R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V_n(s') \right]. \tag{13}\]
This converges because the Bellman operator is a contraction in the sup norm with factor \(\gamma\) — the fixed point exists, is unique, and any starting guess gets there.
For the warehouse: set \(V_0(s) = 0\) for all \(s\), i.e. pretend there are no future costs. The first iteration, \(V_1(s)\), just picks the action that minimises this week’s cost at each stock level — a one-week myopic optimum. The second iteration, \(V_2(s)\), optimises this week knowing \(V_1\) handles next week — a two-week horizon. Each pass folds in one more week of lookahead, and because \(\gamma = 0.95\) the far future gets discounted away. After enough iterations (a few hundred, since the error shrinks by a factor of \(\gamma\) per sweep), \(V_n\) has converged and we read off \(\pi^*\).
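And a matching value-iteration sketch, again reusing `q_values`; the tolerance is an arbitrary choice of mine.

```python
def value_iteration(T, R, gamma=GAMMA, tol=1e-6):
    V = np.zeros(len(S))                                   # V_0 = 0: no future costs
    while True:
        V_new = q_values(V, T, R, gamma).max(axis=1)       # one Bellman backup (Equation 13)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, q_values(V_new, T, R, gamma).argmax(axis=1)
        V = V_new

V_vi, pi_vi = value_iteration(T, R)    # should agree with policy iteration, up to ties
```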
Both routes — policy iteration and value iteration — converge to the same \(\pi^*\). Policy iteration is often faster in practice (fewer iterations, though each is more expensive because it solves a linear system). Value iteration is simpler and generalises more naturally to approximate settings where we can’t solve the linear system exactly — which is to say, most settings of interest, including the function-approximation world where transformers live.
For the general theory — continuous-time limits, Hamilton-Jacobi-Bellman, policy gradient methods — see optimal control or, like, W. Powell (2007).
2 POMDP: partial observability
“A POMDP is a partially observable Markov decision process. It is a model, originating in the operations research (OR) literature, for describing planning tasks in which the decision maker does not have complete information as to its current state. The POMDP model provides a convenient way of reasoning about tradeoffs between actions to gain reward and actions to gain information.”
Everything in the MDP development above rests on one luxury: we see \(s_t\). We know there are 3 crates on the shelf, so we look up \(\pi^*(3)\) and order accordingly. Every equation — Bellman expectation, Q-function, policy improvement, value iteration — takes \(s_t\) as given.
In the partially observed case we don’t have that. The world still has a state, it still evolves via \(T\), rewards still accrue via \(R\). The world is an MDP. We are the ones with the problem — our information about \(s_t\) is incomplete.
Three things follow in the partially-observed setting:
- What we want hasn’t changed: maximise cumulative discounted reward.
- What we can condition on has changed. Instead of \(s_t\) we have a history of actions and observations. A policy is now \(\pi(a_t \mid h_t)\), where \(h_t = (a_0, o_1, a_1, o_2, \ldots, o_t)\) grows every step.
- We can compress that history into a belief state — a probability distribution over where we think we are — and recover the MDP machinery on a new, larger state space.
2.1 The warehouse with delivery delays
In Section 1.1 we gave ourselves instant delivery — order crates, they appear on the shelf the same week. Suppose they don’t. Suppose orders take 1–3 weeks to arrive. We place an order, it goes to the distributor, and some time later a truck shows up.
The true state of the system is now \((s_t^{\text{shelf}},\, s_t^{\text{pipeline}})\): crates on the shelf and crates in transit. But we can only see the shelf — we can’t see the truck, and we don’t know exactly when it will arrive.
Concretely: each order \(a_t\) enters a pipeline and arrives after a random delay \(\ell \sim \operatorname{Uniform}\{1, 2, 3\}\) weeks. Each week, we observe \(o_t = s_t^{\text{shelf}}\) — we can count what’s on the shelf — plus how many crates arrived that week (the truck either shows up or it doesn’t). But we don’t observe how many crates are still in transit, or when they’ll arrive.
This is the mechanism behind Sterman’s Beer Game (1989), which motivated this running example.
The transition now involves both the shelf and the pipeline, but we only see part of it.4
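A toy simulator for this delayed-delivery setup, as a sketch under my own modelling choices (in particular, representing the pipeline as a list of (weeks-remaining, quantity) pairs is a choice the text doesn’t fix):

```python
import numpy as np

rng = np.random.default_rng(0)

def step(shelf, pipeline, order):
    """Advance one week; return the new hidden state and the observation."""
    pipeline = [(weeks - 1, qty) for weeks, qty in pipeline]        # the clock ticks
    arrived = sum(qty for weeks, qty in pipeline if weeks == 0)     # the truck shows up (or not)
    pipeline = [(w, q) for w, q in pipeline if w > 0]
    if order > 0:
        pipeline.append((int(rng.integers(1, 4)), order))           # delay ~ Uniform{1, 2, 3}
    demand = rng.poisson(4)
    shelf = int(np.clip(shelf + arrived - demand, 0, 20))
    observation = (shelf, arrived)       # what we see; the pipeline stays hidden
    return (shelf, pipeline), observation

state = (3, [])                          # 3 crates on the shelf, nothing in transit
for week in range(4):
    state, obs = step(*state, order=5)
    print(week, "observed:", obs, "hidden pipeline:", state[1])
```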
2.2 The POMDP tuple
Formally, a POMDP extends the MDP tuple to \((\mathcal{S}, \mathcal{A}, \mathcal{O}, T, O, R, \gamma)\), adding:
- \(\mathcal{O}\)
- Observation space — what the agent actually sees each step. For the warehouse: the pair (crates on shelf, crates that arrived this week).
- \(O(o \mid s', a) = P(o_t \mid s_t = s', a_{t-1} = a)\)
- Observation model — the probabilistic link between the true state and what we see. For the warehouse: the probability of seeing shelf count \(o^{\text{shelf}}\) and delivery count \(o^{\text{arrived}}\) given the true full state \(s' = (s'^{\text{shelf}}, s'^{\text{pipeline}})\).
The state is still Markov — the world hasn’t changed. The agent’s information about it is incomplete.
2.3 The belief state
Since we can’t condition on the full state \(s_t\), what can we condition on? The full history \(h_t = (a_0, o_1, a_1, o_2, \ldots, o_t)\) would work — it contains every shelf count and delivery we’ve ever seen, plus every order we’ve placed — but it grows without bound, so a policy \(\pi(a \mid h_t)\) is unwieldy.
The compression is the belief state \(b_t \in \Delta(\mathcal{S})\), a probability distribution over states:
\[ b_t(s) = P(s_t = s \mid h_t). \tag{14}\]
For the warehouse: \(b_t\) is a distribution over all possible (shelf, pipeline) pairs. We know the shelf exactly (we can see it), so the uncertainty is entirely about the pipeline — how many crates are in transit and when they’ll arrive. If we ordered 5 crates two weeks ago and they haven’t shown up yet, our belief assigns probability mass to “they’ll come this week” and “they’ll come next week”, with weights depending on the delay distribution and what we’ve observed so far.
Our ordering decision should depend on this whole distribution, not just the shelf count. If we’re fairly sure 5 crates are arriving tomorrow, ordering another 5 would be wasteful. If we think the delivery might be delayed, we’d better reorder. The asymmetry in costs (stockouts at $25/crate vs. holding at $2/crate) means we care about the shape of the belief, not just its mean.
Given a prior belief \(b_t\), an action \(a_t\), and a new observation \(o_{t+1}\), the belief update follows from Bayes’ rule:
\[ b_{t+1}(s') \propto O(o_{t+1} \mid s', a_t) \sum_{s \in \mathcal{S}} T(s' \mid s, a_t)\, b_t(s). \tag{15}\]
This is a predict-then-correct cycle that should look familiar from state filtering. The sum over \(s\) is the prediction step: propagate each possible current state forward through the transition dynamics, weighted by how likely we think that state is. For the warehouse: for each possible pipeline configuration, advance the clock by one week — some in-transit crates might arrive, demand draws down the shelf, our new order enters the pipeline.
The multiplication by \(O\) is the correction step: given what we actually observed, upweight states consistent with that observation and downweight the rest. For the warehouse: if no delivery arrived this week, we downweight pipeline states where a delivery was due and upweight states where everything is still in transit.
This predict-correct structure is the same one that appears in Kalman filtering (linear-Gaussian systems) and particle filtering (nonlinear ones).
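A generic discrete Bayes-filter step implementing Equation 15, as a sketch with array shapes of my choosing: `T[a, s, s2]` as in the MDP sketches above, and an observation model laid out as `O[a, s2, o]`.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One predict-correct step of Equation 15 for finite state spaces."""
    predicted = T[a].T @ b                 # predict: sum_s T(s' | s, a) b(s)
    corrected = O[a, :, o] * predicted     # correct: weight by P(o | s', a)
    return corrected / corrected.sum()     # renormalise (assumes o was possible under b)
```

For the warehouse, \(b\) would be a vector over the enumerated (shelf, pipeline) configurations; while that enumeration stays small, the update is just a matrix–vector product and a pointwise rescaling.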
2.4 The belief-space MDP
The belief state is itself a sufficient statistic: \(b_t\) captures everything in \(h_t\) that matters for predicting the future and choosing actions. That means \(b_t\) is Markov — and the POMDP is an MDP on belief space \(\Delta(\mathcal{S})\).
We can write the Bellman equation for this belief-space MDP. The reward at belief \(b\), taking action \(a\), is the expected reward over states we might be in:
\[ \rho(b, a) = \sum_{s} b(s)\, R(s, a). \tag{16}\]
For the warehouse: if we believe there’s a 0.6 chance the pipeline is empty and a 0.4 chance 5 crates are arriving this week, our expected cost of ordering \(a\) more crates is a weighted average over those scenarios — overstocking in one, possibly stocking out in the other.
The value function on belief space is
\[ V^*(b) = \max_a \left[ \rho(b, a) + \gamma \sum_{o} P(o \mid b, a)\, V^*\!\bigl(\tau(b, a, o)\bigr) \right], \tag{17}\]
where \(\tau(b, a, o)\) is the updated belief after taking action \(a\) and observing \(o\) (Equation 15), and \(P(o \mid b, a) = \sum_{s'} O(o \mid s', a) \sum_s T(s' \mid s, a)\, b(s)\) is the marginal probability of observation \(o\).
This is structurally identical to Equation 11. We have replaced state \(s\) with belief \(b\), reward \(R\) with expected reward \(\rho\), and the sum over next states \(s’\) with a sum over observations \(o\) (each of which deterministically maps to a new belief via Bayes’ rule).
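To make the parallel concrete, here is a sketch of the one-step lookahead in Equation 17 at a single belief, assuming some approximate value function `V_hat` over beliefs (the exact \(V^*\) is precisely what we cannot tabulate; see the next subsection):

```python
import numpy as np

def lookahead(b, T, O, R, V_hat, gamma=0.95):
    """Greedy action at belief b via Equation 17, with V_hat standing in for V*."""
    n_actions, n_obs = T.shape[0], O.shape[-1]
    q = np.zeros(n_actions)
    for a in range(n_actions):
        rho = b @ R[:, a]                    # expected immediate reward (Equation 16)
        predicted = T[a].T @ b               # distribution over s' before seeing o
        future = 0.0
        for o in range(n_obs):
            p_o = O[a, :, o] @ predicted     # P(o | b, a)
            if p_o > 0:
                b_next = O[a, :, o] * predicted / p_o    # tau(b, a, o)
                future += p_o * V_hat(b_next)
        q[a] = rho + gamma * future
    return int(q.argmax()), q
```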
So — in principle, we’ve reduced the POMDP back to an MDP and all the machinery from Section 1.5 onward applies.
2.5 What beliefs actually look like
Before we celebrate, we should look at belief space and realise that it’s actually messy and inconvenient.
In the warehouse, the belief has a lot of structure. We observe the shelf directly, so there’s no uncertainty there — \(b_t\) assigns all its mass to shelf states consistent with what we just counted. The uncertainty is entirely about the pipeline: which past orders are still in transit and when they’ll arrive. And we know what we ordered and when — we placed those orders ourselves. The unknown is delivery timing.
So, say we ordered 5 crates two weeks ago. Delivery takes 1–3 weeks. They didn’t arrive in week 1. At that point our belief updates: the delay is 2 or 3 weeks, equally likely (since we’ve ruled out 1). They don’t arrive in week 2 either. Now we know they’re coming in week 3 — the belief about this order has collapsed to a point. But there’s still uncertainty about the order we placed last week.
More generally, the belief at any time is determined by: the shelf count (observed), plus for each outstanding order, a distribution over its remaining delivery time (conditioned on it not having arrived yet).
In this example the beliefs have nice structure — the shelf is observed, the pipeline uncertainty factorises by order — so the belief itself is a compact object. That won’t always be the case; in general, beliefs can be high-dimensional and multimodal with no convenient factorisation. But even when the belief is tractable to represent, the POMDP is harder than the MDP in ways worth spelling out.
In the MDP, we also plan through uncertain futures — demand is stochastic, so next week’s stock level is unknown when we choose our order. The Bellman equation (Equation 11) sums over possible next states. But that uncertainty resolves for free: next Monday we just look at the shelf and see \(s_{t+1}\). We never have to track or update a belief about the state — nature hands it to us, and we move on.
In the POMDP, three things are different:
The belief is always “continuous”. For the MDP warehouse, value iteration fills in a table: one number per state, 21 entries. For the POMDP, \(V^*\) is a function over belief space, and beliefs are continuous even when \(\mathcal{S}\) is finite. We can’t just enumerate them.
Belief updating costs compute. Each time step requires Bayesian inference — the sum-then-multiply in Equation 15 — to fold the new observation into the belief. In the MDP we never paid this cost; \(s_t\) was simply given. For small finite state spaces the update is a matrix–vector multiply, which is cheap. But in larger or continuous state spaces, exact Bayesian updating becomes intractable and we need approximations (particle filters, variational methods, etc.).
Planning branches over observations, not just states. In the MDP Bellman equation, we sum over next states \(s'\) — but each of those states will be observed, so the agent can react to whichever one actually occurs. In Equation 17, we sum over observations \(o\), and each observation leads to a different future belief \(\tau(b, a, o)\), which leads to different future actions, which lead to different future observations, and so on. We’re planning over a tree of possible belief trajectories, not a set of possible next states. The optimal action at belief \(b\) might be “order conservatively, because if the truck arrives tomorrow the belief will shift and we can adjust, but if it doesn’t we need enough stock to last” — that kind of contingent reasoning over future information is what the MDP agent gets for free by simply observing \(s_{t+1}\).
These three costs compound. In the MDP, we iterated a table of 21 numbers to convergence. In the POMDP, each iteration requires evaluating \(V^*\) at beliefs we haven’t seen before (continuous domain), updating those beliefs by Bayesian inference (per-step compute), and branching over possible observations at every future time step (exponential tree). The MDP agent plans by asking “what should I do in each state?” The POMDP agent plans by asking “what should I do given each possible history of partial information?” — a much larger question.
TODO — when we discuss solvers, we’ll see that \(V^*(b)\) has exploitable structure in the finite case (it is piecewise linear and convex in \(b\), which exact solvers leverage), but the number of pieces can grow exponentially with the horizon.
2.6 The Beer Game: partial observability from multi-agent structure
The single-echelon warehouse from Section 2.1 had partial observability because of delivery delays — one agent who can’t see the pipeline. The full Beer Game (1989) is worse: partial observability arises from the multi-agent structure itself.
The supply chain has four players, each managing their own inventory:
\[ \text{Customer} \to \text{Retailer} \to \text{Wholesaler} \to \text{Distributor} \to \text{Factory} \]
Each player \(i\) sees only their own shelf \(s_t^{(i)}\) and the orders arriving from their downstream neighbour. They do not see:
- how much stock anyone else has,
- what orders other players have placed upstream,
- how much is in transit between any pair of players,
- what the end-customer demand actually was (only the retailer sees this, and even then only up to their available stock).
The global state of the system is the vector of all shelf levels and all pipeline contents across the four echelons — a large but well-defined MDP state. If a single planner saw the whole thing, they could run value iteration (or policy iteration, if the state space cooperated) and compute an optimal joint ordering policy.
But no player sees the global state. Each player has a local observation — their own shelf and incoming orders — and must choose how much to order from their upstream supplier. From the retailer’s perspective, the wholesaler’s stock level, the distributor’s backlog, and the factory’s production schedule are all hidden state. The retailer faces a POMDP: the transition of their local state depends on the wholesaler’s actions, which depend on the distributor’s actions, and so on.
This creates the bullwhip effect. The retailer sees demand spike (say customers suddenly want more beer) and places a larger order with the wholesaler. The wholesaler sees that larger order — but not the end-customer demand that caused it — and interprets it as a signal that demand is even higher, so they amplify the order to the distributor. The distributor does the same to the factory. Each echelon is doing local inference on partial observations, and each one’s actions become noisy observations for the next, so the signal degrades and amplifies as it propagates upstream.
Formally, each player \(i\) has:
- State: their own shelf \(s_t^{(i)}\) and pipeline \(s_t^{(i,\text{pipe})}\) — but the pipeline depends on actions taken by player \(i+1\), which are unobserved.
- Observation: \((s_t^{(i,\text{shelf})},\, \text{orders received from player } i-1,\, \text{deliveries received from player } i+1)\).
- Action: how many units to order from player \(i+1\).
Each player faces a POMDP where the hidden state includes the other players’ inventories and decisions. But it’s worse than four separate POMDPs, because the agents interact: player \(i\)’s action (their order) becomes part of player \(i+1\)’s observation (incoming orders), and player \(i+1\)’s action (their shipment) becomes part of player \(i\)’s observation (incoming deliveries). Each agent’s optimal policy depends on what the other agents do, which depends on what they believe, which depends on their observations, which depend on the first agent’s actions.
This is a decentralised POMDP (Dec-POMDP). A Dec-POMDP is defined by a tuple \((\mathcal{I}, \mathcal{S}, \{\mathcal{A}^{(i)}\}, \{\mathcal{O}^{(i)}\}, T, \{O^{(i)}\}, R, \gamma)\) where \(\mathcal{I} = \{1, \ldots, n\}\) is the set of agents, each with their own action space \(\mathcal{A}^{(i)}\) and observation space \(\mathcal{O}^{(i)}\). The transition \(T(s' \mid s, a^{(1)}, \ldots, a^{(n)})\) depends on the joint action of all agents, but each agent \(i\) chooses \(a_t^{(i)}\) based only on their own observation history \(h_t^{(i)}\). The reward \(R\) is shared — all four players in the supply chain are penalised by total system cost — but they can’t coordinate because they don’t share observations.
The single-agent POMDP was already hard because we had to plan over a tree of possible belief trajectories (Section 2.5). In a Dec-POMDP, each agent must also reason about what the other agents believe and will do — but can’t observe their beliefs or actions directly. Agent \(i\)’s optimal policy depends on agent \(j\)’s policy, which depends on agent \(j\)’s beliefs about agent \(i\)’s policy, and so on. This mutual dependence is what pushes Dec-POMDPs into a harder complexity class: finite-horizon Dec-POMDPs are NEXP-complete (Bernstein et al. 2002), compared to PSPACE-complete for single-agent POMDPs. The “DEC” in front adds an exponential.
Sterman’s (1989) experimental result is that human players behave as if they’re solving a much simpler problem — an MDP with \(s_t^{(i,\text{shelf})}\) as the only state. They order to fill the gap between desired and observed stock, ignoring the pipeline. This is a rational policy for the wrong model: an MDP where the state is too small and the Markov property fails. The optimal policy would maintain a belief over the pipeline and the upstream supply chain, and order accordingly — but maintaining that belief is exactly the hard part we identified in Section 2.5, and in the multi-agent case it’s harder still.
2.7 POMDP while learning forward propagator
Keen to investigate Collins and Kurniawati (2021), which I believe is also summarised in Kurniawati (2022).
3 References
Footnotes
Instant delivery is unrealistic — orders-in-transit are exactly the hidden state that causes the bullwhip effect in the full Beer Game. We’ll add delivery delays when we need them.↩︎
There are extensions — constrained MDPs, risk-sensitive MDPs — but we leave them aside.↩︎
The proof is a one-step-deviation argument: switching to a better action for one step and then following \(\pi\) can only help, and \(\pi’\) does at least as well as that.↩︎
There are other sources of partial observability in a more realistic version of the problem. A subtler one is censored demand: when we stock out, we see zero crates remaining but we don’t know how many customers walked away. Our observation of demand is truncated at available stock. This creates an exploration-exploitation tension — stocking more is costly but also informative.↩︎

