Wherein a chapter‑like account is presented, and a matchbox machine called MENACE is described, in which beads encoding tic‑tac‑toe moves are sampled and then added or removed on wins or losses.
adaptive
agents
bandit problems
control
incentive mechanisms
learning
networks
stochastic processes
time series
utility
Here’s an intro to all of machine learning through a historical tale about one particular attempt to teach a machine (not a computer!) to play tic-tac-toe. Rodney Brooks, in Machine Learning Explained, introduces Donald Michie’s MENACE, a machine that plays tic-tac-toe using a box of matchboxes. Each matchbox corresponds to a board state and contains a number of beads, each representing a possible move. The machine plays by randomly selecting a bead from the matchbox corresponding to the current state of the game. If it wins, it adds more beads to the matchboxes it used during the game; if it loses, it removes some. This algorithm is not optimal, but it is an interesting start. Learning to make it optimal is a problem of reinforcement learning (RL).
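To make the mechanism concrete, here is a minimal sketch of a MENACE-style learner in Python. The bead counts (three to start, three added per win, one removed per loss, never below one) and the encoding of board states as dictionary keys are assumptions for illustration, not Michie’s exact scheme.

import random
from collections import defaultdict

# One "matchbox" per board state, holding a bead count for each legal move.
boxes = defaultdict(lambda: defaultdict(lambda: 3))  # assumed: start with 3 beads per move

def choose_move(state, legal_moves):
    """Sample a move with probability proportional to its bead count."""
    weights = [boxes[state][m] for m in legal_moves]
    return random.choices(legal_moves, weights=weights, k=1)[0]

def update(history, won):
    """After a game, reinforce or punish every (state, move) pair that was played."""
    for state, move in history:
        if won:
            boxes[state][move] += 3  # add beads on a win
        else:
            boxes[state][move] = max(1, boxes[state][move] - 1)  # remove a bead on a loss (keep at least one)

During play, choose_move picks each move and the (state, move) pairs are appended to a history list; update is called once at the end of each game.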
1 Theory
You shouldn’t read this blog post to learn about RL. There are many fine introductions on the internet. The introductions below are great.
Nonetheless, I need to learn some more RL and fix some notation, so there is a tutorial below, comprising my own notes, if you want to read them.
Opponent shaping is a way to get agents to influence one another in multi-agent systems by using models of the other agents. This was interesting enough to break out into its own notebook; see opponent shaping.
Transition dynamics \(\displaystyle P(s_{t+1}\mid s_t, a_t)\)
A (possibly stochastic) rule for how the world updates when you take action \(a_t\) in \(s_t\). E.g. in CartPole, push left/right → new pole‑angle and cart‑position.
For the purposes of describing the problem we write this out, but in fact if we are bothering with RL we do not actually know this. All we do is sample from it. E.g. in CartPole, you don’t predict the next angle of the pole, but you can sample it by running the simulation (see the sketch after these definitions).
I mean, you can predict it in CartPole, but you are not required to, and totally black-box systems are amenable to RL.
Reward \(\displaystyle r_t = R(s_t, a_t, s_{t+1})\)
The scalar “feedback” we get when we go from \(s_t\) to \(s_{t+1}\) via \(a_t\). High reward = good; low (or negative) = bad.
Once again, we do not know this function, we just sample from it after taking an action. E.g. in CartPole, you get a reward of +1 for every time step you keep the pole upright.
Discount factor \(\displaystyle \gamma\in[0,1]\)
How much we prefer “now” rewards over “later” rewards: \(\gamma=0\) → we only care about immediate reward; \(\gamma\to1\) → we care a lot about the future.
Initial state distribution \(\displaystyle \rho_0(s)\)
Where episodes start: you sample \(s_0\sim\rho_0\). Often a fixed start state or a uniform/random choice.
People often seem to suppress this in the problem; I don’t think it adds much complexity in practice.
Horizon \(T\) (or termination condition)
Max length of an episode (or “done” flag). If infinite, we rely on \(\gamma<1\) to keep returns finite. This is often suppressed also.
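To make “we only sample from \(P\) and \(R\)” concrete, here is a minimal sketch that rolls out a uniformly random policy in CartPole, assuming the classic gym step API (four return values), as in the training listing further below.

import gym

env = gym.make("CartPole-v1")
state = env.reset()            # s_0 ~ rho_0
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()               # a_t from a uniformly random policy
    next_state, reward, done, _ = env.step(action)   # samples s_{t+1} ~ P and r_t = R(...)
    total_reward += reward
    state = next_state
# total_reward is the (undiscounted) return of this random rollout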
The next few are definitions that we find convenient to make, dependent upon the above.
Policy \(\displaystyle \pi_\theta(a\mid s)\)
Our agent’s “strategy”: a mapping from state→distribution over actions. Parameterized by \(\theta\) (e.g. the weights of a neural net).
We aim to pick \(\theta\) to make expected return as large as possible.
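As a concrete example, here is a minimal CartPole-sized policy network in PyTorch whose forward pass returns a torch.distributions.Categorical. The name YourPolicyNet and the layer sizes are placeholders, chosen only to match the training listing below.

import torch.nn as nn
from torch.distributions import Categorical

class YourPolicyNet(nn.Module):
    """pi_theta(a|s): maps a state vector to a distribution over discrete actions."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        logits = self.body(state)
        return Categorical(logits=logits)  # supports .sample() and .log_prob(a)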
Now if we knew \(P\) and \(R\) and their forms were tractable, we could just use dynamic programming to solve the MDP; let us assume we do not know them.
We want to maximise the expected return \[
J(\theta)
= \mathbb{E}_{\tau\sim\pi_\theta}\Bigl[\sum_{t=0}^{T-1}\gamma^t\,r_t\Bigr].
\] Using the score function trick, we can write \[
\nabla_\theta \pi_\theta(\tau)
= \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau),
\] Writing \(J(\theta)=\int \pi_\theta(\tau)\,R(\tau)\,d\tau\), where \(\pi_\theta(\tau)\) is the probability of the trajectory \(\tau\) under the policy and \(R(\tau)=\sum_{t=0}^{T-1}\gamma^t r_t\), and differentiating under the integral, we get \[
\nabla_\theta J(\theta)
= \int \! \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\,d\tau
= \mathbb{E}_{\tau}\bigl[\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\bigr].
\] The dynamics and initial-state factors in \(\pi_\theta(\tau)\) do not depend on \(\theta\), so \(\nabla_\theta\log \pi_\theta(\tau)=\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t\mid s_t)\); and since the action at time \(t\) cannot affect rewards earned before \(t\), each term may be weighted by the reward-to-go \(G_t\) rather than the full return. It follows that \[
\boxed{\;\nabla_\theta J(\theta)
=\mathbb{E}_{\tau}\Bigl[\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t\mid s_t)\;G_t\Bigr],
\quad G_t=\sum_{k=t}^{T-1}\gamma^{k-t}r_k\;.}
\]
Monte Carlo approximation of the expectation is straightforward. In practice, we sample \(N\) full episodes \(\{\tau^{(i)}\}_{i=1}^N\) and approximate \[
\nabla_\theta J(\theta)
\;\approx\; \frac1N\sum_{i=1}^{N}\sum_{t=0}^{T-1}
\nabla_\theta\log\pi_\theta(a_t^{(i)}\mid s_t^{(i)})\,G_t^{(i)}.
\]
This yields the following gradient ascent update rule: \[
\theta \;\leftarrow\; \theta \;+\;\alpha\;\frac1N\sum_{i,t}
\nabla_\theta\log\pi_\theta(a_t^{(i)}\mid s_t^{(i)})\,G_t^{(i)}.
\]
I got an LLM to generate me an implementation of this in PyTorch:
Policy learning
import torch
import torch.nn as nn
import torch.optim as optim

def compute_returns(rewards, gamma):
    """
    Input: list of rewards [r0, r1, ..., r_{T-1}]
    Output: torch.Tensor of discounted returns [G0, G1, ..., G_{T-1}]
    """
    returns = []
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return torch.tensor(returns, dtype=torch.float32)

def select_action(policy_net, state):
    """
    Samples an action and returns (action, log_prob).
    Assumes policy_net(state) returns a PyTorch distribution.
    """
    dist = policy_net(state)
    a = dist.sample()
    return a, dist.log_prob(a)

def train_policy(policy_net, optimizer, env, num_episodes, gamma=0.99):
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        rewards = []

        # 1) Generate one full episode
        done = False
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32)
            action, logp = select_action(policy_net, state_tensor)
            next_state, r, done, _ = env.step(action.item())
            log_probs.append(logp)
            rewards.append(r)
            state = next_state

        # 2) Compute discounted returns G_t
        returns = compute_returns(rewards, gamma)  # shape [T]

        # 3) Policy loss: -(sum_t logπ(a_t|s_t) * G_t) / T
        log_probs = torch.stack(log_probs)  # shape [T]
        loss = -(log_probs * returns).mean()

        # 4) Gradient ascent step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # (Optional) logging
        total_return = sum(rewards)
        if episode % 10 == 0:
            print(f"Ep {episode:4d}  Return: {total_return:.2f}")

# Example setup
env = YourGymEnv()
policy_net = YourPolicyNet()  # returns a torch.distributions.Distribution
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)
train_policy(policy_net, optimizer, env, num_episodes=1000)
There you go! A minimum viable policy-gradient method. In practice, we would be way more sophisticated than this.
9 Value function methods
Another way to slice sequential learning is to learn a value function, \(Q(s,a)\).
Think of \(Q(s,a)\) as “how good is it to take action \(a\) in state \(s\), and then keep acting optimally afterward?” If we somehow knew the true \(Q^*(s,a)\), the problem would be done — just pick \[
a = \arg\max_a Q^*(s,a),
\] and you’re acting optimally.
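For a small discrete problem we can hold \(Q\) in a table; a minimal sketch, with placeholder sizes:

import torch

n_states, n_actions = 500, 6              # placeholder sizes for a small discrete MDP
Q = torch.zeros(n_states, n_actions)      # Q[s, a] ≈ value of taking a in s

def greedy_action(Q, s):
    """Act greedily with respect to the current Q estimate."""
    return int(Q[s].argmax())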
This idea comes from the Bellman equation, which works backward from the future: \[
Q^*(s,a) = \mathbb{E}\Bigl[\,r + \gamma \max_{a'} Q^*(s',a') \;\Bigm|\; s,a\Bigr].
\] It says: the value of doing \(a\) in \(s\) is the immediate reward \(r\) plus the discounted value of the best thing you could do next. Same recursive logic as in optimal control — just applied to learning by trial and error.
So we can think of the return from \((s,a)\) as having two pieces:
Immediate reward\(r\), for doing \(a\) in \(s\).
Future value, the best we can expect from wherever we land next, \(\gamma \max_{a'} Q(s_{t+1},a')\).
The neat part is we don’t need to know the full dynamics of the environment. We just sample transitions \((s,a) \to (r,s')\) as we go and update our guesses.
9.1 Bootstrapping and the Temporal Difference update
When we actually take \(a_t\) in \(s_t\) and see \((r_t, s_{t+1})\), we can form a “one-step bootstrap” target — basically our best guess at what the true value should be: \[
r_t + \gamma \max_{a'} Q(s_{t+1},a').
\] Compare that to what we thought before, \(Q(s_t,a_t)\). The difference between them is the temporal-difference (TD) error: \[
\delta_t = r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t).
\] Then we nudge our estimate in the right direction: \[
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,\delta_t.
\] Here, \(\alpha\) is the learning rate — how fast we update — and \(\gamma\) is the usual discount factor for future rewards.
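The update itself is one line of code; a sketch assuming the tabular \(Q\) from the snippet above, with illustrative default step sizes:

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: nudge Q[s, a] toward the one-step bootstrap target."""
    target = r + gamma * Q[s_next].max()
    delta = target - Q[s, a]       # TD error
    Q[s, a] += alpha * delta
    return delta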
9.2 Exploration
If we just keep picking the current best-looking action, we get stuck fast, often with an overconfident bad choice. So we mix in some randomness.
The simplest hack is ε-greedy:
With probability \(\epsilon\), pick a random action (“explore”).
With probability \(1 - \epsilon\), pick the action with the highest \(Q(s,a)\) (“exploit”).
Usually \(\epsilon\) starts high and decays over time — curious early, confident later.
I find ε-greedy ugly but it works and it’s easy to code. There are more elegant exploration schemes — Bayesian ones, for instance — which show up again in bandit problems.
LLM-generated conversion of that word salad into PyTorch:
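What follows is a minimal sketch of that: tabular ε-greedy Q-learning, reusing the greedy_action and td_update helpers above. YourDiscreteGymEnv is a placeholder for a small environment with Discrete observation and action spaces and the classic gym step API, and the hyperparameters are illustrative.

import random
import torch

env = YourDiscreteGymEnv()   # placeholder: a small env with Discrete spaces
Q = torch.zeros(env.observation_space.n, env.action_space.n)

epsilon, eps_decay, alpha, gamma = 1.0, 0.995, 0.1, 0.99
for episode in range(1000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.random() < epsilon:
            a = random.randrange(env.action_space.n)
        else:
            a = greedy_action(Q, s)
        s_next, r, done, _ = env.step(a)
        td_update(Q, s, a, r, s_next, alpha=alpha, gamma=gamma)
        s = s_next
    epsilon = max(0.05, epsilon * eps_decay)   # curious early, confident later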
This book provides a broad introduction to algorithms for decision making under uncertainty. We cover a wide variety of topics related to decision making, introducing the underlying mathematical problem formulations and the algorithms for solving them.
Includes much of interest, including multi-agent learning.