Wherein is related a historical account of MENACE, a matchbox machine taught to play tic‑tac‑toe by adding and removing beads as reward, and basic policy‑ and value‑based RL methods are outlined.
adaptive
agents
bandit problems
control
incentive mechanisms
learning
networks
stochastic processes
time series
utility
Figure 1: MENACE, Donald Michie’s matchbox machine that learns to play tic‑tac‑toe.
Here’s an intro to machine learning through a historical tale about one attempt to teach a machine (not a computer!) to play tic-tac-toe. Rodney Brooks, in Machine Learning Explained, introduces Donald Michie’s MENACE, a machine that plays tic-tac-toe using a collection of matchboxes. Each matchbox corresponds to a board state and contains beads, each representing a possible move from that state. The machine plays by drawing a bead at random from the matchbox matching the current state of the game. If it wins the game, beads for the moves it played are added to the matchboxes it used; if it loses, some of those beads are removed. This algorithm is not optimal, but it’s an interesting first step. Learning to make it optimal is a reinforcement learning (RL) problem.
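To make the mechanism concrete, here is a minimal sketch of the matchbox scheme in Python. The function names and the bead counts are illustrative assumptions, not Michie’s exact parameters:

```python
import random
from collections import defaultdict

# Bead counts per board state: beads[state][move] = number of beads for that move.
# A "state" is any hashable encoding of the board; starting with 3 beads per move
# is an illustrative choice, not Michie's exact scheme.
beads = defaultdict(lambda: defaultdict(lambda: 3))

def menace_move(state, legal_moves):
    """Draw a move with probability proportional to its bead count."""
    weights = [beads[state][m] for m in legal_moves]
    return random.choices(legal_moves, weights=weights, k=1)[0]

def menace_update(history, won):
    """After a game, reinforce or punish every (state, move) pair MENACE played."""
    for state, move in history:
        if won:
            beads[state][move] += 1                              # add beads
        else:
            beads[state][move] = max(1, beads[state][move] - 1)  # remove beads
```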
1 Theory
This isn’t the post to learn RL from; there are many fine introductions online, and the ones listed below are great.
Nonetheless, I need to learn some more RL and fix some notation, so there is a tutorial below, comprising my own notes, for anyone who cares to read them.
Opponent shaping is when agents influence other agents in multi-agent systems by using models of the other agents. This topic was interesting enough to get its own notebook; see opponent shaping.
Transition dynamics \(\displaystyle P(s_{t+1}\mid s_t,a_t)\)
A (possibly stochastic) rule for how the world updates when we take action \(a_t\) in state \(s_t\). E.g. in CartPole, push left/right → new pole‑angle and cart‑position.
For the purposes of describing the problem we write this down, but in practice we don’t actually know it; all we can do is sample from it (a sketch of the resulting interface follows these definitions). E.g. in CartPole, we don’t predict the next angle of the pole, but we can sample it by running the simulation.
I mean, we can predict it in CartPole, but we aren’t required to, and fully black‑box systems are amenable to RL.
Reward \(\displaystyle r_t = R(s_t,a_t,s_{t+1})\)
The scalar “feedback” we get when we go from \(s_t\) to \(s_{t+1}\) via \(a_t\). High reward = good; low (or negative) = bad.
Once again, we don’t know this function; we just sample it after taking an action. E.g. in CartPole, we get a reward of +1 for every time step we keep the pole upright.
Discount factor \(\displaystyle \gamma\in[0,1]\)
How much we prefer “now” rewards over “later” rewards: \(\gamma=0\) → we only care about immediate reward; \(\gamma\to1\) → we care a lot about the future. (With \(\gamma=0.99\), for example, a reward 100 steps away still counts for \(0.99^{100}\approx 0.37\) of its face value.)
Initial state distribution \(\displaystyle \rho_0(s)\)
Where episodes start: we sample \(s_0\sim\rho_0\). Often a fixed start state or a uniform/random choice.
People often suppress this in the problem; I don’t think it adds much complexity in practice.
Horizon \(T\) (or termination condition)
Max length of an episode (or “done” flag). If infinite, we rely on \(\gamma<1\) to keep returns finite. This is often suppressed, too.
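In code, these ingredients usually show up implicitly through an environment interface rather than as explicit functions. Here is a minimal sketch of the standard gym-style loop, in which `env.step` samples from \(P\) and \(R\) for us; `env` and `policy` are placeholders, and I assume the older 4-tuple `step` API used elsewhere in this post:

```python
# How the MDP ingredients appear through a gym-style interface:
# we never see P or R directly; env.step() samples a transition and a reward.
gamma = 0.99                       # discount factor
state = env.reset()                # s_0 ~ rho_0, the initial state distribution
done = False                       # the horizon / termination condition
discounted_return, t = 0.0, 0
while not done:
    action = policy(state)                        # some (for now, arbitrary) decision rule
    state, reward, done, info = env.step(action)  # samples s_{t+1} ~ P and r_t
    discounted_return += (gamma ** t) * reward
    t += 1
```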
The next few are definitions that we find convenient to make, dependent upon the above.
Policy \(\displaystyle \pi_\theta(a\mid s)\)
Our agent’s “strategy”: a mapping from state→distribution over actions. Parameterized by \(\theta\) (e.g. the weights of a neural net).
We aim to choose \(\theta\) to maximize the expected return.
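For concreteness, here is one way such a parameterized policy might look in PyTorch: a small network that maps a state to a `torch.distributions.Categorical` over discrete actions. The layer sizes and the assumption of a discrete action space are mine, purely for illustration; the training code further down assumes a `policy_net` with exactly this kind of interface.

```python
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """A small policy network: state -> distribution over discrete actions."""
    def __init__(self, obs_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        logits = self.net(state)           # unnormalized log-probabilities
        return Categorical(logits=logits)  # pi_theta(a | s)
```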
If we knew \(P\) and \(R\), and their functional forms were tractable, we could solve the MDP using dynamic programming. Let’s assume we don’t know those functions.
We want to maximize the expected return. \[
J(\theta)
= \mathbb{E}_{\tau\sim\pi_\theta}\Bigl[\sum_{t=0}^{T-1}\gamma^t\,r_t\Bigr].
\] Writing \(R(\tau)=\sum_{t=0}^{T-1}\gamma^t r_t\) for the return of a trajectory \(\tau\), so that \(J(\theta)=\int \pi_\theta(\tau)\,R(\tau)\,d\tau\), the score function trick lets us write \[
\nabla_\theta \pi_\theta(\tau)
= \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau),
\] so we get \[
\nabla_\theta J(\theta)
= \int \! \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\,d\tau
= \mathbb{E}_{\tau}\bigl[\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\bigr].
\] Since \(\log \pi_\theta(\tau)=\sum_{t=0}^{T-1}\log \pi_\theta(a_t\mid s_t)\) plus terms that do not depend on \(\theta\) (the dynamics and the initial state), and since rewards collected before time \(t\) cannot be influenced by \(a_t\), we can swap the full return \(R(\tau)\) for the reward-to-go \(G_t\) and write \[
\boxed{\;\nabla_\theta J(\theta)
=\mathbb{E}_{\tau}\Bigl[\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t\mid s_t)\;G_t\Bigr],
\quad G_t=\sum_{k=t}^{T-1}\gamma^{k-t}r_k\;.}
\]
Approximating the expectation with Monte Carlo is straightforward. To do this in practice, we sample \(N\) full episodes \(\{\tau^{(i)}\}_{i=1}^N\) under the current policy and approximate the expectation as \[
\nabla_\theta J(\theta)\;\approx\;\frac1N\sum_{i=1}^{N}\sum_{t=0}^{T-1}
\nabla_\theta\log\pi_\theta(a_t^{(i)}\mid s_t^{(i)})\,G_t^{(i)}.
\]
We then obtain the following gradient-ascent update rule: \[
\theta \;\leftarrow\; \theta \;+\;\alpha\;\frac1N\sum_{i,t}
\nabla_\theta\log\pi_\theta(a_t^{(i)}\mid s_t^{(i)})\,G_t^{(i)}.
\]
I asked an LLM to generate a PyTorch implementation of this:
Policy learning
```python
import torch
import torch.nn as nn
import torch.optim as optim

def compute_returns(rewards, gamma):
    """
    Input: list of rewards [r0, r1, ..., r_{T-1}]
    Output: torch.Tensor of discounted returns [G0, G1, ..., G_{T-1}]
    """
    returns = []
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return torch.tensor(returns, dtype=torch.float32)

def select_action(policy_net, state):
    """
    Samples an action and returns (action, log_prob).
    Assumes policy_net(state) returns a PyTorch distribution.
    """
    dist = policy_net(state)
    a = dist.sample()
    return a, dist.log_prob(a)

def train_policy(policy_net, optimizer, env, num_episodes, gamma=0.99):
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        rewards = []

        # 1) Generate one full episode
        done = False
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32)
            action, logp = select_action(policy_net, state_tensor)
            next_state, r, done, _ = env.step(action.item())
            log_probs.append(logp)
            rewards.append(r)
            state = next_state

        # 2) Compute discounted returns G_t
        returns = compute_returns(rewards, gamma)  # shape [T]

        # 3) Policy loss: -(sum_t logπ(a_t|s_t) * G_t) / T
        log_probs = torch.stack(log_probs)  # shape [T]
        loss = -(log_probs * returns).mean()

        # 4) Gradient ascent step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # (Optional) logging
        total_return = sum(rewards)
        if episode % 10 == 0:
            print(f"Ep {episode:4d}  Return: {total_return:.2f}")

# Example setup
env = YourGymEnv()
policy_net = YourPolicyNet()  # returns a torch.distributions.Distribution
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)
train_policy(policy_net, optimizer, env, num_episodes=1000)
```
There we go! A minimum viable policy-gradient method. In practice, we’d be much more sophisticated than this.
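As one hint of what “more sophisticated” might mean: a common first refinement is variance reduction, subtracting a baseline from the returns before forming the loss. A minimal sketch, standardizing the per-episode returns, which would slot into step 3 of the code above (the \(10^{-8}\) term is just a guard against division by zero):

```python
# Variance reduction for REINFORCE: standardize the per-episode returns
# (a simple baseline trick) before multiplying by the log-probabilities.
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
loss = -(log_probs * returns).mean()
```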
10 Value function methods
Another way to approach sequential learning is to learn a value function, \(Q(s,a)\).
Think of \(Q(s,a)\) as answering the question “how good is it to take action \(a\) in state \(s\) and then act optimally afterwards?” If we knew the true \(Q^*(s,a)\), the problem would be solved — we’d just pick \[
a = \arg\max_a Q^*(s,a)
\] in each state and thereby act optimally.
This idea comes from the Bellman equation, which reasons backward from the future: \[
Q^*(s,a) = \mathbb{E}\Bigl[\,r + \gamma \max_{a'} Q^*(s',a') \;\Bigm|\; s,a\Bigr].
\] This says that the value of doing \(a\) in \(s\) is the immediate reward \(r\) plus the discounted value of the best action we could take next. It’s the same recursive logic as in optimal control — just applied to learning by trial and error.
So we can think of the return from \((s,a)\) as having two pieces:
Immediate reward \(r\) for taking \(a\) in \(s\).
Future value, the best we can expect from the next state, \(\gamma \max_{a'} Q(s_{t+1},a')\).
The neat part is that we don’t need to know the environment’s full dynamics. We just sample transitions \((s,a) \to (r,s')\) as we go and update our guesses.
10.1 Bootstrapping and the Temporal Difference update
When we take \(a_t\) in \(s_t\) and observe \((r_t, s_{t+1})\), we form a “one-step bootstrap” target — our best single-step guess at the true value: \[
r_t + \gamma \max_{a'} Q(s_{t+1},a').
\] Compare that with what we thought before: \(Q(s_t,a_t)\). That difference is the temporal-difference (TD) error: \[
\delta_t = r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t).
\] Then we nudge our estimate in the right direction: \[
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,\delta_t.
\] Here, \(\alpha\) is the learning rate — how quickly we update — and \(\gamma\) is the usual discount factor for future rewards.
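A quick worked example with made-up numbers: suppose \(Q(s_t,a_t)=2.0\), we observe \(r_t=1\), the best next-state value is \(\max_{a'}Q(s_{t+1},a')=5\), and \(\gamma=0.9\), \(\alpha=0.1\). The bootstrap target is \(1 + 0.9\times 5 = 5.5\), the TD error is \(\delta_t = 5.5 - 2.0 = 3.5\), and the update gives \(Q(s_t,a_t) \leftarrow 2.0 + 0.1\times 3.5 = 2.35\).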
10.2 Exploration
If we keep picking the current best-looking action, we get stuck quickly — we often land on an overconfident bad choice. So, we mix in some randomness.
The simplest hack is ε-greedy:
With probability \(\epsilon\), pick a random action (“explore”).
With probability \(1 - \epsilon\), pick the action with the highest \(Q(s,a)\) (“exploit”).
Usually \(\epsilon\) starts high and decays over time — curious early, confident later.
I find ε-greedy ugly, but it works and it’s easy to implement. There are more elegant exploration schemes — Bayesian ones, for instance — which we’ll see again in bandit problems.
Here’s an LLM-generated conversion of that word salad into PyTorch:
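What follows is a minimal sketch of one-step Q-learning with ε-greedy exploration, assuming a gym-style environment with a discrete action space (the older 4-tuple `step` API, as above) and a `q_net` that maps a state tensor to a vector of action values; the network, hyperparameters, and `YourGymEnv` are placeholders rather than a definitive implementation.

```python
import random
import torch
import torch.nn as nn
import torch.optim as optim

def select_action(q_net, state, epsilon, num_actions):
    """ε-greedy: random action with probability ε, otherwise argmax_a Q(s,a)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)       # explore
    with torch.no_grad():
        return int(q_net(state).argmax().item())   # exploit

def train_q(q_net, optimizer, env, num_episodes, gamma=0.99,
            eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    num_actions = env.action_space.n
    epsilon = eps_start
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        total_return = 0.0
        while not done:
            s = torch.tensor(state, dtype=torch.float32)
            a = select_action(q_net, s, epsilon, num_actions)
            next_state, r, done, _ = env.step(a)
            s_next = torch.tensor(next_state, dtype=torch.float32)

            # One-step bootstrap target: r + gamma * max_a' Q(s', a') (zero if terminal)
            with torch.no_grad():
                target = r + (0.0 if done else gamma * q_net(s_next).max().item())

            # TD error delta = target - Q(s,a); minimizing delta^2 nudges Q toward the target
            q_sa = q_net(s)[a]
            loss = (q_sa - target) ** 2

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            state = next_state
            total_return += r

        epsilon = max(eps_end, epsilon * eps_decay)   # curious early, confident later
        if episode % 10 == 0:
            print(f"Ep {episode:4d}  Return: {total_return:.2f}  eps: {epsilon:.2f}")

# Example setup (placeholders)
# env = YourGymEnv()
# obs_dim, num_actions = env.observation_space.shape[0], env.action_space.n
# q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
# optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
# train_q(q_net, optimizer, env, num_episodes=1000)
```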
A grand challenge in reinforcement learning is intelligent exploration, especially when rewards are sparse or deceptive. Two Atari games serve as benchmarks for such hard-exploration domains: Montezuma’s Revenge and Pitfall. On both games, current RL algorithms perform poorly, even those with intrinsic motivation, which is the dominant method to improve performance on hard-exploration domains. To address this shortfall, we introduce a new algorithm called Go-Explore. It exploits the following principles: (1) remember previously visited states, (2) first return to a promising state (without exploration), then explore from it, and (3) solve simulated environments through any available means (including by introducing determinism), then robustify via imitation learning. The combined effect of these principles is a dramatic performance improvement on hard-exploration problems.
This post focuses on addressing this very obstacle: on-policy RL cannot learn from prompts for which the model’s generated traces receive zero reward. We first describe how classical exploration methods, such as exploration bonuses (see this cool paper), are not sufficient in the LLM setting and often lead to optimization pathologies. We then show how a more “proactive” approach based on conditioning on offline data from privileged sources can overcome this exploration bottleneck and enable RL to scale more effectively on hard problems.
This book provides a broad introduction to algorithms for decision making under uncertainty. We cover a wide variety of topics related to decision making, introducing the underlying mathematical problem formulations and the algorithms for solving them.
We found a lot of interesting material here, including multi-agent learning.