Here’s an intro to all of machine learning through a historical tale about one particular attempt to teach a machine (not a computer!) to play tic-tac-toe. Rodney Brooks, in Machine Learning Explained, introduces Donald Michie’s MENACE, a machine that plays tic-tac-toe using a box of matchboxes. Each matchbox corresponds to a board state and contains a number of beads, each representing a possible move. The machine plays by randomly drawing a bead from the matchbox corresponding to the current state of the game. If it wins, it adds more beads to the matchboxes it used; if it loses, it removes some. This algorithm is not optimal, but it is an interesting first step. Learning to make it optimal is a problem of reinforcement learning (RL).
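The bead-counting rule is simple enough to sketch in a few lines of Python. This is a toy sketch of my own (the real MENACE seeds each box with several beads per legal move and handles draws separately, which I gloss over here):

```python
import random

boxes = {}  # one "matchbox" per board state: {move: bead count}

def choose_move(state, legal_moves, initial_beads=3):
    """Draw a bead at random from the matchbox for this board state."""
    if state not in boxes:
        boxes[state] = {move: initial_beads for move in legal_moves}
    moves, beads = zip(*boxes[state].items())
    return random.choices(moves, weights=beads)[0]

def reinforce(history, won):
    """After a game: add a bead for each move played if we won, remove one if we lost."""
    for state, move in history:
        if won:
            boxes[state][move] += 1
        else:
            boxes[state][move] = max(1, boxes[state][move] - 1)  # never empty a box
```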
1 Theory
You shouldn’t read this blog post to learn about RL. There are many fine introductions on the internet; the ones referenced below are great.
Nonetheless, I need to learn some more RL and fix some notation, so there is a tutorial below, comprising my own notes, if you want to read them.
Opponent shaping is a way to get agents to influence each other in multi-agent systems by using models of the other agents. This was interesting enough to break out into its own notebook; see opponent shaping.
What RL is designed to “solve”. An MDP is given by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, \rho_0)$, whose ingredients are listed below; a toy example in code follows the list.
State space $\mathcal{S}$
All the possible configurations the environment can be in. At time $t$ the env “is” in some state $s_t \in \mathcal{S}$.
Action space $\mathcal{A}$
All the moves or controls the agent can choose. The agent picks $a_t \in \mathcal{A}$ when in state $s_t$.
Transition dynamics $P(s' \mid s, a)$
A (possibly stochastic) rule for how the world updates when you take action $a$ in state $s$. E.g. in CartPole, push left/right → new pole‑angle and cart‑position.
For the purposes of describing the problem we write this out, but in fact if we are bothering with RL we do not actually know this. All we do is sample from it. E.g. in CartPole, you don’t predict the next angle of the pole, but you can sample it by running the simulation.
I mean, you can predict it in CartPole, but you are not required to, and totally black-box systems are amenable to RL.
Reward function $R(s, a, s')$
The scalar “feedback” we get when we go from $s$ to $s'$ via $a$. High reward = good; low (or negative) = bad.
Once again, we do not know this function; we just sample from it after taking an action. E.g. in CartPole, you get a reward of +1 for every time step you keep the pole upright.
Discount factor $\gamma \in [0, 1]$
How much we prefer “now” rewards over “later” rewards: $\gamma = 0$ → we only care about immediate reward; $\gamma \to 1$ → we care a lot about the future.
Initial state distribution $\rho_0$
Where episodes start: you sample $s_0 \sim \rho_0$. Often a fixed start state or a uniform/random choice.
People often seem to suppress this in the problem statement; I don’t think it adds much complexity in practice.
Horizon (or termination condition)
Max length of an episode (or a “done” flag). If infinite, we rely on $\gamma < 1$ to keep returns finite. This is often suppressed also.
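To make the tuple concrete, here is a toy tabular MDP with two states and two actions, written out explicitly. The numbers are invented purely for illustration:

```python
import numpy as np

n_states, n_actions = 2, 2

# P[s, a, s'] = probability of landing in s' after taking action a in state s
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1
])

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([
    [0.0, 1.0],
    [0.5, 2.0],
])

gamma = 0.95                   # discount factor
rho0 = np.array([1.0, 0.0])    # initial state distribution: always start in state 0
horizon = 100                  # cap on episode length
```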
The next few are definitions that we find convenient to make, dependent upon the above.
Policy $\pi_\theta(a \mid s)$
Our agent’s “strategy”: a mapping from state → distribution over actions. Parameterized by $\theta$ (e.g. the weights of a neural net).
Trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$
One full run‑through of states, actions, and rewards until done.
Return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$
The total (discounted) sum of rewards from time $t$ onward; this is what we’re trying to maximize on average (a tiny worked example follows this list).
Objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G_0\right]$
We aim to pick $\theta$ to make the expected return as large as possible.
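For concreteness: with $\gamma = 0.9$ and a three-step episode with rewards $(1, 1, 1)$, the returns are
$$G_0 = 1 + 0.9 \cdot 1 + 0.9^2 \cdot 1 = 2.71, \qquad G_1 = 1 + 0.9 \cdot 1 = 1.9, \qquad G_2 = 1.$$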
Now if we knew $P$ and $R$ and their forms were tractable, we could just use dynamic programming to solve the MDP; let us assume we do not know them.
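For contrast, here is what that dynamic-programming route looks like: value iteration, sketched against the toy tabular MDP arrays ($P$, $R$, $\gamma$) written out above. It only works because $P$ and $R$ are known and tiny.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Classic dynamic programming: iterate the Bellman optimality backup to convergence.

    P has shape [S, A, S'], R has shape [S, A].
    Returns the optimal state values and the greedy policy.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# e.g. V_star, pi_star = value_iteration(P, R, gamma) on the toy MDP above
```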
We want to maximize the expected return
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\right].$$
Using the score function (log-derivative) trick, we can write
$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau),$$
so we get
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, G(\tau)\right].$$
Since
$$p_\theta(\tau) = \rho_0(s_0) \prod_t P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
and the dynamics terms do not depend on $\theta$, it follows that
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G(\tau)\right].$$
(A standard further argument replaces $G(\tau)$ in each term with the reward-to-go $G_t$; that is what the code below uses.)
Monte Carlo approximation of the expectation is straightforward. In practice, we sample $N$ full episodes and approximate:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta\!\left(a^{(i)}_t \mid s^{(i)}_t\right) G^{(i)}_t.$$
This yields the following gradient ascent update rule:
$$\theta \leftarrow \theta + \alpha\, \widehat{\nabla_\theta J(\theta)},$$
where $\alpha$ is a step size (learning rate).
I got an LLM to generate me an implementation of this in PyTorch:
Policy learning
```python
import torch
import torch.nn as nn
import torch.optim as optim


def compute_returns(rewards, gamma):
    """
    Input: list of rewards [r0, r1, ..., r_{T-1}]
    Output: torch.Tensor of discounted returns [G0, G1, ..., G_{T-1}]
    """
    returns = []
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return torch.tensor(returns, dtype=torch.float32)


def select_action(policy_net, state):
    """
    Samples an action and returns (action, log_prob).
    Assumes policy_net(state) returns a PyTorch distribution.
    """
    dist = policy_net(state)
    a = dist.sample()
    return a, dist.log_prob(a)


def train_policy(policy_net, optimizer, env, num_episodes, gamma=0.99):
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        rewards = []

        # 1) Generate one full episode
        done = False
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32)
            action, logp = select_action(policy_net, state_tensor)
            next_state, r, done, _ = env.step(action.item())
            log_probs.append(logp)
            rewards.append(r)
            state = next_state

        # 2) Compute discounted returns G_t
        returns = compute_returns(rewards, gamma)  # shape [T]

        # 3) Policy loss: -(sum_t logπ(a_t|s_t) * G_t) / T
        log_probs = torch.stack(log_probs)  # shape [T]
        loss = -(log_probs * returns).mean()

        # 4) Gradient ascent step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # (Optional) logging
        total_return = sum(rewards)
        if episode % 10 == 0:
            print(f"Ep {episode:4d}  Return: {total_return:.2f}")


# Example setup
env = YourGymEnv()
policy_net = YourPolicyNet()  # returns a torch.distributions.Distribution
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)

train_policy(policy_net, optimizer, env, num_episodes=1000)
```
There you go! A minimum viable policy-gradient method. In practice we would be way more sophisticated than this.
1.1 Value function methods
We can decompose the sequential learning problem slightly differently, learning a value function, $Q(s, a)$. We can think of $Q(s, a)$ as “how good is it to take action $a$ in state $s$ (and then behave optimally thereafter)?” If we knew $Q$, the optimal policy would be trivial: always pick $\arg\max_a Q(s, a)$.
This comes from a Bellman equation, which decomposes the rewards by working backwards to find the value of a decision now, given “more of the same” later. I recognise this from optimal control. The full return from $(s, a)$ can be decomposed into
Immediate reward $r$ for taking $a$ in $s$.
Best possible (discounted) future from the next state $s'$: $\gamma \max_{a'} Q(s', a')$.
Formally:
$$Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right].$$
We don’t need a model: we just need to sample transitions $(s, a, r, s')$.
When we take action $a$ in state $s$ and observe reward $r$ and next state $s'$, we form a “one-step bootstrap” target (or “Bellman backup”? Apparently people say that):
$$y = r + \gamma \max_{a'} Q(s', a').$$
The temporal difference (TD) error is
$$\delta = y - Q(s, a).$$
We then adjust $Q(s, a)$ toward the target by a fraction $\alpha$:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta.$$
The learning rate $\alpha$ balances how much weight we give to new info vs. old. $\gamma$ is the discount factor we saw before.
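A toy numerical check of that update: suppose $Q(s, a) = 2.0$, we observe $r = 1$, the best next value is $\max_{a'} Q(s', a') = 3.0$, and $\gamma = 0.9$, $\alpha = 0.1$. Then the target is $y = 1 + 0.9 \times 3.0 = 3.7$, the TD error is $\delta = 3.7 - 2.0 = 1.7$, and the updated estimate is $Q(s, a) \leftarrow 2.0 + 0.1 \times 1.7 = 2.17$.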
In practice this doesn’t work on its own, because we get spuriously confident about terrible actions we choose early on, so we use an exploration strategy. The simplest one that can possibly work is ε‑greedy: with probability $\varepsilon$, pick an action by some random method (“explore”); with probability $1 - \varepsilon$, pick $\arg\max_a Q(s, a)$ (“exploit”). We typically decay ε over time: start exploratory, then settle into purely greedy.
I hate that, but since I want some compact code I’ll use it for now. Some other exploration strategies are explored in bandit problems, including Bayesian ones that are to my mind more intuitive even if they are not quite as simple.
LLM-generated conversion of that word salad into PyTorch:
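Something like the following minimal sketch: one-step Q-learning with a neural $Q$ function and ε-greedy exploration, no replay buffer or target network. `YourQNet` and `YourGymEnv` are placeholders, and I assume the same old-style Gym API (`env.step` returning a four-tuple) as the policy-gradient code above.

```python
import random

import torch
import torch.optim as optim


def epsilon_greedy(q_net, state, epsilon, num_actions):
    """With probability epsilon explore at random; otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_net(state).argmax().item())


def train_q_learning(q_net, optimizer, env, num_episodes, gamma=0.99,
                     epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.05):
    num_actions = env.action_space.n
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        total_reward = 0.0
        while not done:
            state_t = torch.tensor(state, dtype=torch.float32)
            action = epsilon_greedy(q_net, state_t, epsilon, num_actions)
            next_state, r, done, _ = env.step(action)

            # One-step bootstrap target: y = r + gamma * max_a' Q(s', a')
            with torch.no_grad():
                next_t = torch.tensor(next_state, dtype=torch.float32)
                target = r + (0.0 if done else gamma * q_net(next_t).max().item())

            # TD error: nudge Q(s, a) toward the target
            q_sa = q_net(state_t)[action]
            loss = (q_sa - target) ** 2

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            state = next_state
            total_reward += r

        # Decay epsilon: start exploratory, settle into (mostly) greedy
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        if episode % 10 == 0:
            print(f"Ep {episode:4d}  Return: {total_reward:.2f}  eps: {epsilon:.2f}")


# Example setup (placeholders, as above)
# q_net = YourQNet()  # maps a state tensor to a vector of Q-values, one per action
# optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
# train_q_learning(q_net, optimizer, YourGymEnv(), num_episodes=1000)
```

Note that the learning rate $\alpha$ from the tabular update above becomes the optimizer’s learning rate here, since we update network weights rather than a table entry.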
Kochenderfer, Wheeler, and Wray, Algorithms for Decision Making: “This book provides a broad introduction to algorithms for decision making under uncertainty. We cover a wide variety of topics related to decision making, introducing the underlying mathematical problem formulations and the algorithms for solving them.”
Includes much of interest, including multi-agent learning.