Reinforcement learning

2014-11-27 — 2026-01-13

Wherein Michie’s MENACE matchboxes are evoked to introduce policy‑gradient and value methods, exploration pathologies and Go‑Explore are outlined, and simple PyTorch recipes are provided.

adaptive
agents
bandit problems
control
incentive mechanisms
learning
networks
stochastic processes
time series
utility

Here’s an intro to machine learning through a historical tale about one attempt to teach a machine (not a computer!) to play tic-tac-toe. Rodney Brooks, in Machine Learning Explained, introduces Donald Michie’s MENACE, a machine that plays tic-tac-toe using a collection of matchboxes, one per board state. Each matchbox contains beads, each representing a possible move from that state. The machine plays by randomly drawing a bead from the matchbox corresponding to the current state of the game. If it wins, beads for the moves it played are added to their matchboxes; if it loses, they are removed. This algorithm is not optimal, but it’s an interesting first step. Learning to make it optimal is a reinforcement learning (RL) problem.
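
For concreteness, here’s a minimal sketch of a MENACE-style learner in Python. The board encoding, initial bead counts and reinforcement amounts are my own illustrative choices, not Michie’s original schedule.

MENACE-style bead update
import random

# One "matchbox" per board state, holding bead counts per legal move.
boxes = {}


def legal_moves(board):
    """board: a 9-character string over 'X', 'O' and ' '."""
    return [i for i, cell in enumerate(board) if cell == " "]


def choose_move(board):
    """Draw a bead at random, weighted by the bead counts for this state."""
    if board not in boxes:
        boxes[board] = {m: 3 for m in legal_moves(board)}  # illustrative initial count
    moves, weights = zip(*boxes[board].items())
    return random.choices(moves, weights=weights)[0]


def reinforce(history, won):
    """After a game: add a bead for every (state, move) we played if we won;
    remove one (down to a floor of 1) if we lost."""
    for board, move in history:
        boxes[board][move] = max(1, boxes[board][move] + (1 if won else -1))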

1 Theory

This isn’t the post to learn RL from; there are many fine introductions online, and the ones listed below are great.

Nonetheless, I need to learn some more RL and fix some notation, so there’s a tutorial below comprising my own notes, if we want to read it.

2 Opponent shaping

Opponent shaping is when agents influence other agents in multi-agent systems by modelling them. This topic was interesting enough to get its own notebook; see opponent shaping.

3 Practice

4 Variants

4.1 Without reward

See empowerment.

4.2 With generative action spaces

Learning to act with generative models.

4.3 Hierarchical

See hierarchical reinforcement learning.

4.4 Via diffusion

Is Conditional Generative Modeling all you need for Decision-Making? (Ajay et al. 2023).

5 Tutorial

5.1 Markov decision process

What RL is designed to “solve”. An MDP is given by \((\mathcal{S},\mathcal{A},P,R,\gamma)\):

  1. State space \(\displaystyle \mathcal S\)

    All possible configurations the environment can be in. At time \(t\), the environment is in some state \(s_t\in\mathcal S\).

  2. Action space \(\displaystyle \mathcal A\)

    All moves or controls the agent can choose. The agent picks \(a_t\in\mathcal A\) when in state \(s_t\).

  3. Transition dynamics \(\displaystyle P\bigl(s_{t+1}\mid s_t,a_t\bigr)\)

    A (possibly stochastic) rule for how the world updates when we take action \(a_t\) in state \(s_t\). E.g. in CartPole, push left/right → new pole‑angle and cart‑position.

    For the purposes of describing the problem we write this down, but in practice we don’t actually know it; we can only sample from it. E.g. in CartPole, we don’t predict the next angle of the pole, but we can sample it by running the simulation.

    I mean, we can predict it in CartPole, but we aren’t required to, and fully black‑box systems are amenable to RL.

  4. Reward function \(\displaystyle R\bigl(s_t,a_t,s_{t+1}\bigr)\)

    The scalar “feedback” we get when we go from \(s_t\) to \(s_{t+1}\) via \(a_t\). High reward = good; low (or negative) = bad.

    Once again, we don’t know this function; we just sample it after taking an action. E.g. in CartPole, we get a reward of +1 for every time step we keep the pole upright.

  5. Discount factor \(\displaystyle \gamma\in[0,1]\)

    How much we prefer “now” rewards over “later” rewards: \(\gamma=0\) → we only care about immediate reward; \(\gamma\to1\) → we care a lot about the future.

  6. Initial state distribution \(\displaystyle \rho_0(s)\)

    Where episodes start: we sample \(s_0\sim\rho_0\). Often a fixed start state or a uniform/random choice.

    People often suppress this in the problem; I don’t think it adds much complexity in practice.

  7. Horizon \(T\) (or termination condition)

    Max length of an episode (or “done” flag). If infinite, we rely on \(\gamma<1\) to keep returns finite. This is often suppressed, too.

The next few are convenient definitions built on the components above.

  1. Policy \(\displaystyle \pi_\theta(a\mid s)\)

    Our agent’s “strategy”: a mapping from state→distribution over actions. Parameterized by \(\theta\) (e.g. the weights of a neural net).

  2. Trajectory \(\displaystyle \tau = (s_0,a_0,r_0,\dots,s_{T})\)

    One full run‑through of states, actions, and rewards until done.

  3. Return \(\displaystyle G_t=\sum_{k=t}^{T-1}\gamma^{\,k-t}\,r_k\)

    The total (discounted) sum of rewards from time \(t\) onward — what we’re trying to maximize on average.

  4. Objective \[ J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\bigl[G_0\bigr] = \mathbb{E}\Bigl[\sum_{t=0}^{T-1}\gamma^t\,r_t\Bigr]. \]

We aim to choose \(\theta\) to maximize the expected return.

If we knew \(P\) and \(R\) and their functional forms were tractable, we could solve the MDP using dynamic programming. But we don’t know those functions, so we’ll need another approach.
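
To make the “we can only sample” point concrete, here’s what interacting with an MDP looks like in code. A minimal sketch against the classic Gym API (reset returning a state, step returning a 4-tuple), matching the snippets further down; the uniform-random policy is just a placeholder.

Environment loop
import gym

# We never see P or R as functions; we only observe sampled
# transitions (s_t, a_t, r_t, s_{t+1}) by acting in the environment.
env = gym.make("CartPole-v1")
gamma = 0.99

state = env.reset()                         # s_0 ~ rho_0
done = False
G, discount, t = 0.0, 1.0, 0

while not done:
    action = env.action_space.sample()      # placeholder policy: uniform random
    next_state, reward, done, info = env.step(action)  # one sample from P and R
    G += discount * reward                  # accumulate the discounted return G_0
    discount *= gamma
    state = next_state
    t += 1

print(f"Episode length {t}, discounted return G_0 = {G:.2f}")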

5.2 Policy gradient


The simplest way to learn \(\theta\) is REINFORCE, aka the score function gradient estimator.

We want to maximize the expected return \[ J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Bigl[\sum_{t=0}^{T-1}\gamma^t\,r_t\Bigr] = \int \! \pi_\theta(\tau)\,R(\tau)\,d\tau, \] where \(R(\tau)=\sum_{t=0}^{T-1}\gamma^t r_t\) is the return of a whole trajectory. Using the score function trick, \[ \nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau), \] we get \[ \nabla_\theta J(\theta) = \int \! \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\,d\tau = \mathbb{E}_{\tau}\bigl[\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\bigr]. \] Since the transition dynamics don’t depend on \(\theta\), \(\nabla_\theta\log \pi_\theta(\tau)=\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t\mid s_t)\). Finally, rewards received before time \(t\) can’t be influenced by \(a_t\), so each log-probability term only needs to be weighted by the reward-to-go \(G_t\): \[ \boxed{\;\nabla_\theta J(\theta) =\mathbb{E}_{\tau}\Bigl[\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t\mid s_t)\;G_t\Bigr], \quad G_t=\sum_{k=t}^{T-1}\gamma^{k-t}r_k\;.} \]

Approximating the expectation with Monte Carlo is straightforward. In practice, we sample \(N\) full episodes \(\{\tau^{(i)}\}_{i=1}^N\) and approximate the expectation as:

\[ \nabla_\theta J(\theta) \approx \frac1N\sum_{i=1}^N\sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta\bigl(a_t^{(i)}\mid s_t^{(i)}\bigr)\,G_t^{(i)}. \]

We then obtain the following gradient-ascent update rule: \[ \theta \;\leftarrow\; \theta \;+\;\alpha\;\frac1N\sum_{i,t} \nabla_\theta\log\pi_\theta(a_t^{(i)}\mid s_t^{(i)})\,G_t^{(i)}. \]

I asked an LLM to generate a PyTorch implementation of this:

Policy learning
import torch
import torch.nn as nn
import torch.optim as optim


def compute_returns(rewards, gamma):
    """
    Input: list of rewards [r0, r1, ..., r_{T-1}]
    Output: torch.Tensor of discounted returns [G0, G1, ..., G_{T-1}]
    """
    returns = []
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return torch.tensor(returns, dtype=torch.float32)


def select_action(policy_net, state):
    """
    Samples an action and returns (action, log_prob).
    Assumes policy_net(state) returns a PyTorch distribution.
    """
    dist = policy_net(state)
    a = dist.sample()
    return a, dist.log_prob(a)


def train_policy(policy_net, optimizer, env, num_episodes, gamma=0.99):
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        rewards = []

        # 1) Generate one full episode
        done = False
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32)
            action, logp = select_action(policy_net, state_tensor)

            next_state, r, done, _ = env.step(action.item())
            log_probs.append(logp)
            rewards.append(r)
            state = next_state

        # 2) Compute discounted returns G_t
        returns = compute_returns(rewards, gamma)  # shape [T]

        # 3) Policy loss: -(sum_t logπ(a_t|s_t) * G_t) / T
        log_probs = torch.stack(log_probs)  # shape [T]
        loss = -(log_probs * returns).mean()

        # 4) Gradient ascent step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # (Optional) logging
        total_return = sum(rewards)
        if episode % 10 == 0:
            print(f"Ep {episode:4d}  Return: {total_return:.2f}")


# Example setup
env = YourGymEnv()
policy_net = YourPolicyNet()  # returns a torch.distributions.Distribution
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)

train_policy(policy_net, optimizer, env, num_episodes=1000)

There we go — a minimum viable policy-gradient method. In practice, we’d be much more sophisticated than this.
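
As one small illustration of “more sophisticated” (my own aside, not part of the recipe above): the raw REINFORCE estimator is high-variance, and the usual first fix is to subtract a baseline from the returns. A cheap stand-in is to standardise the returns within each episode, sketched below.

Return standardization
import torch


def standardize_returns(returns, eps=1e-8):
    """Variance-reduction tweak: centre and scale the returns so that
    better-than-average timesteps get positive weight and worse-than-average
    ones get negative weight. A crude stand-in for a learned baseline."""
    return (returns - returns.mean()) / (returns.std() + eps)


# Drop-in change to step 3 of train_policy above:
#   loss = -(log_probs * standardize_returns(returns)).mean()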

5.3 Value function methods

Another way to approach sequential decision-making is to learn an action-value function, \(Q(s,a)\).

Think of \(Q(s,a)\) as “how good it is to take action \(a\) in state \(s\) and then act optimally afterwards.” If we knew the true \(Q^*(s,a)\), the problem would be solved — we’d just pick \[ a = \arg\max_{a'} Q^*(s,a') \] in every state, and we’d be acting optimally.

This idea comes from the Bellman equation, which reasons backward from the future: \[ Q^*(s,a) = \mathbb{E}\Bigl[\,r + \gamma \max_{a'} Q^*(s',a') \,\Bigm|\, s,a\Bigr]. \] This says the value of doing \(a\) in \(s\) equals the immediate reward \(r\) plus the discounted value of the best action we can take next. It’s the same recursive logic as in optimal control — just applied to learning by trial and error.

So we can think of the return from \((s,a)\) as having two parts:

  1. Immediate reward \(r\) for taking \(a\) in \(s\).

  2. Future value: the best we can expect from the next state, \(\gamma \max_{a'} Q(s_{t+1},a')\).

The neat part is that we don’t need to know the environment’s full dynamics. We just sample transitions \((s,a) \to (r,s')\) as we go and update our estimates.

5.4 Bootstrapping and the Temporal Difference update

When we take \(a_t\) in \(s_t\) and observe \((r_t, s_{t+1})\), we form a “one-step bootstrap” target — our best single-step estimate of the true value: \[ r_t + \gamma \max_{a'} Q(s_{t+1},a'). \] Compare that with what we thought before: \(Q(s_t,a_t)\). That’s the temporal-difference (TD) error: \[ \delta_t = r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t). \] Then we nudge our estimate in the right direction: \[ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,\delta_t. \] Here, \(\alpha\) is the learning rate — how quickly we update — and \(\gamma\) is the usual discount factor for future rewards.
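
In tabular form, this whole update is a couple of lines. A minimal sketch, assuming a small discrete problem where \(Q\) fits in a dictionary; the state and action encodings are whatever hashable objects the environment hands us.

Tabular TD update
from collections import defaultdict


def td_update(Q, s, a, r, s_next, done, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)),
    dropping the bootstrap term when the episode has ended."""
    bootstrap = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    delta = r + gamma * bootstrap - Q[(s, a)]   # the TD error delta_t
    Q[(s, a)] += alpha * delta
    return delta


# Usage sketch: Q defaults to 0 for unseen (state, action) pairs.
Q = defaultdict(float)
delta = td_update(Q, s="s0", a=0, r=1.0, s_next="s1", done=False, actions=[0, 1])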

5.5 Exploration

If we keep picking the current best-looking action, we get stuck quickly — we often end up with an overconfident bad choice. So we mix in some randomness.

The simplest hack is ε-greedy:

  • With probability \(\epsilon\), we pick a random action (“explore”).
  • With probability \(1 - \epsilon\), we pick the action with the highest \(Q(s,a)\) (“exploit”).

Usually \(\epsilon\) starts high and decays over time — curious early, confident later.

I find ε-greedy ugly, but it works and it’s easy to implement. There are more elegant exploration schemes — Bayesian ones, for instance — which we’ll see again in bandit problems.

Here’s an LLM-generated conversion of that word salad into PyTorch:

Q Learning
import torch
import torch.nn as nn
import torch.optim as optim
import random

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim)
        )
    def forward(self, state):
        # state: torch.Tensor of shape [batch, state_dim]
        return self.net(state)  # returns [batch, action_dim]

def select_action(q_net, state, eps, device):
    """
    ε‑greedy selection
    state: numpy array or list → torch.Tensor [1, state_dim]
    returns: action (int)
    """
    if random.random() < eps:
        return random.randrange(q_net.net[-1].out_features)
    state_v = torch.tensor([state], dtype=torch.float32, device=device)
    with torch.no_grad():
        q_vals = q_net(state_v)  # [1, A]
    return int(q_vals.argmax(dim=1).item())

def train_qlearning(
    env,
    q_net,
    optimizer,
    num_episodes,
    gamma=0.99,
    eps_start=1.0,
    eps_end=0.01,
    eps_decay=0.995,
    device="cpu"
):
    eps = eps_start
    for ep in range(1, num_episodes + 1):
        state = env.reset()
        done = False
        total_reward = 0.0

        while not done:
            # 1) Pick action
            a = select_action(q_net, state, eps, device)

            # 2) Step environment
            next_state, r, done, _ = env.step(a)
            total_reward += r

            # 3) Compute TD target
            state_v      = torch.tensor([state],      dtype=torch.float32, device=device)
            next_state_v = torch.tensor([next_state], dtype=torch.float32, device=device)
            q_vals       = q_net(state_v)            # [1, A]
            next_q_vals  = q_net(next_state_v)       # [1, A]
            q_sa         = q_vals[0, a]
            target       = r + (0.0 if done else gamma * next_q_vals.max().item())

            # 4) Compute loss & backprop
            loss = (q_sa - target) ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # 5) Move on
            state = next_state

        # 6) Decay epsilon
        eps = max(eps_end, eps * eps_decay)

        # Logging
        if ep % 10 == 0:
            print(f"Episode {ep:4d} | Return: {total_reward:.2f} | ε={eps:.3f}")

# Example usage:
import gym

env       = gym.make("CartPole-v1")
device    = torch.device("cuda" if torch.cuda.is_available() else "cpu")
q_net     = QNetwork(state_dim=env.observation_space.shape[0],
                     action_dim=env.action_space.n).to(device)
optimizer = optim.Adam(q_net.parameters(), lr=1e-3)

train_qlearning(env, q_net, optimizer, num_episodes=500)

5.6 Actor-critic

TBC

6 Regularization: PPO

TBC

7 Model-based RL

TBC

8 With memory

The article “Uber has cracked two classic ’80s video games by giving an AI algorithm a new type of memory” summarizes the paper by Ecoffet et al. (2021).

A grand challenge in reinforcement learning is intelligent exploration, especially when rewards are sparse or deceptive. Two Atari games serve as benchmarks for such hard-exploration domains: Montezuma’s Revenge and Pitfall. On both games, current RL algorithms perform poorly, even those with intrinsic motivation, which is the dominant method to improve performance on hard-exploration domains. To address this shortfall, we introduce a new algorithm called Go-Explore. It exploits the following principles: (1) remember previously visited states, (2) first return to a promising state (without exploration), then explore from it, and (3) solve simulated environments through any available means (including by introducing determinism), then robustify via imitation learning. The combined effect of these principles is a dramatic performance improvement on hard-exploration problems.

9 Reinforcement learning causally

Learning to act clearly sounds like learning counterfactuals and interventions, right? So surely there is a causal angle?

Yes. See Causality in learning to act.

Schulte and Poupart (2024):

Reinforcement learning (RL) and causal modelling naturally complement each other. The goal of causal modelling is to predict the effects of interventions in an environment, while the goal of reinforcement learning is to select interventions that maximize the rewards the agent receives from the environment. Reinforcement learning includes the two most powerful sources of information for estimating causal relationships: temporal ordering and the ability to act on an environment. This paper examines which reinforcement learning settings we can expect to benefit from causal modelling, and how. In online learning, the agent has the ability to interact directly with their environment, and learn from exploring it. Our main argument is that in online learning, conditional probabilities are causal, and therefore offline RL is the setting where causal learning has the most potential to make a difference. Essentially, the reason is that when an agent learns from their own experience, there are no unobserved confounders that influence both the agent’s own exploratory actions and the rewards they receive. Our paper formalizes this argument. For offline RL, where an agent may and typically does learn from the experience of others, we describe previous and new methods for leveraging a causal model, including support for counterfactual queries.

10 Incoming

11 References

Ajay, Du, Gupta, et al. 2023. “Is Conditional Generative Modeling All You Need for Decision-Making?” In.
Bensoussan, Li, Nguyen, et al. 2020. “Machine Learning and Control Theory.” arXiv:2006.05604 [Cs, Math, Stat].
Berrueta, Pinosky, and Murphey. 2024. “Maximum Diffusion Reinforcement Learning.” Nature Machine Intelligence.
Black, Janner, Du, et al. 2023. “Training Diffusion Models with Reinforcement Learning.” In.
Brockman, Cheung, Pettersson, et al. 2016. “OpenAI Gym.” arXiv:1606.01540 [Cs].
Chen, Lu, Rajeswaran, et al. 2021. “Decision Transformer: Reinforcement Learning via Sequence Modeling.” In Advances in Neural Information Processing Systems.
Clifton, and Laber. 2020. “Q-Learning: Theory and Applications.” Annual Review of Statistics and Its Application.
Drori. 2022a. “Deep Reinforcement Learning.” In The Science of Deep Learning.
———. 2022b. “Reinforcement Learning.” In The Science of Deep Learning.
———. 2022c. The Science of Deep Learning.
Eberhard, Muehlebach, and Vernade. 2025. “Partially Observable Reinforcement Learning with Memory Traces.”
Ecoffet, Huizinga, Lehman, et al. 2021. “Go-Explore: A New Approach for Hard-Exploration Problems.”
Elliott, Urdshals, Quarel, et al. 2026. “Stagewise Reinforcement Learning and the Geometry of the Regret Landscape.”
Fellows, Mahajan, Rudner, et al. 2019. “VIREL: A Variational Inference Framework for Reinforcement Learning.” In Advances in Neural Information Processing Systems.
Foster, Han, Qian, et al. 2024. “Online Estimation via Offline Estimation: An Information-Theoretic Framework.”
Foster, and Rakhlin. 2023. “Foundations of Reinforcement Learning and Interactive Decision Making.”
Hu, Com, and Wellman. n.d. “Nash Q-Learning for General-Sum Stochastic Games.”
Jaakkola, Singh, and Jordan. 1995. “Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems.” In Advances in Neural Information Processing Systems.
Kaelbling, Littman, and Moore. 1996. “Reinforcement Learning: A Survey.” Journal of Artificial Intelligence Research.
Kochenderfer, Wheeler, and Wray. 2022. Algorithms for Decision Making.
Korbak, Perez, and Buckley. 2022. “RL with KL Penalties Is Better Viewed as Bayesian Inference.”
Krakovsky. 2016. “Reinforcement Renaissance.” Commun. ACM.
Krishnamurthy, Agarwal, and Langford. 2016. “Contextual-MDPs for PAC-Reinforcement Learning with Rich Observations.” arXiv:1602.02722 [Cs, Stat].
Lehman, Gordon, Jain, et al. 2022. “Evolution Through Large Models.”
Levine. 2018. “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.”
Lowe, Wu, Tamar, et al. 2020. “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments.”
Mania, Guy, and Recht. 2018. “Simple Random Search Provides a Competitive Approach to Reinforcement Learning.” arXiv:1803.07055 [Cs, Math, Stat].
Mukherjee, and Liu. 2023. “Bridging Physics-Informed Neural Networks with Reinforcement Learning: Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO).”
Novelli, Pratticò, Pontil, et al. 2024. “Operator World Models for Reinforcement Learning.”
Oberst, and Sontag. 2019. “Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models.” In Proceedings of the 36th International Conference on Machine Learning.
Parisotto, and Salakhutdinov. 2017. “Neural Map: Structured Memory for Deep Reinforcement Learning.” arXiv:1702.08360 [Cs].
Pfau, and Vinyals. 2016. “Connecting Generative Adversarial Networks and Actor-Critic Methods.” arXiv:1610.01945 [Cs, Stat].
Ramírez-Ruiz, Grytskyy, Mastrogiuseppe, et al. 2024. “Complex Behavior from Intrinsic Motivation to Occupy Future Action-State Path Space.” Nature Communications.
Ren, Zhang, Lee, et al. 2023. “Spectral Decomposition Representation for Reinforcement Learning.”
Ringstrom. 2022. “Reward Is Not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning.”
Salimans, Ho, Chen, et al. 2017. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning.” arXiv:1703.03864 [Cs, Stat].
Schulman, Moritz, Levine, et al. 2018. “High-Dimensional Continuous Control Using Generalized Advantage Estimation.”
Schulman, Wolski, Dhariwal, et al. 2017. “Proximal Policy Optimization Algorithms.”
Schulte, and Poupart. 2024. “Why Online Reinforcement Learning Is Causal.”
Shibata, Yoshinaka, and Chikayama. 2006. “Probabilistic Generalization of Simple Grammars and Its Application to Reinforcement Learning.” In Algorithmic Learning Theory. Lecture Notes in Computer Science 4264.
Silver, Singh, Precup, et al. 2021. “Reward Is Enough.” Artificial Intelligence.
Sutton, Richard S., and Barto. 1998. Reinforcement Learning.
Sutton, Richard S., and Barto. 2018. Reinforcement Learning, second edition: An Introduction.
Sutton, Richard S., McAllester, Singh, et al. 2000. “Policy Gradient Methods for Reinforcement Learning with Function Approximation.” In Advances in Neural Information Processing Systems.
Tang, and Munos. 2025. “On a Few Pitfalls in KL Divergence Gradient Estimation for RL.”
Thrun. 1992. “Efficient Exploration In Reinforcement Learning.”