Figure 1

Here’s an intro to all of machine learning through a historical tale about one particular attempt to teach a machine (not a computer!) to play tic-tac-toe. Rodney Brooks, in Machine Learning Explained, introduces Donald Michie’s MENACE, a machine that plays tic-tac-toe using a box of matchboxes. Each matchbox corresponds to a board state and contains a number of beads, each representing a possible move. The machine plays by randomly selecting a bead from the matchbox corresponding to the current state of the game. If it wins, it adds more beads for the moves it played; if it loses, it removes some beads. This algorithm is not optimal, but it is an interesting first step. Learning to make it optimal is a problem of reinforcement learning (RL).
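To make the mechanics concrete, here is a toy sketch of a MENACE-style player in Python. The board encoding, bead counts, and bonus/penalty sizes are arbitrary choices of mine, not Michie’s originals.

MENACE sketch
import random

boxes = {}  # one "matchbox" of beads per board state

def legal_moves(state):
    """Indices of empty squares; state is a 9-character string over 'X', 'O', '.'."""
    return [i for i, c in enumerate(state) if c == "."]

def choose_move(state, initial_beads=3):
    """Draw a bead at random from the matchbox for this state."""
    beads = boxes.setdefault(state, {m: initial_beads for m in legal_moves(state)})
    moves, counts = zip(*beads.items())
    return random.choices(moves, weights=counts)[0]

def update_after_game(history, won, bonus=3, penalty=1):
    """history: list of (state, move) pairs played this game.
    Add beads for the moves we played if we won; remove some if we lost."""
    for state, move in history:
        if won:
            boxes[state][move] += bonus
        else:
            boxes[state][move] = max(1, boxes[state][move] - penalty)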

1 Theory

You shouldn’t read this blog post to learn about RL. There are many fine introductions on the internet; the introductions below are great.

Nonetheless, I need to learn some more RL and fix some notation, so there is a tutorial below, comprising my own notes, if you want to read them.

2 Opponent shaping

Opponent shaping is a way to get agents to influence each other in multi-agent systems by using models of the other agents. This was interesting enough to break out into its own notebook; see opponent shaping.

3 Practice

4 Without reward

Ringstrom (2022), Ramírez-Ruiz et al. (2024), maybe Berrueta, Pinosky, and Murphey (2024)…?

5 Via diffusion

Is Conditional Generative Modeling all you need for Decision-Making? (Ajay et al. 2023).

6 Tutorial

6.1 Markov decision process

What RL is designed to “solve”. An MDP is given by $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, plus some bookkeeping (initial state distribution, horizon) that is often suppressed; a toy example is written out in code after the list below.

  1. State space $\mathcal{S}$

    All the possible configurations the environment can be in. At time $t$ the env “is” in some state $s_t \in \mathcal{S}$.

  2. Action space $\mathcal{A}$

    All the moves or controls the agent can choose. Agent picks $a_t \in \mathcal{A}$ when in state $s_t$.

  3. Transition dynamics $P(s_{t+1} \mid s_t, a_t)$

    A (possibly stochastic) rule for how the world updates when you take action $a_t$ in $s_t$. E.g. in CartPole, push left/right → new pole‑angle and cart‑position.

    For the purposes of describing the problem we write this out, but in fact if we are bothering with RL we do not actually know it; all we can do is sample from it. E.g. in CartPole, you don’t predict the next angle of the pole, but you can sample it by running the simulation.

    I mean, you can predict it in CartPole, but you are not required to, and totally black-box systems are amenable to RL.

  4. Reward function $R(s_t, a_t, s_{t+1})$

    The scalar “feedback” we get when you go from $s_t$ to $s_{t+1}$ via $a_t$. High reward = good; low (or negative) = bad.

    Once again, we do not know this function; we just sample from it after taking an action. E.g. in CartPole, you get a reward of +1 for every time step you keep the pole upright.

  5. Discount factor $\gamma \in [0, 1]$

    How much we prefer “now” rewards over “later” rewards: $\gamma = 0$ → we only care about immediate reward; $\gamma \to 1$ → we care a lot about the future.

  6. Initial state distribution $\rho_0(s)$

    Where episodes start: you sample $s_0 \sim \rho_0$. Often a fixed start state or a uniform/random choice.

    People often seem to suppress this in the problem statement; I don’t think it adds much complexity in practice.

  7. Horizon $T$ (or termination condition)

    Max length of an episode (or “done” flag). If infinite, we rely on $\gamma < 1$ to keep returns finite. This is often suppressed also.
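To pin the notation down, here is one tiny MDP written out in full. The two-state “battery” environment and all its numbers are invented for illustration; nothing about it is standard.

Toy MDP
import numpy as np

# States: 0 = "charged", 1 = "flat". Actions: 0 = "work", 1 = "recharge".
S, A = 2, 2
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = np.array([
    [[0.8, 0.2], [1.0, 0.0]],   # from "charged": working may drain us; recharging keeps us charged
    [[0.0, 1.0], [0.9, 0.1]],   # from "flat": working stays flat; recharging usually fixes it
])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([
    [1.0, 0.0],    # working while charged pays off; recharging pays nothing
    [-1.0, 0.0],   # trying to work while flat is penalized
])
gamma = 0.95                  # discount factor
rho0 = np.array([1.0, 0.0])   # episodes start charged
T = 100                       # horizon

assert np.allclose(P.sum(axis=-1), 1.0)  # each (s, a) slice is a distribution over s'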

The next few are definitions that we find convenient to make, dependent upon the above.

  1. Policy $\pi_\theta(a \mid s)$

    Our agent’s “strategy”: a mapping from state → distribution over actions. Parameterized by $\theta$ (e.g. the weights of a neural net).

  2. Trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T)$

    One full run‑through of states, actions, and rewards until done.

  3. Return $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$

    The total (discounted) sum of rewards from time $t$ onward; this is what we’re trying to maximize on average.

  4. Objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G_0] = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$.

We aim to pick $\theta$ to make expected return as large as possible.

Now if we knew $P$ and $R$ and their forms were tractable, we could just use dynamic programming to solve the MDP; let us assume we do not know them.
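Since all we can do is sample, the only primitive we actually have is the interaction loop. A minimal sketch, assuming the older Gym reset/step API (the same one the code later in this post uses) and a placeholder random policy:

Sampling from an unknown MDP
import gym

env = gym.make("CartPole-v1")

def policy(state):
    """Placeholder policy: ignore the state, pick a random action."""
    return env.action_space.sample()

state = env.reset()
done = False
trajectory = []  # (s_t, a_t, r_t) triples: every bit of P and R we ever get to see
while not done:
    action = policy(state)
    next_state, reward, done, info = env.step(action)  # one sample from P and R
    trajectory.append((state, action, reward))
    state = next_state

G0 = sum(reward for _, _, reward in trajectory)  # undiscounted return for this episode
print(f"Episode length: {len(trajectory)}, return: {G0:.0f}")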

7 Policy gradient

Figure 2

The most basic way of learning $\theta$ is to use REINFORCE, a.k.a. the score function gradient estimator.

We want to maximize the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$. Using the score function trick, we can write $\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$, so we get
$$\nabla_\theta J(\theta) = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau = \mathbb{E}_{\tau}\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right].$$
Since $\nabla_\theta \log \pi_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ (the dynamics terms do not depend on $\theta$), and since actions cannot affect rewards that came before them, it follows that
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right], \qquad G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k.$$

Monte Carlo approximation of the expectation is straightforward. In practice, we sample $N$ full episodes $\{\tau^{(i)}\}_{i=1}^{N}$ and approximate:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) G_t^{(i)}.$$

This yields the following gradient ascent update rule: $$\theta \leftarrow \theta + \alpha\, \frac{1}{N} \sum_{i,t} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) G_t^{(i)}.$$

I got an LLM to generate me an implementation of this in PyTorch:

Policy learning
import torch
import torch.nn as nn
import torch.optim as optim


def compute_returns(rewards, gamma):
    """
    Input: list of rewards [r0, r1, ..., r_{T-1}]
    Output: torch.Tensor of discounted returns [G0, G1, ..., G_{T-1}]
    """
    returns = []
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return torch.tensor(returns, dtype=torch.float32)


def select_action(policy_net, state):
    """
    Samples an action and returns (action, log_prob).
    Assumes policy_net(state) returns a PyTorch distribution.
    """
    dist = policy_net(state)
    a = dist.sample()
    return a, dist.log_prob(a)


def train_policy(policy_net, optimizer, env, num_episodes, gamma=0.99):
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        rewards = []

        # 1) Generate one full episode
        done = False
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32)
            action, logp = select_action(policy_net, state_tensor)

            next_state, r, done, _ = env.step(action.item())
            log_probs.append(logp)
            rewards.append(r)
            state = next_state

        # 2) Compute discounted returns G_t
        returns = compute_returns(rewards, gamma)  # shape [T]

        # 3) Policy loss: -(sum_t logπ(a_t|s_t) * G_t) / T
        log_probs = torch.stack(log_probs)  # shape [T]
        loss = -(log_probs * returns).mean()

        # 4) Gradient ascent step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # (Optional) logging
        total_return = sum(rewards)
        if episode % 10 == 0:
            print(f"Ep {episode:4d}  Return: {total_return:.2f}")


# Example setup
env = YourGymEnv()
policy_net = YourPolicyNet()  # returns a torch.distributions.Distribution
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)

train_policy(policy_net, optimizer, env, num_episodes=1000)

There you go! A minimum viable policy-gradient method. In practice we would be way more sophisticated than this.
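As one small example of “more sophisticated”: a common, cheap tweak (my addition, not part of the derivation above) is to standardize the returns within each episode before forming the loss, which usually reduces the variance of the gradient estimate. A sketch that would drop in as step 3 of train_policy above:

Return standardization
def standardized_policy_loss(log_probs, returns, eps=1e-8):
    """REINFORCE loss with per-episode return standardization.
    log_probs, returns: 1-D tensors of length T for one episode (T > 1)."""
    returns = (returns - returns.mean()) / (returns.std() + eps)
    return -(log_probs * returns).mean()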

7.1 Value function methods

We can decompose the sequential learning problem slightly differently, by learning a value function $Q(s, a)$. We can think of $Q(s, a)$ as “how good is it to take action $a$ in state $s$ (and then behave optimally thereafter)?” If we knew $Q(s, a)$, the optimal policy would be trivial: always pick $\arg\max_a Q(s, a)$.

This comes from a Bellman equation, which decomposes the return by working backwards to find the value of a decision now, given “more of the same” later. I recognise this from optimal control. The full return from $(s, a)$ can be decomposed into

  1. Immediate reward $r$ for taking $a$ in $s$.
  2. Best possible future from the next state $s'$.

Formally: $$Q(s, a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q(s', a') \,\middle|\, s, a \right].$$

We don’t need a model: we just need to sample $(s, a) \to (r, s')$.

When we take action $a_t$ in $s_t$ and observe $(r_t, s_{t+1})$, we form a “one-step bootstrap” target (or “Bellman backup”? Apparently people say that): $$\underbrace{r_t + \gamma \max_{a'} Q(s_{t+1}, a')}_{\text{“new” estimate}} \quad \text{vs.} \quad \underbrace{Q(s_t, a_t)}_{\text{“old” estimate}}.$$ The temporal difference (TD) error is $$\delta_t = \left[r_t + \gamma \max_{a'} Q(s_{t+1}, a')\right] - Q(s_t, a_t).$$ We then adjust $Q(s_t, a_t)$ toward the target by a fraction $\alpha$: $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\, \delta_t.$$

The learning rate $\alpha$ balances how much weight we give to new info vs. old. $\gamma$ is the discount factor we saw before.
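In the tabular case the whole update fits in a few lines. A minimal sketch, assuming discrete, hashable states and a small fixed action set:

Tabular TD update
from collections import defaultdict

n_actions = 2
Q = defaultdict(lambda: [0.0] * n_actions)  # Q[s][a], initialized to zero

def td_update(s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: nudge Q[s][a] toward the bootstrap target."""
    target = r if done else r + gamma * max(Q[s_next])
    delta = target - Q[s][a]      # the TD error delta_t
    Q[s][a] += alpha * delta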

In practice this doesn’t work on its own: we get spuriously confident about terrible actions we chose early on, so we use an exploration strategy. The simplest one that can possibly work is ε‑greedy: with probability $\varepsilon$, pick an action by some random method (“explore”); with probability $1-\varepsilon$, pick $\arg\max_a Q(s, a)$ (“exploit”). We typically decay $\varepsilon$ over time: start exploratory, then settle into purely greedy behaviour.

I hate that, but since I want some compact code I’ll use it for now. Some other exploration strategies are explored in bandit problems, including Bayesian ones that are to my mind more intuitive even if they are not quite as simple.

LLM-generated conversion of that word salad into PyTorch:

Q Learning
import torch
import torch.nn as nn
import torch.optim as optim
import random

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim)
        )
    def forward(self, state):
        # state: torch.Tensor of shape [batch, state_dim]
        return self.net(state)  # returns [batch, action_dim]

def select_action(q_net, state, eps, device):
    """
    ε‑greedy selection
    state: numpy array or list → torch.Tensor [1, state_dim]
    returns: action (int)
    """
    if random.random() < eps:
        return random.randrange(q_net.net[-1].out_features)
    state_v = torch.tensor([state], dtype=torch.float32, device=device)
    with torch.no_grad():
        q_vals = q_net(state_v)  # [1, A]
    return int(q_vals.argmax(dim=1).item())

def train_qlearning(
    env,
    q_net,
    optimizer,
    num_episodes,
    gamma=0.99,
    eps_start=1.0,
    eps_end=0.01,
    eps_decay=0.995,
    device="cpu"
):
    eps = eps_start
    for ep in range(1, num_episodes + 1):
        state = env.reset()
        done = False
        total_reward = 0.0

        while not done:
            # 1) Pick action
            a = select_action(q_net, state, eps, device)

            # 2) Step environment
            next_state, r, done, _ = env.step(a)
            total_reward += r

            # 3) Compute TD target
            state_v      = torch.tensor([state],      dtype=torch.float32, device=device)
            next_state_v = torch.tensor([next_state], dtype=torch.float32, device=device)
            q_vals       = q_net(state_v)            # [1, A]
            next_q_vals  = q_net(next_state_v)       # [1, A]
            q_sa         = q_vals[0, a]
            target       = r + (0.0 if done else gamma * next_q_vals.max().item())

            # 4) Compute loss & backprop
            loss = (q_sa - target) ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # 5) Move on
            state = next_state

        # 6) Decay epsilon
        eps = max(eps_end, eps * eps_decay)

        # Logging
        if ep % 10 == 0:
            print(f"Episode {ep:4d} | Return: {total_reward:.2f} | ε={eps:.3f}")

# Example usage:
import gym

env       = gym.make("CartPole-v1")
device    = torch.device("cuda" if torch.cuda.is_available() else "cpu")
q_net     = QNetwork(state_dim=env.observation_space.shape[0],
                     action_dim=env.action_space.n).to(device)
optimizer = optim.Adam(q_net.parameters(), lr=1e-3)

train_qlearning(env, q_net, optimizer, num_episodes=500)

7.2 Actor-critic

TBC

7.3 Regularization: PPO

TBC

7.4 Model-based RL

TBC

8 Incoming

9 References

Ajay, Du, Gupta, et al. 2023. “Is Conditional Generative Modeling All You Need for Decision-Making?” In.
Bensoussan, Li, Nguyen, et al. 2020. “Machine Learning and Control Theory.” arXiv:2006.05604 [Cs, Math, Stat].
Berrueta, Pinosky, and Murphey. 2024. “Maximum Diffusion Reinforcement Learning.” Nature Machine Intelligence.
Brockman, Cheung, Pettersson, et al. 2016. “OpenAI Gym.” arXiv:1606.01540 [Cs].
Clifton, and Laber. 2020. “Q-Learning: Theory and Applications.” Annual Review of Statistics and Its Application.
Dayan, and Watkins. n.d. “Reinforcement Learning.” In Encyclopedia of Cognitive Science.
Drori. 2022a. “Deep Reinforcement Learning.” In The Science of Deep Learning.
———. 2022b. “Reinforcement Learning.” In The Science of Deep Learning.
———. 2022c. The Science of Deep Learning.
Fellows, Mahajan, Rudner, et al. 2019. “VIREL: A Variational Inference Framework for Reinforcement Learning.” In Advances in Neural Information Processing Systems.
Jaakkola, Singh, and Jordan. 1995. “Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems.” In Advances in Neural Information Processing Systems.
Kaelbling, Littman, and Moore. 1996. “Reinforcement Learning: A Survey.” Journal of Artificial Intelligence Research.
Kochenderfer, Wheeler, and Wray. 2022. Algorithms for Decision Making.
Korbak, Perez, and Buckley. 2022. “RL with KL Penalties Is Better Viewed as Bayesian Inference.”
Krakovsky. 2016. “Reinforcement Renaissance.” Commun. ACM.
Krishnamurthy, Agarwal, and Langford. 2016. “Contextual-MDPs for PAC-Reinforcement Learning with Rich Observations.” arXiv:1602.02722 [Cs, Stat].
Lehman, Gordon, Jain, et al. 2022. “Evolution Through Large Models.”
Levine. 2018. “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.” arXiv:1805.00909 [Cs, Stat].
Mania, Guy, and Recht. 2018. “Simple Random Search Provides a Competitive Approach to Reinforcement Learning.” arXiv:1803.07055 [Cs, Math, Stat].
Mukherjee, and Liu. 2023. “Bridging Physics-Informed Neural Networks with Reinforcement Learning: Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO).”
Novelli, Pratticò, Pontil, et al. 2024. “Operator World Models for Reinforcement Learning.”
Parisotto, and Salakhutdinov. 2017. “Neural Map: Structured Memory for Deep Reinforcement Learning.” arXiv:1702.08360 [Cs].
Pfau, and Vinyals. 2016. “Connecting Generative Adversarial Networks and Actor-Critic Methods.” arXiv:1610.01945 [Cs, Stat].
Ramírez-Ruiz, Grytskyy, Mastrogiuseppe, et al. 2024. “Complex Behavior from Intrinsic Motivation to Occupy Future Action-State Path Space.” Nature Communications.
Ren, Zhang, Lee, et al. 2023. “Spectral Decomposition Representation for Reinforcement Learning.”
Ringstrom. 2022. “Reward Is Not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning.”
Salimans, Ho, Chen, et al. 2017. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning.” arXiv:1703.03864 [Cs, Stat].
Schulman, Wolski, Dhariwal, et al. 2017. “Proximal Policy Optimization Algorithms.”
Shibata, Yoshinaka, and Chikayama. 2006. “Probabilistic Generalization of Simple Grammars and Its Application to Reinforcement Learning.” In Algorithmic Learning Theory. Lecture Notes in Computer Science 4264.
Silver, Singh, Precup, et al. 2021. “Reward Is Enough.” Artificial Intelligence.
Sutton, Richard S., and Barto. 1998. Reinforcement Learning.
Sutton, Richard S., and Barto. 2018. Reinforcement Learning, second edition: An Introduction.
Sutton, Richard S., McAllester, Singh, et al. 2000. “Policy Gradient Methods for Reinforcement Learning with Function Approximation.” In Advances in Neural Information Processing Systems.
Thrun. 1992. “Efficient Exploration In Reinforcement Learning.”