Continual learning in neural nets
Also catastrophic forgetting, catastrophic interference, lifelong learning, …
2024-06-05 — 2026-06-22
Wherein the Challenge of Catastrophic Forgetting in Sequential Task Learning Is Examined, and Regularization, Replay, and Dynamic Architecture Approaches Are Surveyed Alongside a Bayesian Framing of the Problem.
Continual learning concerns designing algorithms that can update in the field, rather than being trained once and deployed statically. It asks: how can we keep neural networks learning without catastrophically overwriting what they already know?
1 Catastrophic Forgetting
This is notoriously tricky because of catastrophic forgetting. A network trained sequentially on multiple tasks loses performance on earlier tasks as it adapts to new ones. This was demonstrated in early connectionist models (McCloskey and Cohen 1989; French 1999).
Formally, suppose tasks \(T_1, T_2, \dots, T_n\) arrive sequentially. While training on task \(T_k\), a naive learner updates the weights \(\theta\) with step size \(\eta\), using only the loss of that most recent task:
\[ \theta_{t+1} = \theta_t - \eta \, \nabla_\theta \, L_{T_k}(\theta_t). \]
This can move the parameters far from regions that previously minimized the earlier losses \(L_{T_i}\), \(i < k\). The challenge is to approximate
\[ \min_{\theta} \; \sum_{i=1}^n w_i L_{T_i}(\theta), \]
for some task weights \(w_i\), without having access to all past data simultaneously.
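To make the failure mode concrete, here is a minimal sketch, assuming two toy quadratic "task" losses that I made up for illustration (they are not from any of the cited papers). Task A is sharply curved in the first weight and nearly flat in the second; task B is the reverse. The names `loss_a`, `grad_b`, `gradient_descent` and the step size are all hypothetical.

```python
# Two toy quadratic task losses over theta = [theta0, theta1].
# Everything here (losses, step size, step counts) is an illustrative assumption.

def loss_a(theta):
    return 0.5 * (10.0 * (theta[0] - 1.0) ** 2 + 0.1 * theta[1] ** 2)

def grad_a(theta):
    return [10.0 * (theta[0] - 1.0), 0.1 * theta[1]]

def loss_b(theta):
    return 0.5 * (0.1 * (theta[0] + 1.0) ** 2 + 10.0 * (theta[1] - 1.0) ** 2)

def grad_b(theta):
    return [0.1 * (theta[0] + 1.0), 10.0 * (theta[1] - 1.0)]

def grad_joint(theta):
    # Gradient of the equally weighted sum L_A + L_B (w_i = 1).
    return [ga + gb for ga, gb in zip(grad_a(theta), grad_b(theta))]

def gradient_descent(grad, theta, lr=0.05, steps=2000):
    """Plain gradient descent: theta <- theta - lr * grad(theta)."""
    theta = list(theta)
    for _ in range(steps):
        theta = [t - lr * g for t, g in zip(theta, grad(theta))]
    return theta

# Naive sequential learner: train on A, then on B using only B's loss.
theta_after_a = gradient_descent(grad_a, [0.0, 0.0])
theta_naive = gradient_descent(grad_b, theta_after_a)

# Reference learner that can see both losses at once.
theta_joint = gradient_descent(grad_joint, [0.0, 0.0])

print("task A loss after training on A:     ", loss_a(theta_after_a))
print("task A loss after then training on B:", loss_a(theta_naive))
print("task A loss at the joint minimiser:  ", loss_a(theta_joint))
```

The naive learner ends up with a far larger task A loss than the joint minimiser does, even though task B barely cares about the first weight. It just keeps drifting along B's flat direction because nothing tells it that direction mattered to A.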
2 This would be easy in Bayes if Bayes were easy
I have complicated feelings about catastrophic forgetting as a concept. It’s only a problem, I could argue, because gradient descent updates carry no notion of how uncertain we are about each weight, so nothing regularizes them towards what was already learned, the way the prior would in a Bayesian neural net. OTOH, Bayesian neural networks are intractable and Bayes is fake in practice, so this is a weak objection.
Anyway, the argument goes as follows: In Bayesian inference, catastrophic forgetting doesn’t arise because the posterior already contains (and continues to contain) the full information of the past. If \(p(\theta \mid D_{1:k-1})\) is the posterior after tasks \(1,\dots,k-1\), then the update after new data \(D_k\) is simply:
\[ p(\theta \mid D_{1:k}) \propto p(D_k \mid \theta) \, p(\theta \mid D_{1:k-1}). \]
This is the gold-standard update rule, as far as I know: Bayes already tells us the ideal way to refine our estimates given new observations. But exact posteriors for neural nets are generally intractable, so practical algorithms must approximate (Khan 2025; Opper and Winther 1999; Beal 2003).
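In the rare cases where Bayes is easy, the recursive update can be checked by hand. A minimal sketch, assuming a conjugate toy model of my own choosing: an unknown scalar \(\theta\) with a Gaussian prior and observations \(y_i \sim \mathcal{N}(\theta, \sigma^2)\) with known noise variance. The function name `gaussian_update` and the data are made up for illustration.

```python
def gaussian_update(prior_mean, prior_var, data, noise_var):
    """Posterior N(mean, var) over theta, given prior N(prior_mean, prior_var)
    and observations y_i ~ N(theta, noise_var)."""
    post_precision = 1.0 / prior_var + len(data) / noise_var
    post_mean = (prior_mean / prior_var + sum(data) / noise_var) / post_precision
    return post_mean, 1.0 / post_precision

d1 = [0.9, 1.1, 1.3]   # "task 1" data (illustrative)
d2 = [2.0, 1.8]        # "task 2" data (illustrative)

# Sequential: the posterior after D_1 becomes the prior for D_2.
m1, v1 = gaussian_update(0.0, 10.0, d1, noise_var=1.0)
m_seq, v_seq = gaussian_update(m1, v1, d2, noise_var=1.0)

# Batch: condition on D_1 and D_2 together.
m_all, v_all = gaussian_update(0.0, 10.0, d1 + d2, noise_var=1.0)

print(m_seq, v_seq)
print(m_all, v_all)
```

The sequential and batch posteriors agree up to floating-point error, and the first data set is never revisited. That is the sense in which forgetting does not arise here: everything \(D_1\) had to say is already carried by the mean and variance passed forward.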
And now we are back to the difficult problem of designing tractable approximations to Bayesian updates, which often means resorting to post-hoc rationales, i.e. just designing algorithms.
Several major families of continual learning methods have accordingly been developed. Let us have a squiz at some.
3 Regularization
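Elastic Weight Consolidation (EWC) (Kirkpatrick et al. 2017) slows updates along directions important to previous tasks, by adding a quadratic penalty weighted by the diagonal of the Fisher information matrix. This looks like a Bayes-by-backprop setup to me, although Kirkpatrick et al. themselves motivate it as a Laplace approximation to the posterior from the earlier tasks.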
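The EWC objective for a new task B after an old task A is \(L_B(\theta) + \tfrac{\lambda}{2} \sum_i F_i (\theta_i - \theta^*_{A,i})^2\), where \(\theta^*_A\) are the weights after task A and \(F_i\) is the diagonal Fisher information. Here is a sketch on a toy problem, under loud assumptions: the two quadratic losses are invented for illustration, and `fisher_a` is simply set to the curvature of the task A loss (which is what the Fisher information equals if that quadratic is read as a unit-noise Gaussian negative log-likelihood). In a real network the diagonal Fisher would be estimated from squared log-likelihood gradients on task A data instead.

```python
# Toy setup, all assumed for illustration.
def loss_a(theta):
    return 0.5 * (10.0 * (theta[0] - 1.0) ** 2 + 0.1 * theta[1] ** 2)

def grad_b(theta):
    # Gradient of L_B = 0.5 * (0.1 * (theta0 + 1)^2 + 10 * (theta1 - 1)^2)
    return [0.1 * (theta[0] + 1.0), 10.0 * (theta[1] - 1.0)]

theta_star_a = [1.0, 0.0]   # weights after task A (its exact minimiser here)
fisher_a = [10.0, 0.1]      # assumed diagonal Fisher: the curvature of L_A

def ewc_grad(theta, lam):
    """Gradient of L_B(theta) + (lam / 2) * sum_i F_i * (theta_i - theta*_A,i)^2."""
    return [
        gb + lam * f * (t - ts)
        for gb, f, t, ts in zip(grad_b(theta), fisher_a, theta, theta_star_a)
    ]

def descend(grad, theta, lr=0.05, steps=2000):
    """Plain gradient descent."""
    theta = list(theta)
    for _ in range(steps):
        theta = [t - lr * g for t, g in zip(theta, grad(theta))]
    return theta

theta_naive = descend(grad_b, theta_star_a)                         # no penalty
theta_ewc = descend(lambda th: ewc_grad(th, lam=1.0), theta_star_a)  # EWC penalty

print("task A loss, naive fine-tuning on B:", loss_a(theta_naive))
print("task A loss, EWC with lam = 1:      ", loss_a(theta_ewc))
```

The penalty is stiff along the first weight, where task A is sharply curved, and slack along the second, so task B can still be learned. In this toy the penalty with \(\lambda = 1\) reproduces \(L_A\) exactly, so EWC lands on the joint minimiser; that is a property of the quadratic example, not something to expect from a neural net.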
4 Replay
Replay methods rehearse past examples (real or generated) alongside the new data. This goes back to Robins (1995) and is biologically inspired by hippocampal replay during sleep. But it looks a hell of a lot like half-arsed Bayes: re-injecting \(p(\theta \mid D_{1:k-1})\) by way of the old data.
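A minimal sketch of the bookkeeping for replay with real examples, assuming a fixed-size memory filled by reservoir sampling. Reservoir sampling is my choice for illustration, not something the cited papers prescribe, and the class name `ReservoirReplayBuffer` is hypothetical.

```python
import random

class ReservoirReplayBuffer:
    """Fixed-size memory in which every example seen so far is equally
    likely to be retained (reservoir sampling)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / n_seen,
            # overwriting a uniformly chosen slot.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k stored examples, without replacement, for rehearsal."""
        return self.rng.sample(self.items, min(k, len(self.items)))

# Stream two tasks through the buffer; examples are (task_id, index) placeholders.
buffer = ReservoirReplayBuffer(capacity=50, seed=0)
for task_id in (1, 2):
    for i in range(500):
        buffer.add((task_id, i))

# During training on a new task, each minibatch would be the new examples
# plus a few rehearsed ones drawn like this.
rehearsal = buffer.sample(8)
print(len(buffer.items), buffer.n_seen, rehearsal)
```

The memory stays at 50 items however long the stream runs, and old tasks keep a share of it roughly proportional to how much of the stream they made up.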
5 Dynamic architectures
Dynamic architectures allocate new capacity as new tasks arrive (progressive nets, etc.). Schwarz et al. (2018) apparently combine regularization and distillation in their Progress & Compress framework.
6 Dynamic evaluation
I just learned about this sequence prediction hack from Gwern.
Dynamic evaluation or test-time finetuning is a performance-enhancing online machine learning technique where the ML model is trained further at runtime on ‘new’ data, eg. an RNN/Transformer is benchmarked on predicting text, but in addition to its prediction each timestep, it does an additional gradient descent on the newly-observed text. (It is analogous to short-term memory neural plasticity.)…
Dynamic evaluation is attractive because it requires no modifications to the architecture or training—it simply does more ‘training’, rather than leaving the weights frozen and relying on the hidden state (or self-attention) to do all learning, leading to greater consistency
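This also connects to transformers and recurrent nets, where the hidden state or attention would otherwise have to do all the test-time adaptation. There is an anti-connection to quantization, though, which clearly clashes with this idea: weights quantized for cheap inference are not in a form that takes small gradient steps.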
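The mechanics fit in a few lines. A sketch, assuming a one-parameter predictor \(\hat{y} = w x\) and a test stream whose true slope differs from the "pretrained" weight; the function `evaluate`, the stream, and the learning rate are all made up for illustration and stand in for a language model scoring text.

```python
def evaluate(stream, w, lr=0.0):
    """Score y_hat = w * x on each (x, y) in order.
    With lr > 0, take one SGD step on each example right after scoring it
    (dynamic evaluation); with lr = 0 the weight stays frozen."""
    total = 0.0
    for x, y in stream:
        err = w * x - y
        total += err ** 2          # score with the weight as it was
        w -= lr * 2.0 * err * x    # then adapt on the newly observed example
    return total / len(stream), w

# "Pretrained" weight is 1.0, but the test stream follows y = 2 * x.
xs = [1.0, -0.5, 0.8, 1.2] * 25
stream = [(x, 2.0 * x) for x in xs]

frozen_mse, w_frozen = evaluate(stream, w=1.0, lr=0.0)
dynamic_mse, w_dynamic = evaluate(stream, w=1.0, lr=0.1)

print("frozen  MSE:", frozen_mse, "final w:", w_frozen)
print("dynamic MSE:", dynamic_mse, "final w:", w_dynamic)
```

Each example is scored before the model trains on it, so the comparison with the frozen model stays honest. The adaptive run pays the full error on the first few examples and very little after that.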
7 Evals as training / Stability gap
A point raised by Lange, Ven, and Tuytelaars (2023) is that evaluation protocols are weird in the continual setting. Standard practice evaluates only at task boundaries, which hides a “stability gap”: a sharp dip in performance on earlier tasks right after training switches to a new task, often followed by recovery. That means the model may transiently underperform far more than checkpoint-level metrics suggest.
…I wrote that 10 months ago and I’m not sure exactly what my point was? Leaving this note here for now but this needs a sanity check.
They propose a continual evaluation framework with new metrics such as Worst‑Case Accuracy and Average Minimum Accuracy, which better capture the dynamic trajectory of learning and forgetting. This reflects a core concern that evaluation signals can themselves shape the optimization process.
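Here is how I read those two metrics, as a sketch rather than a faithful reimplementation; check the paper before relying on the exact definitions. Assumptions: `acc_log[n][i]` is the accuracy on task `i`'s evaluation set after training iteration `n`, `task_end[i]` is the last iteration of training on task `i`, tasks are 0-indexed, and the numbers in the log are invented.

```python
def min_acc(acc_log, task_end, k, t):
    """Average, over earlier tasks i < k, of the lowest accuracy on task i
    observed at any iteration n with task_end[i] < n <= t. Requires k >= 1."""
    mins = [
        min(acc_log[n][i] for n in range(task_end[i] + 1, t + 1))
        for i in range(k)
    ]
    return sum(mins) / len(mins)

def wc_acc(acc_log, task_end, k, t):
    """Worst-case accuracy at iteration t while on task k: current-task accuracy
    traded off against min_acc, weighted by the number of tasks seen."""
    n_tasks = k + 1
    return acc_log[t][k] / n_tasks + (1 - 1 / n_tasks) * min_acc(acc_log, task_end, k, t)

# Invented log: task 0 is trained on iterations 0-4, task 1 on iterations 5-9.
acc_log = [
    [0.20, 0.10], [0.50, 0.10], [0.80, 0.10], [0.90, 0.10], [0.95, 0.10],
    [0.40, 0.30], [0.60, 0.60], [0.80, 0.80], [0.90, 0.85], [0.93, 0.90],
]
task_end = [4, 9]

boundary_acc_task0 = acc_log[task_end[1]][0]   # what a task-boundary eval sees
print("task 0 accuracy at the task boundary:", boundary_acc_task0)
print("min-ACC:", min_acc(acc_log, task_end, k=1, t=9))
print("WC-ACC: ", wc_acc(acc_log, task_end, k=1, t=9))
```

In this made-up log the boundary evaluation reports 0.93 on the first task, while the per-iteration minimum exposes the drop to 0.40 just after the switch.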
