Continual learning in neural nets
Also catastrophic forgetting, catastrophic interference, lifelong learning, …
2024-06-05 — 2025-08-23
Continual learning is the study of algorithms that can keep updating after deployment, rather than being trained once and then frozen. It asks: how can neural networks keep learning without catastrophically overwriting what they already know?
Notoriously tricky because of catastrophic forgetting.
1 Catastrophic Forgetting
The classic problem is catastrophic forgetting: a network trained sequentially on multiple tasks loses performance on earlier tasks as it adapts to new ones. This was demonstrated in early connectionist models (McCloskey and Cohen 1989; French 1999). Robins (1995) proposed one of the earliest fixes: rehearsal (or replay) of old data to anchor memory.
Formally, suppose tasks \(T_1, T_2, \dots, T_n\) arrive sequentially. A naive learner updates weights \(\theta\) using only the loss of the current task \(T_k\):
\[ \theta_{t+1} = \theta_t - \eta \, \nabla_\theta \, L_{T_k}(\theta_t), \]
which may move parameters far from regions that minimized earlier losses \(L_{T_i}\). The challenge is to approximate
\[ \min_{\theta} \; \sum_{i=1}^n w_i L_{T_i}(\theta), \]
without having access to all past data simultaneously.
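To make the failure mode concrete, here is a minimal PyTorch sketch of that naive update loop on two synthetic regression tasks; the tasks, model, and hyperparameters are illustrative placeholders, not from any particular benchmark.

```python
import torch
from torch import nn

# Minimal sketch of the naive sequential update: each task is optimized with
# only its own loss, so nothing anchors the weights to earlier tasks.
# The two synthetic regression tasks are invented for illustration.

def make_task_data(shift, n=256, d=8):
    X = torch.randn(n, d)
    w_true = torch.randn(d, 1) + shift       # each task has a different target function
    return X, X @ w_true

model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

tasks = [make_task_data(shift) for shift in (0.0, 3.0)]

for k, (X, y) in enumerate(tasks):
    for _ in range(200):                      # theta <- theta - eta * grad L_{T_k}(theta)
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    # After finishing task k, earlier tasks typically degrade:
    for i, (Xi, yi) in enumerate(tasks[: k + 1]):
        print(f"after task {k}: loss on task {i} = {loss_fn(model(Xi), yi).item():.3f}")
```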
2 This would be easy in Bayes if Bayes were easy
In Bayesian inference, catastrophic forgetting does not arise in principle: the posterior already contains the full information of the past. If \(p(\theta \mid D_{1:k-1})\) is the posterior after tasks \(1,\dots,k-1\), then the update after new data \(D_k\) is simply
\[ p(\theta \mid D_{1:k}) \propto p(D_k \mid \theta) \, p(\theta \mid D_{1:k-1}). \]
This is the gold standard update rule as far as I know. Bayes already tells us an ideal update rule to refine our estimates given new observations. But exact Bayesian neural nets are generally intractable, so practical algorithms must approximate (Khan 2025; Opper and Winther 1999; Beal 2003).
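As a toy illustration of that exact recursion, here is a sketch in a conjugate Gaussian model (unknown mean, known noise variance), where the posterior-as-prior update has a closed form and nothing is forgotten; the data and prior values are invented for illustration.

```python
import numpy as np

# Exact sequential Bayes in a conjugate Gaussian model: the posterior after
# D_{1:k-1} is reused as the prior for D_k.

rng = np.random.default_rng(0)
sigma2 = 1.0                      # known observation noise variance
mu, tau2 = 0.0, 10.0              # prior N(mu, tau2) over the unknown mean theta

batches = [rng.normal(2.0, 1.0, size=50) for _ in range(3)]   # "tasks" D_1, D_2, D_3

for k, D in enumerate(batches, start=1):
    # p(theta | D_{1:k}) ∝ p(D_k | theta) p(theta | D_{1:k-1})
    post_prec = 1.0 / tau2 + len(D) / sigma2
    mu = (mu / tau2 + D.sum() / sigma2) / post_prec
    tau2 = 1.0 / post_prec
    print(f"after D_{k}: posterior mean {mu:.3f}, variance {tau2:.4f}")
```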
3 Algorithmic Strategies
Several major families of continual learning methods have been developed:
Regularization-based methods constrain weight drift. Elastic Weight Consolidation (EWC) (Kirkpatrick et al. 2017) slows updates along directions important to previous tasks, using the Fisher information matrix as a quadratic penalty; a minimal sketch of this penalty follows this list. This looks like a Bayes-by-backprop setup to me.
Replay methods rehearse past examples (real or generated); a toy replay buffer is also sketched after this list. This goes back to Robins (1995), and is biologically inspired by hippocampal replay during sleep.
Dynamic architectures allocate new capacity as new tasks arrive (progressive nets, etc.). Schwarz et al. (2018) apparently combine regularization and distillation in their Progress & Compress framework.
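Here is a minimal sketch of an EWC-style diagonal-Fisher penalty, assuming a toy classifier and an illustrative penalty weight; it is not the authors' implementation, just the quadratic-penalty idea.

```python
import torch
from torch import nn
import torch.nn.functional as F

# EWC-style regularization: estimate a diagonal Fisher on the old task, then
# penalize drift from the old parameters, weighted by that Fisher estimate.
# Model, data, and lambda are placeholders.

def diagonal_fisher(model, X_old, y_old):
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in zip(X_old, y_old):
        model.zero_grad()
        log_probs = F.log_softmax(model(x.unsqueeze(0)), dim=-1)
        log_probs[0, y].backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
    return {n: f / len(X_old) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam=100.0):
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return lam / 2 * penalty

model = nn.Linear(10, 3)
X_old, y_old = torch.randn(64, 10), torch.randint(0, 3, (64,))
fisher = diagonal_fisher(model, X_old, y_old)
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}

X_new, y_new = torch.randn(64, 10), torch.randint(0, 3, (64,))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = F.cross_entropy(model(X_new), y_new) + ewc_penalty(model, fisher, old_params)
    loss.backward()
    opt.step()
```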
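And a correspondingly minimal replay sketch: a small reservoir-sampled buffer of past examples mixed into each new-task batch. The buffer size and the way old and new losses are combined are illustrative choices, not a specific published recipe.

```python
import random
import torch
from torch import nn
import torch.nn.functional as F

# Experience replay: keep a bounded buffer of past examples and rehearse them
# alongside the current task's data.

class ReplayBuffer:
    def __init__(self, capacity=500):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, x, y):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:                                   # reservoir sampling keeps a uniform sample
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = (x, y)

    def sample(self, k):
        batch = random.sample(self.data, min(k, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)

model = nn.Linear(10, 3)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
buffer = ReplayBuffer()

# Fill the buffer while training on the "old" task (training loop omitted).
X_old, y_old = torch.randn(64, 10), torch.randint(0, 3, (64,))
for x, y in zip(X_old, y_old):
    buffer.add(x, y)

# On the new task, mix replayed old examples into every update.
X_new, y_new = torch.randn(64, 10), torch.randint(0, 3, (64,))
for _ in range(50):
    x_re, y_re = buffer.sample(32)
    opt.zero_grad()
    loss = F.cross_entropy(model(X_new), y_new) + F.cross_entropy(model(x_re), y_re)
    loss.backward()
    opt.step()
```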
4 Evals as training / Stability gap
A point raised by Lange, Ven, and Tuytelaars (2023) is that evaluation protocols are weird in the continual setting. Standard practice evaluates only at task boundaries, which hides a stability gap: a sharp mid-training dip in performance. This means the model may transiently underperform much more than checkpoint-level metrics suggest.
They propose a continual evaluation framework with new metrics such as Worst‑Case Accuracy and Average Minimum Accuracy, which better capture the dynamic trajectory of learning and forgetting. This reflects the broader principle that evaluation signals can themselves shape the optimization process.
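A rough sketch of what per-step (continual) evaluation buys over boundary-only evaluation. The metric definitions below are simplified stand-ins, not the exact formulas from Lange, Ven, and Tuytelaars (2023), and the accuracy trace is a random placeholder rather than a real training run.

```python
import numpy as np

# acc[t, i] = accuracy on task i measured at training step t (placeholder data).
rng = np.random.default_rng(0)
steps_per_task, n_tasks = 100, 3
acc = rng.uniform(0.5, 1.0, size=(steps_per_task * n_tasks, n_tasks))

# Boundary-only evaluation: look at accuracies only when each task finishes,
# which can hide transient mid-training dips (the stability gap).
boundaries = [steps_per_task * (k + 1) - 1 for k in range(n_tasks)]
boundary_acc = acc[boundaries]

# Continual evaluation: track each learned task's accuracy at every later step.
min_acc = [acc[steps_per_task * (i + 1):, i].min() for i in range(n_tasks - 1)]
avg_min_acc = float(np.mean(min_acc))   # "Average Minimum Accuracy"-style summary
worst_case = float(np.min(min_acc))     # "Worst-Case Accuracy"-style summary

print("boundary accuracies:\n", boundary_acc)
print("per-task minima after learning:", min_acc)
print("average minimum accuracy:", avg_min_acc, "worst case:", worst_case)
```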