Continual learning in neural nets

Also catastrophic forgetting, catastrophic interference, lifelong learning, …

2024-06-05 — 2026-06-22

Wherein the Challenge of Catastrophic Forgetting in Sequential Task Learning Is Examined, and Regularization, Replay, and Dynamic Architecture Approaches Are Surveyed Alongside a Bayesian Framing of the Problem.

algebra
graphical models
how do science
machine learning
networks
probability
statistics

Continual learning concerns designing algorithms that can update in the field, rather than being trained once and deployed statically. It asks: how can we keep neural networks learning without catastrophically overwriting what they already know?

1 Catastrophic Forgetting

Continual learning is notoriously tricky because of catastrophic forgetting: a network trained sequentially on multiple tasks loses performance on earlier tasks as it adapts to new ones. This was demonstrated in early connectionist models (McCloskey and Cohen 1989; French 1999).

Formally, suppose tasks \(T_1, T_2, \dots, T_n\) arrive sequentially. A naive learner updates weights \(\theta\) only with respect to the loss of the current task \(T_k\), where \(t\) indexes gradient steps and \(\eta\) is the learning rate:

\[ \theta_{t+1} = \theta_t - \eta \, \nabla_\theta \, L_{T_k}(\theta_t). \]

This can move parameters far from regions that previously minimized the losses \(L_{T_i}\), \(i < k\). The challenge is to approximate

\[ \min_{\theta} \; \sum_{i=1}^n w_i L_{T_i}(\theta), \]

for some task weights \(w_i \ge 0\), without having access to all past data simultaneously.
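To make the failure mode concrete, here is a minimal sketch on a made-up one-parameter problem. The quadratic per-task losses, the targets, and the function names are all my assumptions for illustration, not anything from the cited papers. Sequential gradient descent ends up at the optimum of the last task, while the joint objective with equal weights is minimized in between.

```python
def loss(theta, target):
    # Hypothetical per-task loss: L_T(theta) = 0.5 * (theta - target)^2
    return 0.5 * (theta - target) ** 2


def grad_step(theta, target, lr):
    # The gradient of the loss above is (theta - target).
    return theta - lr * (theta - target)


def train_sequentially(targets, lr=0.1, steps_per_task=200, theta=0.0):
    # Naive learner: each task is trained in turn, with no access to earlier ones.
    for target in targets:
        for _ in range(steps_per_task):
            theta = grad_step(theta, target, lr)
    return theta


targets = [-1.0, 2.0]                       # optimum of task 1, optimum of task 2
theta_seq = train_sequentially(targets)     # lands on the task-2 optimum
theta_joint = sum(targets) / len(targets)   # minimizer of the equally weighted sum

print("sequential:", theta_seq, "loss on task 1:", loss(theta_seq, targets[0]))
print("joint:     ", theta_joint, "loss on task 1:", loss(theta_joint, targets[0]))
```

In this toy setting the sequential learner forgets task 1 entirely; everything in the sections below can be read as a way of approximating the joint minimizer without revisiting all of the old data.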

2 This would be easy in Bayes if Bayes were easy

I have complicated feelings about catastrophic forgetting as a concept. It’s only a problem, I could argue, because plain gradient-descent updates are ill-conditioned: nothing regularizes them by the prior uncertainty about the weights, as a Bayesian neural net would. OTOH, Bayesian neural networks are intractable and Bayes is fake in practice, so this is a weak objection.

Anyway, the argument goes as follows: In Bayesian inference, catastrophic forgetting doesn’t arise because the posterior already contains (and continues to contain) the full information of the past. If \(p(\theta \mid D_{1:k-1})\) is the posterior after tasks \(1,\dots,k-1\), then the update after new data \(D_k\) is simply:

\[ p(\theta \mid D_{1:k}) \propto p(D_k \mid \theta) \, p(\theta \mid D_{1:k-1}). \]
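In the rare cases where the posterior is tractable, this recursion can be checked directly. The following is a minimal sketch for a conjugate toy model that I have chosen for illustration (a scalar \(\theta\) with Gaussian prior and Gaussian observations of known variance); the data values and function name are made up. Updating on \(D_1\) and then \(D_2\) gives the same posterior as updating on both at once, so nothing is forgotten.

```python
def gaussian_update(mean, var, data, noise_var=1.0):
    """Posterior over theta given prior N(mean, var) and y_i ~ N(theta, noise_var)."""
    precision = 1.0 / var + len(data) / noise_var
    new_var = 1.0 / precision
    new_mean = new_var * (mean / var + sum(data) / noise_var)
    return new_mean, new_var


prior = (0.0, 10.0)       # hypothetical prior mean and variance
d1 = [0.9, 1.1, 1.3]      # made-up "task 1" observations
d2 = [2.8, 3.1]           # made-up "task 2" observations

# Sequential: the posterior after d1 becomes the prior for d2.
post_seq = gaussian_update(*gaussian_update(*prior, d1), d2)
# Batch: all data at once.
post_batch = gaussian_update(*prior, d1 + d2)

print(post_seq, post_batch)
```

The catch, of course, is that this only works because the posterior stays in a two-number family; for a neural net there is no such finite summary.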

As far as I know, this is the gold-standard update rule for refining our estimates given new observations (assuming the datasets are conditionally independent given \(\theta\)). But exact Bayesian neural nets are generally intractable, so practical algorithms must approximate (Khan 2025; Opper and Winther 1999; Beal 2003).

And now we are back to the difficult problem of designing tractable approximations to Bayesian updates, which often means resorting to post-hoc rationales, i.e. just designing algorithms.

Several major families of continual learning methods have accordingly been developed. Let us have a squiz at some.

3 Regularization

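Elastic Weight Consolidation (EWC) (Kirkpatrick et al. 2017) slows updates along directions important to previous tasks by adding a quadratic penalty on movement away from the earlier solution, weighted by the (diagonal) Fisher information. The paper motivates this as a Laplace approximation to the previous posterior; to me it looks like a Bayes-by-backprop setup.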
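Concretely, the EWC objective adds to the new task loss a penalty of the form \(\frac{\lambda}{2} \sum_j F_j (\theta_j - \theta^*_j)^2\), where \(\theta^*\) is the solution after the previous task and \(F_j\) is the diagonal Fisher information. Here is a minimal sketch on a two-parameter toy problem. The quadratic new-task loss, the hand-picked Fisher values, and the function names are my assumptions for illustration; this is not the paper's experimental setup, and in practice \(F\) would be estimated from gradients on the old task's data.

```python
def ewc_grad(theta, new_target, theta_star, fisher, lam):
    """Gradient of 0.5*||theta - new_target||^2
    + 0.5*lam*sum_j fisher[j]*(theta[j] - theta_star[j])^2."""
    return [
        (t - b) + lam * f * (t - s)
        for t, b, s, f in zip(theta, new_target, theta_star, fisher)
    ]


def fit(theta, new_target, theta_star, fisher, lam, lr=0.05, steps=2000):
    # Plain gradient descent on the penalized objective.
    for _ in range(steps):
        g = ewc_grad(theta, new_target, theta_star, fisher, lam)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


theta_star = [1.0, 1.0]    # hypothetical solution of the old task
fisher = [10.0, 0.0]       # hypothetical: the old task only cares about coordinate 0
new_target = [3.0, 3.0]    # optimum of the (made-up) new task

theta_plain = fit(list(theta_star), new_target, theta_star, fisher, lam=0.0)
theta_ewc = fit(list(theta_star), new_target, theta_star, fisher, lam=1.0)

print("no penalty:", theta_plain)
print("EWC:       ", theta_ewc)
```

With the penalty switched on, the coordinate the old task is sensitive to barely moves, while the unconstrained coordinate is free to fit the new task.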

4 Replay

Replay methods rehearse past examples (real or generated). This goes back to Robins (1995) and is biologically inspired by hippocampal replay during sleep. But it looks a hell of a lot like re-injecting \(p(\theta \mid D_{1:k-1})\) like a half-arsed Bayes.
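The "real examples" variant needs some bounded memory of past data. The sketch below uses reservoir sampling to keep a uniform random subset of everything seen so far. Reservoir sampling is one common choice that I am swapping in for illustration, not something Robins (1995) prescribes, and the class name, capacity, and fake data are all made up.

```python
import random


class ReservoirBuffer:
    """Fixed-size memory holding a uniform random sample of all examples seen."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))


buffer = ReservoirBuffer(capacity=50)
for task_id in range(3):              # three made-up tasks arriving in sequence
    for i in range(1000):
        buffer.add((task_id, i))

# A training batch for a new task would mix fresh data with rehearsed examples.
new_examples = [("new", i) for i in range(8)]
mixed_batch = new_examples + buffer.sample(8)
print(mixed_batch)
```

Training on `mixed_batch` rather than `new_examples` alone is the whole trick: the gradient now includes a small-sample estimate of the old losses.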

5 Dynamic architectures

Dynamic architectures allocate new capacity as new tasks arrive (progressive nets, etc.). Schwarz et al. (2018) apparently combine regularization and distillation in their Progress & Compress framework.

6 Dynamic evaluation

I just learned about this sequence prediction hack from Gwern.

Dynamic evaluation or test-time finetuning is a performance-enhancing online machine learning technique where the ML model is trained further at runtime on ‘new’ data, eg. an RNN/Transformer is benchmarked on predicting text, but in addition to its prediction each timestep, it does an additional gradient descent on the newly-observed text. (It is analogous to short-term memory neural plasticity.)…

Dynamic evaluation is attractive because it requires no modifications to the architecture or training—it simply does more ‘training’, rather than leaving the weights frozen and relying on the hidden state (or self-attention) to do all learning, leading to greater consistency

There is a connection here to transformers and recurrent nets, and an anti-connection to quantization, which clearly clashes with the idea of updating weights at runtime.
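The predict-then-update loop is simple enough to sketch. The following toy uses a one-weight linear predictor and a made-up data stream whose true slope differs from the "pretrained" weight; the model, stream, and learning rate are my assumptions for illustration, not anything from Gwern's write-up or the cited papers. Setting the learning rate to zero recovers ordinary frozen-weight evaluation.

```python
def dynamic_eval(weight, stream, lr):
    """Score each prediction first, then take one SGD step on the example just seen."""
    total_sq_err = 0.0
    for x, y in stream:
        y_hat = weight * x
        total_sq_err += (y_hat - y) ** 2     # evaluated before the update
        # Gradient of 0.5 * (weight * x - y)^2 with respect to weight.
        weight -= lr * (y_hat - y) * x
    return total_sq_err / len(stream), weight


# Hypothetical test stream whose true slope (3.0) differs from the pretrained weight (1.0).
stream = [(x, 3.0 * x) for x in [1.0, -1.0, 0.5, 2.0, -1.5] * 20]

frozen_mse, _ = dynamic_eval(1.0, stream, lr=0.0)         # standard evaluation
adaptive_mse, w_final = dynamic_eval(1.0, stream, lr=0.1)  # dynamic evaluation

print("frozen MSE:  ", frozen_mse)
print("adaptive MSE:", adaptive_mse, "final weight:", w_final)
```

Note that each example is scored before the model trains on it, so the adaptive score is still a genuine prediction error rather than a training error.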

7 Evals as training / Stability gap

A point raised by Lange, Ven, and Tuytelaars (2023) is that evaluation protocols are weird in the continual setting. Standard practice evaluates only at task boundaries, which hides a “stability gap”: a sharp but transient drop in performance on earlier tasks shortly after training on a new task begins. That means the model may transiently underperform far more than checkpoint-level metrics suggest.

…I wrote that 10 months ago and I’m not sure exactly what my point was? Leaving this note here for now but this needs a sanity check.

They propose a continual evaluation framework with new metrics such as Worst‑Case Accuracy and Average Minimum Accuracy, which better capture the dynamic trajectory of learning and forgetting. This reflects a core concern that evaluation signals can themselves shape the optimization process.
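To see why a minimum-based metric tells a different story from boundary-only evaluation, here is a rough sketch in the spirit of those metrics. It is my paraphrase of the idea (average, over earlier tasks, of the lowest accuracy observed after that task finished training); the exact definitions should be checked against the paper, and the function names and accuracy numbers are made up.

```python
def min_accuracy_after(acc_per_step, end_step):
    """Lowest accuracy on a task over all evaluations after its training ended."""
    return min(acc_per_step[end_step + 1:])


def average_min_accuracy(acc_logs, end_steps):
    """Mean over earlier tasks of their post-training minimum accuracy."""
    minima = [min_accuracy_after(acc_logs[t], end_steps[t]) for t in end_steps]
    return sum(minima) / len(minima)


# Made-up per-step accuracies: task0 ends at step 0, task1 at step 4,
# and a third task is trained over steps 5 to 7.
acc_logs = {
    "task0": [0.95, 0.40, 0.55, 0.80, 0.90, 0.60, 0.85, 0.88],
    "task1": [0.10, 0.30, 0.60, 0.85, 0.92, 0.50, 0.75, 0.90],
}
end_steps = {"task0": 0, "task1": 4}

final_avg = sum(log[-1] for log in acc_logs.values()) / len(acc_logs)
avg_min = average_min_accuracy(acc_logs, end_steps)

print("final accuracy on earlier tasks:", final_avg)
print("average minimum accuracy:       ", avg_min)
```

In this fabricated trajectory the final checkpoint looks healthy, while the per-step minimum exposes the dips that occur right after each task switch.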

8 Incoming

9 References

Aleixo, Colonna, Cristo, et al. 2023. “Catastrophic Forgetting in Deep Learning: A Comprehensive Taxonomy.”
Beal. 2003. “Variational Algorithms for Approximate Bayesian Inference.”
Beaulieu, Frati, Miconi, et al. 2020. “Learning to Continually Learn.”
Daheim, Möllenhoff, Ponti, et al. 2024. “Model Merging by Uncertainty-Based Gradient Matching.”
De Lange, Aljundi, Masana, et al. 2021. “A Continual Learning Survey: Defying Forgetting in Classification Tasks.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
French. 1999. “Catastrophic Forgetting in Connectionist Networks.” Trends in Cognitive Sciences.
Gers, Schmidhuber, and Cummins. 2000. “Learning to Forget: Continual Prediction with LSTM.” Neural Computation.
Golden, Delanois, Sanda, et al. 2022. “Sleep Prevents Catastrophic Forgetting in Spiking Neural Networks by Forming a Joint Synaptic Weight Representation.” PLOS Computational Biology.
Hoffman, Blei, Wang, et al. 2013. “Stochastic Variational Inference.” arXiv:1206.7051 [Cs, Stat].
Jiang, Shu, Wang, et al. 2022. “Transferability in Deep Learning: A Survey.”
Khan. 2025. “Knowledge Adaptation as Posterior Correction.”
Khetarpal, Riemer, Rish, et al. 2022. “Towards Continual Reinforcement Learning: A Review and Perspectives.” Journal of Artificial Intelligence Research.
Kirkpatrick, Pascanu, Rabinowitz, et al. 2017. “Overcoming Catastrophic Forgetting in Neural Networks.” Proceedings of the National Academy of Sciences.
Lange, Ven, and Tuytelaars. 2023. “Continual Evaluation for Lifelong Learning: Identifying the Stability Gap.”
McCloskey and Cohen. 1989. “Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem.” Edited by Gordon H. Bower. Psychology of Learning and Motivation.
Mikolov, Karafiát, Burget, et al. 2010. “Recurrent Neural Network Based Language Model.” In Eleventh Annual Conference of the International Speech Communication Association.
Moreno-Muñoz, Artés-Rodríguez, and Álvarez. 2019. “Continual Multi-Task Gaussian Processes.” arXiv:1911.00002 [Cs, Stat].
Nguyen, Low, and Jaillet. 2020. “Variational Bayesian Unlearning.” In Advances in Neural Information Processing Systems.
Opper and Winther. 1999. “A Bayesian Approach to on-Line Learning.” In On-Line Learning in Neural Networks. Publications of the Newton Institute.
———. 2000. “Mean Field Approximations for Bayesian Classification with Gaussian Processes.” In Advances in Neural Information Processing Systems.
———. 2001. “Adaptive and Self-Averaging Thouless-Anderson-Palmer Mean-Field Theory for Probabilistic Modeling.” Physical Review E.
Pan, Swaroop, Immer, et al. 2021. “Continual Deep Learning by Functional Regularisation of Memorable Past.”
Papamarkou, Skoularidou, Palla, et al. 2024. “Position Paper: Bayesian Deep Learning in the Age of Large-Scale AI.”
Rannen-Triki, Bornschein, Pascanu, et al. 2024. “Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models.”
Robins. 1995. “Catastrophic Forgetting, Rehearsal and Pseudorehearsal.” Connection Science.
Sato. 2001. “Online Model Selection Based on the Variational Bayes.” Neural Computation.
Schirmer, Zhang, and Nalisnick. 2024. “Test-Time Adaptation with State-Space Models.”
Schwarz, Czarnecki, Luketina, et al. 2018. “Progress & Compress: A Scalable Framework for Continual Learning.” In Proceedings of the 35th International Conference on Machine Learning.
Stickgold. 2005. “Sleep-Dependent Memory Consolidation.” Nature.
Williams and Zipser. 1989. “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks.” Neural Computation.