Continual learning in neural nets

Also catastrophic forgetting, catastrophic interference, lifelong learning, …

2024-06-05 — 2025-08-23

algebra
graphical models
how do science
machine learning
networks
probability
statistics

Continual learning is the field of designing algorithms that can update in the field, rather than being trained once and deployed statically. It asks: how can neural networks keep learning without catastrophically overwriting what they already know?

Notoriously tricky because of catastrophic forgetting.

1 Catastrophic Forgetting

The classic problem is catastrophic forgetting: a network trained sequentially on multiple tasks loses performance on earlier tasks as it adapts to new ones. This was demonstrated in early connectionist models (McCloskey and Cohen 1989; French 1999). Robins (1995) proposed one of the earliest fixes: rehearsal (or replay) of old data to anchor memory.

Formally, suppose tasks \(T_1, T_2, \dots, T_n\) arrive sequentially. A naive learner updates the weights \(\theta\) only with respect to the loss of the current task \(T_k\):

\[ \theta_{t+1} = \theta_t - \eta \, \nabla_\theta \, L_{T_k}(\theta_t), \]

which may move parameters far from regions that minimized earlier losses \(L_{T_i}\). The challenge is to approximate

\[ \min_{\theta} \; \sum_{i=1}^n w_i L_{T_i}(\theta), \]

without having access to all past data simultaneously.
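
As a toy illustration of the drift (my own, not from any of the cited papers), take two quadratic stand-ins for task losses: sequential gradient descent ends up at the second task's optimum, while the joint objective above finds a compromise.

```python
import numpy as np

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # optima of the two toy task losses

def grad_L1(theta):          # task 1 loss: ||theta - a||^2
    return 2 * (theta - a)

def grad_L2(theta):          # task 2 loss: ||theta - b||^2
    return 2 * (theta - b)

eta = 0.1

theta = np.zeros(2)
for _ in range(200):         # train on task 1 only ...
    theta -= eta * grad_L1(theta)
for _ in range(200):         # ... then on task 2 only
    theta -= eta * grad_L2(theta)
print(theta)                 # ~[0, 1]: task 1's solution has been forgotten

theta = np.zeros(2)
for _ in range(200):         # joint objective with equal weights w_i
    theta -= eta * (grad_L1(theta) + grad_L2(theta))
print(theta)                 # ~[0.5, 0.5]: a compromise between both tasks
```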

2 This would be easy in Bayes if Bayes were easy

In Bayesian inference, catastrophic forgetting does not arise in principle: the posterior already contains the full information of the past. If \(p(\theta \mid D_{1:k-1})\) is the posterior after tasks \(1,\dots,k-1\), then the update after new data \(D_k\) is simply

\[ p(\theta \mid D_{1:k}) \propto p(D_k \mid \theta) \, p(\theta \mid D_{1:k-1}). \]

As far as I know, this is the gold-standard update rule: Bayes already tells us the ideal way to refine our estimates given new observations. But exact Bayesian neural nets are generally intractable, so practical algorithms must approximate (Khan 2025; Opper and Winther 1999; Beal 2003).
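
A conjugate toy model makes the point concrete (a sketch of my own, not from the cited papers): for a Gaussian mean with known observation noise, updating the posterior task by task gives exactly the same answer as conditioning on all the data at once, so nothing is forgotten.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                         # known observation noise variance
mu, tau2 = 0.0, 10.0                 # prior N(mu, tau2) over the unknown mean

def update(mu, tau2, data, sigma2):
    """Conjugate Gaussian update: posterior over the mean given a batch of data."""
    n = len(data)
    tau2_post = 1.0 / (1.0 / tau2 + n / sigma2)
    mu_post = tau2_post * (mu / tau2 + data.sum() / sigma2)
    return mu_post, tau2_post

D1 = rng.normal(2.0, 1.0, size=50)   # "task 1" data
D2 = rng.normal(2.0, 1.0, size=50)   # "task 2" data

# Sequential updates: yesterday's posterior is today's prior.
m, v = update(mu, tau2, D1, sigma2)
m, v = update(m, v, D2, sigma2)

# Batch update on all data at once.
m_all, v_all = update(mu, tau2, np.concatenate([D1, D2]), sigma2)

print(np.allclose([m, v], [m_all, v_all]))   # True: no forgetting in exact Bayes
```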

3 Algorithmic Strategies

Several major families of continual learning methods have been developed:

  • Regularization-based methods constrain weight drift. Elastic Weight Consolidation (EWC) (Kirkpatrick et al. 2017) slows updates along directions important to previous tasks, using the Fisher information matrix as a quadratic penalty; a sketch of that penalty follows this list. This looks like a Bayes-by-backprop setup to me.

  • Replay methods rehearse past examples (real or generated). This goes back to Robins (1995), and is biologically inspired by hippocampal replay during sleep.

  • Dynamic architectures allocate new capacity as new tasks arrive (progressive nets, etc.). Schwarz et al. (2018) apparently combine regularization and distillation in their Progress & Compress framework.
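
To make the regularization idea concrete, here is a minimal PyTorch sketch of an EWC-style penalty with a diagonal Fisher estimate. The helper names and the plain squared-gradient Fisher proxy are my own simplifications, not the exact recipe of Kirkpatrick et al. (2017).

```python
import torch
import torch.nn.functional as F

def estimate_diagonal_fisher(model, old_task_loader):
    """Average squared gradients over old-task batches, as a diagonal Fisher proxy."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in old_task_loader:
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam=1.0):
    """Quadratic penalty that discourages drift along high-Fisher directions."""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty

# After finishing the old task (hypothetical usage):
#   old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   fisher = estimate_diagonal_fisher(model, old_task_loader)
# While training the new task:
#   loss = new_task_loss + ewc_penalty(model, fisher, old_params, lam=100.0)
```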

4 Evals as training / Stability gap

A point raised by Lange, Ven, and Tuytelaars (2023) is that evaluation protocols are weird in the continual setting. Standard practice evaluates only at task boundaries, which hides a stability gap: a sharp mid-training dip in performance. This means the model may transiently underperform much more than checkpoint-level metrics suggest.

They propose a continual evaluation framework with new metrics such as Worst‑Case Accuracy and Average Minimum Accuracy, which better capture the dynamic trajectory of learning and forgetting. This reflects a broader concern that evaluation signals can themselves shape the optimization process.
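
A cartoon of the issue (synthetic numbers, and not the paper's exact metric definitions): evaluating only at the task boundary reports the recovered accuracy and misses the transient dip that a per-step, worst-case metric exposes.

```python
import numpy as np

# Hypothetical per-step accuracy on task 1 while task 2 is being learned:
# a sharp transient dip (the stability gap) followed by partial recovery.
acc_trace = np.concatenate([
    np.full(50, 0.95),               # before task 2 starts
    np.linspace(0.95, 0.55, 20),     # the dip early in task 2
    np.linspace(0.55, 0.88, 130),    # partial recovery
])

boundary_accuracy = acc_trace[-1]      # what task-boundary evaluation reports
worst_case_accuracy = acc_trace.min()  # what continual evaluation exposes

print(f"end-of-task accuracy: {boundary_accuracy:.2f}")   # ~0.88
print(f"worst-case accuracy:  {worst_case_accuracy:.2f}") # ~0.55
```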

5 Incoming

6 References

Aleixo, Colonna, Cristo, et al. 2023. “Catastrophic Forgetting in Deep Learning: A Comprehensive Taxonomy.”
Beal. 2003. “Variational Algorithms for Approximate Bayesian Inference.”
Beaulieu, Frati, Miconi, et al. 2020. “Learning to Continually Learn.”
Daheim, Möllenhoff, Ponti, et al. 2024. “Model Merging by Uncertainty-Based Gradient Matching.”
De Lange, Aljundi, Masana, et al. 2021. “A Continual Learning Survey: Defying Forgetting in Classification Tasks.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
French. 1999. “Catastrophic Forgetting in Connectionist Networks.” Trends in Cognitive Sciences.
Gers, Schmidhuber, and Cummins. 2000. “Learning to Forget: Continual Prediction with LSTM.” Neural Computation.
Golden, Delanois, Sanda, et al. 2022. “Sleep Prevents Catastrophic Forgetting in Spiking Neural Networks by Forming a Joint Synaptic Weight Representation.” PLOS Computational Biology.
Hoffman, Blei, Wang, et al. 2013. “Stochastic Variational Inference.” arXiv:1206.7051 [cs, stat].
Jiang, Shu, Wang, et al. 2022. “Transferability in Deep Learning: A Survey.”
Khan. 2025. “Knowledge Adaptation as Posterior Correction.”
Khetarpal, Riemer, Rish, et al. 2022. “Towards Continual Reinforcement Learning: A Review and Perspectives.” Journal of Artificial Intelligence Research.
Kirkpatrick, Pascanu, Rabinowitz, et al. 2017. “Overcoming Catastrophic Forgetting in Neural Networks.” Proceedings of the National Academy of Sciences.
Lange, Ven, and Tuytelaars. 2023. “Continual Evaluation for Lifelong Learning: Identifying the Stability Gap.”
McCloskey, and Cohen. 1989. “Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem.” Edited by Gordon H. Bower. Psychology of Learning and Motivation.
Moreno-Muñoz, Artés-Rodríguez, and Álvarez. 2019. “Continual Multi-Task Gaussian Processes.” arXiv:1911.00002 [cs, stat].
Nguyen, Low, and Jaillet. 2020. “Variational Bayesian Unlearning.” In Advances in Neural Information Processing Systems.
Opper, and Winther. 1999. “A Bayesian Approach to On-Line Learning.” In On-Line Learning in Neural Networks. Publications of the Newton Institute.
———. 2000. “Mean Field Approximations for Bayesian Classification with Gaussian Processes.” In Advances in Neural Information Processing Systems.
———. 2001. “Adaptive and Self-Averaging TAP Mean-Field Theory for Probabilistic Modeling.” Physical Review E.
Pan, Swaroop, Immer, et al. 2021. “Continual Deep Learning by Functional Regularisation of Memorable Past.”
Papamarkou, Skoularidou, Palla, et al. 2024. “Position Paper: Bayesian Deep Learning in the Age of Large-Scale AI.”
Robins. 1995. “Catastrophic Forgetting, Rehearsal and Pseudorehearsal.” Connection Science.
Sato. 2001. “Online Model Selection Based on the Variational Bayes.” Neural Computation.
Schirmer, Zhang, and Nalisnick. 2024. “Test-Time Adaptation with State-Space Models.”
Schwarz, Czarnecki, Luketina, et al. 2018. “Progress & Compress: A Scalable Framework for Continual Learning.” In Proceedings of the 35th International Conference on Machine Learning.
Stickgold. 2005. “Sleep-Dependent Memory Consolidation.” Nature.
Williams, and Zipser. 1989. “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks.” Neural Computation.