Differentiable learning of automata

2016-10-14 — 2025-06-05

Wherein the Training of Stack Machines and Random-Access Machines Is Undertaken via Differentiable Neural Formalisms, and Turing-Machine–like Controllers Are Induced From Data.

compsci
dynamical systems
machine learning
neural nets
NLP
signal processing
state space models
stochastic processes
stringology
time series

We learn stack machines, random access machines, nested hierarchical parsing machines, Turing machines, and whatever other automata-with-memory we want from data. In other words, we’re teaching computers to program themselves via a deep learning formalism.

Figure 1

This is an obvious idea — one of the original ideas from the GOFAI days. How to make it work in NNs isn’t obvious, but there are some charming toy examples. There are also arguably non-toy examples in the form of transformers which learn to parse and generate actual code.

Obviously, a hypothetical superhuman Artificial General Intelligence would be good at handling computer-science problems.

Directly learning automata isn’t the absolute hippest research area right now, though — it’s hard in general and also not very saleable. However, it’s simple enough, in some sense, to be analytically tractable. Some progress has been made. I feel that most of the hyped research that looks like differentiable computer learning sits in the slightly better-contained area of reinforcement learning where more progress can be made, or in the hot area of large LLMs which are harder to explain but solve the same kind of problems while looking different inside.

Related: grammatical inference, memory machines, computational complexity results in NNs, reasoning with LLMs. Automata are a kind of formalization of reasoning, but I think they’re more easily considered as their own thing. If we call this program synthesis, then these things sound connected. See learnable memory for a survey of the state of the art.

1 Programs as singularities

Murfet and Troiani (2025), citing (Clift and Murfet 2019, 2020; Clift, Murfet, and Wallbridge 2020; Ehrhard and Regnier 2003; Waring, n.d.) uses singular learning theory to study the geometry of the space of programs and how to learn them.

2 Differentiable Cellular Automata

See differentiable swarm automata .

3 Recursion on a latent

A small-model thread that belongs here more than with the language-model reasoners: HRM (G. Wang et al. 2025) and its stripped-down successor TRM (Jolicoeur-Martineau 2025) crack ARC-AGI, Sudoku and mazes with 7–27M-parameter nets that recurse on a fixed-size latent instead of emitting a long chain of thought. HRM’s headline was a brain-inspired two-timescale hierarchy, but ARC Prize’s ablation pinned the gains on the outer refine-and-retry loop plus heavy per-task augmentation rather than the hierarchy — and TRM then beat HRM with a plain two-layer net, which rather makes the case. This makes them siblings to the ARC-AGI Without Pretraining paper (Liao and Gu 2025). They are a cousin of test-time scaling — more compute per problem beats a bigger single shot — but transductive puzzle-solvers trained per task, learning an iterative computer rather than doing language. Whether recurse-on-a-latent crosses over to language is the open question.

4 Incoming

Figure 2: Differentiable pointers

Latent space of small turing machines

Blazek claims his neural networks implement predicate logic directly and are tractable, which would be interesting to look into (Blazek and Lin 2021, 2020; Blazek, Venkatesh, and Lin 2021).

Google-branded: Differentiable Neural Computers.

Christopher Olah’s Characteristically pedagogic intro.

Adrian Colyer’s introduction to neural Turing machines.

Andrej Karpathy’s memory machine list.

Facebook’s GTN:

GTN is an open source framework for automatic differentiation with a powerful, expressive type of graph called weighted finite-state transducers (WFSTs). Just as PyTorch provides a framework for automatic differentiation with tensors, GTN provides such a framework for WFSTs. AI researchers and engineers can use GTN to more effectively train graph-based machine learning models.

5 References

Blazek, and Lin. 2020. A Neural Network Model of Perception and Reasoning.” arXiv:2002.11319 [Cs, q-Bio].
———. 2021. Explainable Neural Networks That Simulate Reasoning.” Nature Computational Science.
Blazek, Venkatesh, and Lin. 2021. Deep Distilling: Automated Code Generation Using Explainable Deep Learning.” arXiv:2111.08275 [Cs].
Bottou. 2011. From Machine Learning to Machine Reasoning.” arXiv:1102.1808 [Cs].
Bubeck, Chandrasekaran, Eldan, et al. 2023. Sparks of Artificial General Intelligence: Early Experiments with GPT-4.”
Clark, Tafjord, and Richardson. 2020. Transformers as Soft Reasoners over Language.” In IJCAI 2020.
Clift, and Murfet. 2019. Derivatives of Turing Machines in Linear Logic.”
———. 2020. Encodings of Turing Machines in Linear Logic.” Mathematical Structures in Computer Science.
Clift, Murfet, and Wallbridge. 2020. Geometry of Program Synthesis.”
Dehghani, Gouws, Vinyals, et al. 2019. Universal Transformers.” In.
Dziri, Lu, Sclar, et al. 2023. Faith and Fate: Limits of Transformers on Compositionality.”
Ehrhard, and Regnier. 2003. The Differential Lambda-Calculus.” Theoretical Computer Science.
Ellis, Solar-Lezama, and Tenenbaum. 2016. Sampling for Bayesian Program Learning.” In Advances in Neural Information Processing Systems 29.
Garcez, and Lamb. 2020. Neurosymbolic AI: The 3rd Wave.”
Graves, Wayne, and Danihelka. 2014. Neural Turing Machines.” arXiv:1410.5401 [Cs].
Graves, Wayne, Reynolds, et al. 2016. Hybrid Computing Using a Neural Network with Dynamic External Memory.” Nature.
Grefenstette, Hermann, Suleyman, et al. 2015. Learning to Transduce with Unbounded Memory.” arXiv:1506.02516 [Cs].
Gulcehre, Chandar, Cho, et al. 2016. Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes.” arXiv:1607.00036 [Cs].
Hannun, Pratap, Kahn, et al. 2020. Differentiable Weighted Finite-State Transducers.” arXiv:2010.01003 [Cs, Stat].
Ibarz, Kurin, Papamakarios, et al. 2022. A Generalist Neural Algorithmic Learner.”
Jaitly, Sussillo, Le, et al. 2015. A Neural Transducer.” arXiv:1511.04868 [Cs].
Jolicoeur-Martineau. 2025. Less Is More: Recursive Reasoning with Tiny Networks.”
Kaiser, and Sutskever. 2015. Neural GPUs Learn Algorithms.” arXiv:1511.08228 [Cs].
Kim, and Bassett. 2022. A Neural Programming Language for the Reservoir Computer.” arXiv:2203.05032 [Cond-Mat, Physics:nlin].
Kleinberg, and Mullainathan. 2024. Language Generation in the Limit.”
Lai, Domke, and Sheldon. 2022. Variational Marginal Particle Filters.” In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics.
Lamb, Garcez, Gori, et al. 2020. Graph Neural Networks Meet Neural-Symbolic Computing: A Survey and Perspective.” In IJCAI 2020.
Lample, and Charton. 2019. Deep Learning for Symbolic Mathematics.” arXiv:1912.01412 [Cs].
Liao, and Gu. 2025. ARC-AGI Without Pretraining.”
Looks, Herreshoff, Hutchins, et al. 2017. Deep Learning with Dynamic Computation Graphs.” In Proceedings of ICLR.
Murfet, and Troiani. 2025. Programs as Singularities.”
Peng, Narayanan, and Papadimitriou. 2024. On Limitations of the Transformer Architecture.”
Pereira. 2002. Formal Grammar and Information Theory: Together Again? In Current Issues in Linguistic Theory.
Perez, and Liu. 2016. Gated End-to-End Memory Networks.” arXiv:1610.04211 [Cs, Stat].
Putzky, and Welling. 2017. Recurrent Inference Machines for Solving Inverse Problems.” arXiv:1706.04008 [Cs].
Saldyt, and Kambhampati. 2025. Mind The Gap: Deep Learning Doesn’t Learn Deeply.”
Schuurmans, Dai, and Zanini. 2024. Autoregressive Large Language Models Are Computationally Universal.”
Veličković, and Blundell. 2021. Neural Algorithmic Reasoning.” Patterns.
Wang, Junxiong, Gangavarapu, Yan, et al. 2024. MambaByte: Token-Free Selective State Space Model.”
Wang, Guan, Li, Sun, et al. 2025. Hierarchical Reasoning Model.”
Wang, Cheng, and Niepert. 2019. State-Regularized Recurrent Neural Networks.”
Waring. n.d. “Geometric Perspectives on Program Synthesis and Semantics.”
Wei, Fan, Carin, et al. 2017. An Inner-Loop Free Solution to Inverse Problems Using Deep Neural Networks.” arXiv:1709.01841 [Cs].
Weston, Chopra, and Bordes. 2014. Memory Networks.” arXiv:1410.3916 [Cs, Stat].
Wu, Tan, Wang, et al. 2024. Beyond Language Models: Byte Models Are Digital World Simulators.”
Zhang, Backurs, Bubeck, et al. 2022. Unveiling Transformers with LEGO: A Synthetic Reasoning Task.”
Zhu, Zhang, Shi, et al. 2025. The 4th Dimension for Scaling Model Size.”