Natural language processing

Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move.

Computational language translation, parsing, search, generation, and understanding.

A mare’s nest of intersecting computational, philosophical, and mathematical challenges (e.g. semantics, grammatical inference, learning theory) that humans seem able to handle subconsciously, and which we therefore hope to train machines on. Moreover, it is a problem of great commercial value, so we can likely muster the resources to tackle it. The interesting thing right now is the neural-NLP explosion: if anything has a good chance of producing artificial general intelligence, it might be neural NLP, where certain architectures (especially highly evolved attention mechanisms) are producing eerily good results (Brown et al. 2020).
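As a loose illustration of the attention mechanisms mentioned above, here is a minimal scaled dot-product attention in plain stdlib Python. Toy dimensions, hand-picked vectors, no learned weights: a sketch of the core operation, not a transformer.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention: each query scores every key,
    and returns a softmax-weighted mix of the corresponding values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# A query that strongly matches the first key mostly retrieves the first value.
print(attention([[10.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]]))
```

The real thing adds learned projections, multiple heads, and masking, but the weighted-lookup core is just this.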

What is NLP?


Fun fact: the notorious OpenAI GPT-3 is available as an API.


Stanza: A Python Natural Language Processing Toolkit for Many Human Languages (Qi et al. 2020)

Stanza is a Python natural language analysis package. It contains tools, usable in a pipeline, that convert a string of human-language text into lists of sentences and words, generate the base forms of those words along with their parts of speech and morphological features, produce a syntactic dependency parse, and recognize named entities. The toolkit is designed to be parallel across more than 70 languages, using the Universal Dependencies formalism.

Stanza is built with highly accurate neural network components that also enable efficient training and evaluation with your own annotated data. The modules are built on top of the PyTorch library. You will get much faster performance if you run this system on a GPU-enabled machine.

In addition, Stanza includes a Python interface to the CoreNLP Java package and inherits additional functionality from there, such as constituency parsing, coreference resolution, and linguistic pattern matching.
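The pipeline output described above — sentences, lemmas, parts of speech, dependency heads — is plain data in the end. Here is a hand-written sketch of the kind of Universal Dependencies-style token records such a pipeline produces (toy data, not actual Stanza output; the tuple layout is my own shorthand):

```python
# One parsed sentence as UD-style token records: each word carries an id,
# a lemma, a universal part-of-speech tag, and a pointer to its syntactic
# head (0 means this word is the root of the dependency tree).
parse = [
    # (id, text,   lemma,  upos,    head, deprel)
    (1,  "Dogs",  "dog",  "NOUN",  2,    "nsubj"),
    (2,  "bark",  "bark", "VERB",  0,    "root"),
    (3,  ".",     ".",    "PUNCT", 2,    "punct"),
]

def root_word(parse):
    """Return the text of the word whose head is 0 (the sentence root)."""
    return next(text for (_, text, _, _, head, _) in parse if head == 0)

print(root_word(parse))  # → bark
```

Stanza's own `Document`/`Sentence`/`Word` objects wrap the same information with richer accessors, but this is the shape of the data flowing out of the pipeline.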


BlingFire (Bling Fire Tokenizer) is a tokenizer designed for fast, high-quality tokenization of natural-language text. It mostly follows the tokenization logic of NLTK, except that hyphenated words are split and a few errors are fixed.

This looks like it is also good for non-NLP tokenization tasks.
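For a feel of what rule-based tokenization of this flavour does, here is a toy regex tokenizer in stdlib Python. Under this deliberately crude rule, hyphens count as punctuation, so hyphenated words split, as BlingFire's output does; this is an illustrative sketch, not BlingFire's actual algorithm:

```python
import re

# \w+ grabs runs of word characters; [^\w\s] grabs any single
# non-word, non-space character (punctuation, including hyphens).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens; hyphenated
    words come apart, e.g. "state-of-the-art" -> 7 tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("A state-of-the-art tokenizer, isn't it?"))
```

Real tokenizers handle contractions, URLs, and Unicode far more carefully; the point is only that tokenization is a small, rule-driven transform sitting at the front of every NLP pipeline.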


spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using. […]

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.


Like other deep learning frameworks, PyTorch has some basic NLP support; see torchtext.


NLTK is a classic Python teaching library for rolling your own language processing.


Formerly ClearNLP.

The Natural Language Processing for JVM languages (NLP4J) project provides:

NLP tools readily available for research in various disciplines. Frameworks for fast development of efficient and robust NLP components. API for manipulating computational structures in NLP (e.g., dependency graph). The project is initiated and currently led by the Emory NLP research group with many helps [sic] from the community.

Misc other

  • mate

  • corenlp

  • apache opennlp

  • MALLET is another big java NLP workbenchey thing

  • IMS Open Corpus Workbench (CWB)…

    is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations.

    I’m uncertain how actively maintained this is.

  • HTK

    The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide.
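For flavour, the core HMM computation a toolkit like HTK is built around — the forward algorithm, which scores an observation sequence under a hidden Markov model — fits in a few lines of stdlib Python. This is a toy discrete model with hand-picked probabilities; HTK itself works with continuous acoustic features:

```python
def hmm_forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: total probability of an observation sequence,
    summing over all hidden state paths via dynamic programming."""
    # Initialise with start probabilities times first emission.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        # Propagate forward: sum over predecessor states, then emit.
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# Hypothetical two-state model emitting symbols "x" and "y".
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}

print(hmm_forward(["x", "y"], states, start_p, trans_p, emit_p))  # → 0.2156
```

Speech recognizers pair this with Viterbi decoding and Gaussian-mixture (or neural) emission models, but the recursion is the same.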

There are many more, but I am stopping with the links, having found the bits and pieces I need for my purposes.

Angluin, Dana. 1988. “Identifying Languages from Stochastic Examples.” No. YALEU/DCS/RR-614.

Arisoy, Ebru, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. 2012. “Deep Neural Network Language Models.” In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-Gram Model? On the Future of Language Modeling for HLT, 20–28. WLM ’12. Montreal, Canada: Association for Computational Linguistics.

Autebert, Jean-Michel, Jean Berstel, and Luc Boasson. 1997. “Context-Free Languages and Pushdown Automata.” In Handbook of Formal Languages, Vol. 1, edited by Grzegorz Rozenberg and Arto Salomaa, 111–74. New York, NY, USA: Springer-Verlag New York, Inc.

Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. 1st ed. Addison Wesley.

Bail, Christopher Andrew. 2016. “Combining Natural Language Processing and Network Analysis to Examine How Advocacy Organizations Stimulate Conversation on Social Media.” Proceedings of the National Academy of Sciences, September, 201607151.

Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. “A Neural Probabilistic Language Model.” Journal of Machine Learning Research 3 (Feb): 1137–55.

Berstel, Jean, and Luc Boasson. 1990. “Transductions and Context-Free Languages.” In Handbook of Theoretical Computer Science, Vol. A: Algorithms and Complexity, edited by J. van Leeuwen, Albert R. Meyer, M. Nivat, Matthew Paterson, and D. Perrin, 1–278.

Blazek, Paul J., and Milo M. Lin. 2020. “A Neural Network Model of Perception and Reasoning,” February.

Booth, Taylor L, and R. A. Thompson. 1973. “Applying Probability Measures to Abstract Languages.” IEEE Transactions on Computers C-22 (5): 442–50.

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners,” June.

Charniak, Eugene. 1996. Statistical Language Learning. Reprint. A Bradford Book.

Chater, Nick, and Christopher D Manning. 2006. “Probabilistic Models of Language Processing and Acquisition.” Trends in Cognitive Sciences 10 (7): 335–44.

Cho, Kyunghyun, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. “Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation.” In EMNLP 2014.

Clark, Alexander, and Rémi Eyraud. 2005. “Identification in the Limit of Substitutable Context-Free Languages.” In Algorithmic Learning Theory, edited by Sanjay Jain, Hans Simon, and Etsuji Tomita, 3734:283–96. Lecture Notes in Computer Science. Springer Berlin / Heidelberg.

Clark, Alexander, Christophe Costa Florêncio, and Chris Watkins. 2006. “Languages as Hyperplanes: Grammatical Inference with String Kernels.” In Machine Learning: ECML 2006, edited by Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, 90–101. Lecture Notes in Computer Science 4212. Springer Berlin Heidelberg.

Clark, Alexander, Christophe Costa Florêncio, Chris Watkins, and Mariette Serayet. 2006. “Planar Languages and Learnability.” In Grammatical Inference: Algorithms and Applications, edited by Yasubumi Sakakibara, Satoshi Kobayashi, Kengo Sato, Tetsuro Nishino, and Etsuji Tomita, 148–60. Lecture Notes in Computer Science 4201. Springer Berlin Heidelberg.

Clark, Peter, Oyvind Tafjord, and Kyle Richardson. 2020. “Transformers as Soft Reasoners over Language.” In IJCAI 2020.

Collins, Michael, and Nigel Duffy. 2002. “Convolution Kernels for Natural Language.” In Advances in Neural Information Processing Systems 14, edited by T. G. Dietterich, S. Becker, and Z. Ghahramani, 625–32. MIT Press.

Gold, E Mark. 1967. “Language Identification in the Limit.” Information and Control 10 (5): 447–74.

Gonzalez, R. C., and M. G. Thomason. 1978. Syntactic Pattern Recognition: An Introduction. Addison Wesley Publishing Company.

Grefenstette, Edward, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. “Learning to Transduce with Unbounded Memory,” June.

Greibach, Sheila A. 1966. “The Unsolvability of the Recognition of Linear Context-Free Languages.” J. ACM 13 (4): 582–87.

Hopcroft, John E., and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages and Computation. 1st ed. Addison-Wesley Publishing Company.

Kontorovich, Leonid (Aryeh), Corinna Cortes, and Mehryar Mohri. 2008. “Kernel Methods for Learning Languages.” Theoretical Computer Science, Algorithmic Learning Theory, 405 (3): 223–36.

Kontorovich, Leonid, Corinna Cortes, and Mehryar Mohri. 2006. “Learning Linearly Separable Languages.” In Algorithmic Learning Theory, edited by José L. Balcázar, Philip M. Long, and Frank Stephan, 288–303. Lecture Notes in Computer Science 4264. Springer Berlin Heidelberg.

Lafferty, John D., Andrew McCallum, and Fernando C. N. Pereira. 2001. “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.” In Proceedings of the Eighteenth International Conference on Machine Learning, 282–89. ICML ’01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Lamb, Luis C., Artur Garcez, Marco Gori, Marcelo Prates, Pedro Avelar, and Moshe Vardi. 2020. “Graph Neural Networks Meet Neural-Symbolic Computing: A Survey and Perspective.” In IJCAI 2020.

Lipton, Zachary C., John Berkowitz, and Charles Elkan. 2015. “A Critical Review of Recurrent Neural Networks for Sequence Learning,” May.

Manning, Christopher D. 2002. “Probabilistic Syntax.” In Probabilistic Linguistics, 289–341. Cambridge, MA: MIT Press.

Manning, Christopher D, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Manning, Christopher D, and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, Mass.: MIT Press.

Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. 2013. “Exploiting Similarities Among Languages for Machine Translation,” September.

Mikolov, Tomáš, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. “Recurrent Neural Network Based Language Model.” In Eleventh Annual Conference of the International Speech Communication Association.

Mitra, Bhaskar, and Nick Craswell. 2017. “Neural Models for Information Retrieval,” May.

Mohri, Mehryar, Fernando Pereira, and Michael Riley. 1996. “Weighted Automata in Text and Speech Processing.” In Proceedings of the 12th Biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended Finite State Models of Language. Budapest, Hungary: John Wiley and Sons, Chichester.

———. 2002. “Weighted Finite-State Transducers in Speech Recognition.” Computer Speech & Language 16 (1): 69–88.

O’Donnell, Timothy J., Joshua B. Tenenbaum, and Noah D. Goodman. 2009. “Fragment Grammars: Exploring Computation and Reuse in Language,” March.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014) 12.

Petersson, Karl-Magnus, Vasiliki Folia, and Peter Hagoort. 2012. “What Artificial Grammar Learning Reveals About the Neurobiology of Syntax.” Brain and Language, The Neurobiology of Syntax, 120 (2): 83–95.

Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. “Stanza: A Python Natural Language Processing Toolkit for Many Human Languages,” March.

Rijsbergen, C. J. van. 1979. Information Retrieval. 2nd ed. Butterworth-Heinemann.

Salakhutdinov, Ruslan. 2015. “Learning Deep Generative Models.” Annual Review of Statistics and Its Application 2 (1): 361–85.

Solan, Zach, David Horn, Eytan Ruppin, and Shimon Edelman. 2005. “Unsupervised Learning of Natural Languages.” Proceedings of the National Academy of Sciences of the United States of America 102 (33): 11629–34.

Sutton, Charles, Andrew McCallum, and Khashayar Rohanimanesh. 2007. “Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data.” Journal of Machine Learning Research 8 (May): 693–723.

Wetherell, C. S. 1980. “Probabilistic Languages: A Review and Some Open Questions.” ACM Comput. Surv. 12 (4): 361–79.

Wolff, J Gerard. 2000. “Syntax, Parsing and Production of Natural Language in a Framework of Information Compression by Multiple Alignment, Unification and Search.” Journal of Universal Computer Science 6 (8): 781–829.