Natural language processing software

Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move.


HuggingFace distributes, documents and implements many Transformer/attention NLP models, and seems to be the most active neural NLP project. Certainly too active to explain what they are up to in between pumping out all the code.

This and spaCy (below) are the current hotness.


spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using. […]

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.


Stanza, A Python Natural Language Processing Toolkit for Many Human Languages @QiStanza2020

Stanza is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of speech and morphological features, to give a syntactic structure dependency parse, and to recognize named entities. The toolkit is designed to be parallel among more than 70 languages, using the Universal Dependencies formalism.

Stanza is built with highly accurate neural network components that also enable efficient training and evaluation with your own annotated data. The modules are built on top of the PyTorch library. You will get much faster performance if you run this system on a GPU-enabled machine.

In addition, Stanza includes a Python interface to the CoreNLP Java package and inherits additional functionality from there, such as constituency parsing, coreference resolution, and linguistic pattern matching.


Gensim is designed to process raw, unstructured digital texts ("plain text") using unsupervised machine learning algorithms.

The algorithms in Gensim, such as Word2Vec, FastText, Latent Semantic Indexing (LSI, LSA, LsiModel), Latent Dirichlet Allocation (LDA, LdaModel) etc, automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary -- you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text documents (sentence, phrase, word...) can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents (words, phrases...).
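The "express documents in a vector representation, then query for similarity" step can be illustrated without Gensim itself. What follows is a minimal standard-library sketch, not Gensim's API: it uses plain bag-of-words counts, a much cruder representation than LSI or LDA, and ranks a toy corpus by cosine similarity to a query. The function names and the three-document corpus are assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, corpus):
    """Return (index, score) pairs for the corpus documents, best match first."""
    query_vec = bag_of_words(query)
    scores = [
        (i, cosine_similarity(query_vec, bag_of_words(doc)))
        for i, doc in enumerate(corpus)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus, for illustration only.
corpus = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary unordered trees",
]

ranking = rank_by_similarity("computer system survey", corpus)
```

Here `ranking` puts the second document first, since it shares all three query terms. A topic model would go further and match documents that share no literal words with the query, which is the point of the learned semantic representation.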


Bling Fire Tokenizer is a tokenizer designed for fast-speed and quality tokenization of Natural Language text. It mostly follows the tokenization logic of NLTK, except hyphenated words are split and a few errors are fixed.

This looks like it is also good for non-NLP tokenization tasks.
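As a rough illustration of what "hyphenated words are split" means in practice, here is a regular-expression sketch using only the Python standard library. It is not Bling Fire's implementation or API. The pattern, the `tokenize` name, and the choice to keep each hyphen as its own token are assumptions of this sketch, and a real tokenizer handles many more cases.

```python
import re

# A token is either a run of word characters or a single
# non-space, non-word character (punctuation, hyphens, ...).
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens, breaking at hyphens."""
    return _TOKEN_PATTERN.findall(text)


tokens = tokenize("A state-of-the-art tokenizer, fast!")
```

With this pattern, `state-of-the-art` comes out as seven tokens (four words and three hyphens) rather than one.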


Like other deep learning frameworks, PyTorch has some basic NLP support; see torchtext (the pytorch/text repository). Is this the same as the next one?


NLTK is a classic Python teaching library for rolling your own language processing.


Formerly ClearNLP.

The Natural Language Processing for JVM languages (NLP4J) project provides:

  • NLP tools readily available for research in various disciplines.

  • Frameworks for fast development of efficient and robust NLP components.

  • API for manipulating computational structures in NLP (e.g., dependency graph).

The project is initiated and currently led by the Emory NLP research group with many helps [sic] from the community.


  • mate

  • CoreNLP

  • Apache OpenNLP

  • MALLET is another big Java NLP workbenchey thing

  • IMS Open Corpus Workbench (CWB)…

    is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations.

    I’m uncertain how actively maintained this is.

  • HTK

    The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide.
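For readers who have not met hidden Markov models, the core decoding step (finding the most likely hidden state sequence for a series of observations) fits in a few lines. This is a plain-Python sketch of the Viterbi algorithm, unrelated to HTK's own code or file formats. The two-state model and all of its probabilities are a hypothetical toy example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] holds the probability and path of the best sequence ending in s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that maximises the path probability.
            prob, prev = max(
                (best[p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            new_best[s] = (prob, best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])


# Hypothetical toy model: hidden health state, observed symptom.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

probability, path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```

Multiplying raw probabilities like this underflows on long sequences, so practical implementations work with log probabilities instead.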

There are many more, but I am stopping with the links having found the bits and pieces I need for my purposes.
