Natural language processing software

Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move.

January 11, 2018 — December 3, 2023

grammar
language
machine learning
NLP
stringology

Not included here: a basic overview of the stages in a natural language processing pipeline, because there are enough overviews out there. The best practical introduction IMO is Vicki Boykis’ series What are Embeddings?, which takes you from nothing all the way through to modern vector embeddings.

There are more rigorous and/or abstract ones, but honestly I think language processing has become an engineering discipline at this point rather than an abstract mathematical one.

1 HuggingFace

HuggingFace distributes, documents, and implements a lot of Transformer/attention NLP models, and seems to be the most active neural NLP project. Certainly too active to explain what they are up to in between pumping out all the code.
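
For orientation, here is a minimal sketch of their transformers pipeline API, which is the usual entry point; the tasks shown are illustrative, and the default model weights download on first use.

    # Minimal sketch of the transformers `pipeline` API; tasks shown are
    # illustrative, and default model weights download on first use.
    from transformers import pipeline

    # Sentiment analysis with the default model for the task
    classifier = pipeline("sentiment-analysis")
    print(classifier("HuggingFace ships a staggering amount of code."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

    # Named entity recognition, merging subword pieces into whole entities
    ner = pipeline("ner", aggregation_strategy="simple")
    print(ner("Hugging Face is based in New York City."))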

2 SpaCy

http://spacy.io/:

spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using. […]

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.

Features:

  • Support for 73+ languages
  • 84 trained pipelines for 25 languages
  • Multi-task learning with pretrained transformers like BERT
  • Pretrained word vectors
  • State-of-the-art speed
  • Production-ready training system
  • Linguistically-motivated tokenization
  • Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
  • Easily extensible with custom components and attributes
  • Support for custom models in PyTorch, TensorFlow and other frameworks
  • Built in visualizers for syntax and NER
  • Easy model packaging, deployment and workflow management
  • Robust, rigorously evaluated accuracy
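
By way of illustration, a minimal sketch of the usual spaCy workflow, assuming the small English pipeline has been installed with python -m spacy download en_core_web_sm:

    # Minimal sketch of typical spaCy usage, assuming en_core_web_sm is installed
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

    # Per-token annotations: surface form, part of speech, dependency, lemma
    for token in doc:
        print(token.text, token.pos_, token.dep_, token.lemma_)

    # Named entities recognised in the document
    for ent in doc.ents:
        print(ent.text, ent.label_)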

3 Stanza

Stanza, A Python Natural Language Processing Toolkit for Many Human Languages [@QiStanza2020]

Stanza is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of speech and morphological features, to give a syntactic structure dependency parse, and to recognize named entities. The toolkit is designed to be parallel among more than 70 languages, using the Universal Dependencies formalism.

Stanza is built with highly accurate neural network components that also enable efficient training and evaluation with your own annotated data. The modules are built on top of the PyTorch library. You will get much faster performance if you run this system on a GPU-enabled machine.

In addition, Stanza includes a Python interface to the CoreNLP Java package and inherits additional functionality from there, such as constituency parsing, coreference resolution, and linguistic pattern matching.
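
A minimal sketch of the Stanza equivalent; the processor names are Stanza’s standard ones, and the first run downloads the English models:

    # Minimal sketch of a Stanza pipeline; the first run downloads English models
    import stanza

    stanza.download("en")  # fetch models once
    nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse,ner")
    doc = nlp("Stanza was built at Stanford. It handles many languages.")

    for sentence in doc.sentences:
        for word in sentence.words:
            # Universal Dependencies annotations per word
            print(word.text, word.upos, word.lemma, word.deprel)
        for ent in sentence.ents:
            print(ent.text, ent.type)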

4 Gensim

Gensim is designed to process raw, unstructured digital texts (“plain text”) using unsupervised machine learning algorithms.

The algorithms in Gensim, such as Word2Vec, FastText, Latent Semantic Indexing (LSI, LSA, LsiModel), Latent Dirichlet Allocation (LDA, LdaModel) etc, automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary—you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text documents (sentence, phrase, word…) can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents (words, phrases…).

We built Gensim from scratch for:

  • Practicality – as industry experts, we focus on proven, battle-hardened algorithms to solve real industry problems. More focus on engineering, less on academia.

  • Memory independence – there is no need for the whole training corpus to reside fully in RAM at any one time. Can process large, web-scale corpora using data streaming.

  • Performance – highly optimized implementations of popular vector space algorithms using C, BLAS and memory-mapping.
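
A minimal sketch of training word vectors with Gensim’s Word2Vec; the toy corpus and hyperparameters are illustrative only, and in real use the sentences argument can be any restartable iterable, which is what enables the streaming mentioned above.

    # Minimal sketch of Word2Vec in Gensim 4.x; corpus and hyperparameters
    # are toy values for illustration
    from gensim.models import Word2Vec

    corpus = [
        ["gensim", "streams", "large", "corpora"],
        ["word2vec", "learns", "word", "vectors"],
        ["topic", "models", "find", "semantic", "structure"],
    ]

    model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

    print(model.wv["gensim"])                  # the learned 50-d vector
    print(model.wv.most_similar("word2vec"))   # nearest neighbours in the space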

5 Bling Fire

Hi, we are a team at Microsoft called Bling (Beyond Language Understanding), we help Bing be smarter. Here we wanted to share with all of you our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing such as Tokenization, Multi-word expression matching, Unknown word-guessing, Stemming / Lemmatization just to mention a few.

Bling Fire Tokenizer provides state of the art performance for Natural Language text tokenization. Bling Fire supports the following tokenization algorithms:

  1. Pattern-based tokenization
  2. WordPiece tokenization
  3. SentencePiece Unigram LM
  4. SentencePiece BPE
  5. Induced/learned syllabification patterns (identifies possible hyphenation points within a token)

Bling Fire provides a uniform interface for working with all of these algorithms, so it makes no difference to the client whether the tokenizer is for XLNet, BERT or their own custom model.

Model files describe the algorithms they are built for and are loaded on demand from external files. There are also two default models, for NLTK-style tokenization and sentence breaking, which do not need to be loaded. The default tokenization model follows the logic of NLTK, except that hyphenated words are split and a few “errors” are fixed.
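
A minimal sketch using the blingfire Python bindings with those default models:

    # Minimal sketch of the blingfire bindings, using the built-in default
    # (NLTK-style) tokenization and sentence-breaking models
    from blingfire import text_to_sentences, text_to_words

    text = "Bling Fire tokenizes text very quickly. It ships default models."
    print(text_to_sentences(text))  # one sentence per line
    print(text_to_words(text))      # space-separated tokens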

6 torchtext

Like other deep learning frameworks, PyTorch has some basic NLP support; see torchtext. This is not the same as the next one.

7 pytext

PyText was a separate NLP framework from Facebook, built on top of PyTorch and aimed at bridging experimentation and production deployment; the project has since been archived.

8 NLTK

NLTK is a classic Python teaching library for rolling your own language processing.
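
A minimal sketch of classic NLTK usage; the tokenizer and tagger resources download once on first use:

    # Minimal sketch of classic NLTK tokenization and POS tagging;
    # resource names are those of classic NLTK releases
    import nltk

    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("NLTK is a classic teaching library.")
    print(tokens)
    print(nltk.pos_tag(tokens))  # (token, Penn Treebank tag) pairs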

9 NLP4J

Formerly ClearNLP.

The Natural Language Processing for JVM languages (NLP4J) project provides:

  • NLP tools readily available for research in various disciplines.
  • Frameworks for fast development of efficient and robust NLP components.
  • API for manipulating computational structures in NLP (e.g., dependency graph).

The project is initiated and currently led by the Emory NLP research group with many helps [sic] from the community.

10 Incoming

  • mate

  • corenlp

  • apache opennlp

  • MALLET is another big Java NLP workbenchey thing

  • IMS Open Corpus Workbench (CWB)…

    is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations.

    I’m uncertain how actively maintained this is.

  • HTK

    The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide.

There are many more but I am stopping here, having found the bits and pieces I need for my purposes.