Applied string mangling

Regexes, parsing, tokenising etc

December 9, 2019 — July 5, 2021


A.k.a. Un-natural language processing.

1 Regexp

Figure 1: Image used under CC licence from Marijn Haverbeke’s Eloquent JavaScript.

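A.k.a. regexes. A.k.a. “regular expressions”, a name inherited from their principled origin in formal language theory, where they describe the regular languages. However, regexes as commonly encountered are a particular practical notation for specifying string patterns, and they do not correspond exactly to that class; features such as backreferences can match languages that are not regular.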

The default tool for string matching, available in a variety of flavours, all equally boring.

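Beware the catastrophic regex: in a backtracking engine, a pattern with nested quantifiers can take time exponential in the input length to fail on certain inputs.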
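To make the catastrophic case concrete, here is a minimal sketch assuming Python's built-in `re` module, which is a backtracking engine. The names `RISKY`, `SAFE` and `time_match` are made up for this illustration. Both patterns accept exactly the same strings, but the nested quantifier in `RISKY` gives the engine exponentially many ways to split a run of `a`s before it can conclude that a near-miss input fails.

```python
import re
import time

# Nested quantifiers: the classic shape that invites catastrophic backtracking.
RISKY = re.compile(r'^(a+)+$')
# Accepts the same strings as RISKY, but without the nesting.
SAFE = re.compile(r'^a+$')


def time_match(pattern, text):
    """Return (matched, seconds) for a single attempt to match text."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


def compare(max_n=18):
    """Print timings for near-miss inputs: a run of 'a's spoiled by one 'b'."""
    for n in range(10, max_n + 1, 4):
        text = 'a' * n + 'b'
        _, risky_seconds = time_match(RISKY, text)
        _, safe_seconds = time_match(SAFE, text)
        print(f'n={n:2d}  risky={risky_seconds:.4f}s  safe={safe_seconds:.6f}s')
```

Calling `compare()` should show the risky timings growing rapidly with each step in `n` while the safe ones stay flat. Keep `max_n` small, since each extra character roughly doubles the work for `RISKY`.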

Because these are so ubiquitous, and useful, and boring, there are a million bikeshedded tools for interactive regex design.

Comby is a parsing/search-and-replace tool designed for code.

1.1 Handy regexes

r'\b(\w+)\s+\1\b' # duplicate words (essential for this blog)
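A small sketch of how the duplicate-word pattern might be used with Python's `re` module. The helper names `find_duplicates` and `collapse_duplicates` are invented for this example. Note that the match is case-sensitive as written, so “The the” would slip through unless you add `re.IGNORECASE`.

```python
import re

# The duplicate-word pattern from above: a word, whitespace, the same word again.
DUPLICATE_WORD = re.compile(r'\b(\w+)\s+\1\b')


def find_duplicates(text):
    """Return each word that is immediately repeated in text."""
    return [match.group(1) for match in DUPLICATE_WORD.finditer(text)]


def collapse_duplicates(text):
    """Replace each doubled word with a single copy."""
    return DUPLICATE_WORD.sub(r'\1', text)
```

The trailing `\b` matters: without it, “the theory” would count as a duplicate because “the” is a prefix of “theory”.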

2 Parsers

The ad hoc world of regexes not cutting it? Why not generate a parser? Since every computer language out there does this, there are a lot of options. Since regexes can already handle regular languages, you are probably looking for a parser for deterministic context-free languages. I do not have much to say, except maybe check the Wikipedia list? Why not use David Beazley’s SLY? That looks like a nice parser.
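For a feel of what a parser buys you over a regex, here is a minimal hand-written recursive-descent sketch for arithmetic expressions. It is not SLY and not generated by any tool. It uses only the standard library, and every name in it (`TOKEN`, `tokenise`, `Parser`, `evaluate`) is hypothetical. A regex does the tokenising; the nested parentheses, which a true regular expression cannot handle, are dealt with by the functions calling each other recursively.

```python
import re

# One alternative per token kind; whitespace is matched so it can be skipped.
TOKEN = re.compile(r'(?P<NUM>\d+)|(?P<OP>[-+*/()])|(?P<WS>\s+)')


def tokenise(text):
    """Split text into ('NUM', int) and ('OP', str) tokens, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f'unexpected character {text[pos]!r} at position {pos}')
        if m.lastgroup == 'NUM':
            tokens.append(('NUM', int(m.group())))
        elif m.lastgroup == 'OP':
            tokens.append(('OP', m.group()))
        pos = m.end()
    return tokens


class Parser:
    """Grammar: expr := term (('+'|'-') term)*
                term := factor (('*'|'/') factor)*
                factor := NUM | '-' factor | '(' expr ')'"""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ('EOF', None)

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        value = self.term()
        while self.peek() in (('OP', '+'), ('OP', '-')):
            _, op = self.take()
            rhs = self.term()
            value = value + rhs if op == '+' else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in (('OP', '*'), ('OP', '/')):
            _, op = self.take()
            rhs = self.factor()
            value = value * rhs if op == '*' else value / rhs
        return value

    def factor(self):
        kind, value = self.take()
        if kind == 'NUM':
            return value
        if (kind, value) == ('OP', '-'):
            return -self.factor()
        if (kind, value) == ('OP', '('):
            inner = self.expr()  # recursion handles arbitrary nesting
            if self.take() != ('OP', ')'):
                raise SyntaxError('expected )')
            return inner
        raise SyntaxError(f'unexpected token {value!r}')


def evaluate(text):
    """Parse and evaluate an arithmetic expression in one pass."""
    parser = Parser(tokenise(text))
    result = parser.expr()
    if parser.peek() != ('EOF', None):
        raise SyntaxError('trailing input')
    return result
```

Something like `evaluate('2 * (3 + (4 - 1))')` exercises the nesting. A parser generator does the same job from a declarative grammar instead of hand-written methods, which is usually the better choice once the grammar grows.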