Applied string mangling

Regexes, parsing, tokenising etc

2019-12-08 — 2021-07-05

compsci

language

networks

stringology

A.k.a. Un-natural language processing.

1 Regexp

Figure 1: Image used under CC licence from Marijn Haverbeke’s Eloquent JavaScript.

A.k.a. regexes. A.k.a. “regular expressions”, a name they owe to a principled origin in formal language theory, where they describe regular languages. However, regexes as commonly encountered are one particular notation for specifying a language, and they do not correspond exactly to the class of regular languages.

The default tool for string matching, available in a variety of flavours, all equally boring.

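Catastrophic regex: on backtracking engines, patterns with nested quantifiers can take time exponential in the length of the input.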
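A rough illustration of the catastrophic case, assuming Python's built-in `re` module (a backtracking engine). The pattern `(a+)+$` and the helper `time_match` are chosen here for illustration and are not from any particular source. The sketch times a nested-quantifier pattern against a flat pattern that accepts the same strings, on inputs that almost match.

```python
import re
import time

# Nested quantifiers: a run of a's can be split between the inner and
# outer '+' in many different ways, and a backtracking engine tries them all.
EVIL = re.compile(r'(a+)+$')
# Accepts the same strings, with no nesting and therefore no ambiguity.
SAFE = re.compile(r'a+$')

def time_match(pattern, n):
    """Try to match n a's followed by a 'b'; return (match, seconds)."""
    subject = 'a' * n + 'b'  # the trailing 'b' forces the match to fail
    start = time.perf_counter()
    result = pattern.match(subject)
    elapsed = time.perf_counter() - start
    return result, elapsed

for n in (10, 14, 18):
    _, evil_t = time_match(EVIL, n)
    _, safe_t = time_match(SAFE, n)
    print(f"n={n:2d}  nested: {evil_t:.4f}s  flat: {safe_t:.6f}s")
```

The exact timings depend on the machine, but the nested column should grow much faster than the flat one as n increases. Keep n small when experimenting, since each extra character roughly doubles the work for the nested pattern.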

Because these are so ubiquitous, useful, and boring, there are a million bikeshedded tools for interactive regex design.

AutoRegex converts English to regex with natural language processing.
ihateregex visualises regexes and lets you design them interactively.
regexper visualises regexes beautifully.
extendsclass tests and visualises regexes in PHP, Python and JavaScript flavours.
Rubular is a Ruby-based regular expression editor.
regex101 is similar.
regexr is the same again.

Comby is a structural search-and-replace tool designed for code.

1.1 Handy regexes

r'\b(\w+)\s+\1\b' # duplicate words (essential for this blog)
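A minimal sketch of how that pattern might be used, assuming Python's `re` module. The helper names `find_duplicates` and `remove_duplicates` are made up for this example.

```python
import re

# A whole word, some whitespace, then the same word again (backreference \1).
DUPLICATE_WORD = re.compile(r'\b(\w+)\s+\1\b')

def find_duplicates(text):
    """Return each word that appears twice in a row."""
    return [m.group(1) for m in DUPLICATE_WORD.finditer(text)]

def remove_duplicates(text):
    """Replace each doubled word with a single occurrence."""
    return DUPLICATE_WORD.sub(r'\1', text)

sample = "This is is a sentence with with doubled words."
print(find_duplicates(sample))
print(remove_duplicates(sample))
```

The leading `\b` stops the pattern from treating the tail of "This" plus "is" as a duplicate. The match is case-sensitive, and a word repeated three times in a row is only reduced to two by a single pass, so the substitution may need applying more than once.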

2 Parsers

The ad hoc world of regexes not cutting it? Why not generate a parser? Since every programming language implementation needs one, there are a lot of options. Since regexes can already handle regular languages, you are probably looking for a parser for deterministic context-free languages. I do not have much to say, except maybe check the Wikipedia list? Or why not use David Beazley’s SLY? That looks like a nice parser generator.
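To make the step from regexes to parsers concrete, here is a hand-written sketch using only the Python standard library. It is not SLY and not generated by any tool. A regex does the tokenising and a small recursive-descent parser handles the nesting that a regex cannot. The grammar, the `Parser` class and the `evaluate` function are all invented for this example.

```python
import re

# Tokeniser: one regex with a named alternative per token kind.
TOKEN = re.compile(r'(?P<NUMBER>\d+(?:\.\d+)?)|(?P<OP>[-+*/()])|(?P<SKIP>\s+)')

def tokenise(text):
    """Split text into (kind, value) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != 'SKIP':
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

class Parser:
    """Illustrative grammar, evaluated while parsing:
        expr   : term (('+' | '-') term)*
        term   : factor (('*' | '/') factor)*
        factor : NUMBER | '-' factor | '(' expr ')'
    """
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, None)  # end-of-input marker

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in (('OP', '+'), ('OP', '-')):
            _, op = self.advance()
            rhs = self.term()
            value = value + rhs if op == '+' else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in (('OP', '*'), ('OP', '/')):
            _, op = self.advance()
            rhs = self.factor()
            value = value * rhs if op == '*' else value / rhs
        return value

    def factor(self):
        kind, text = self.advance()
        if kind == 'NUMBER':
            return float(text)
        if (kind, text) == ('OP', '-'):
            return -self.factor()
        if (kind, text) == ('OP', '('):
            value = self.expr()  # recursion handles arbitrary nesting
            if self.advance() != ('OP', ')'):
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError(f"unexpected token {text!r}")

def evaluate(text):
    parser = Parser(tokenise(text))
    value = parser.expr()
    if parser.peek() != (None, None):
        raise SyntaxError("trailing input")
    return value

print(evaluate("(1 + 2) * 3 - 4 / 2"))
```

This sketch computes the value directly instead of building a syntax tree, which keeps it short. A parser generator would instead take the grammar as input and produce the equivalent of the `expr`, `term` and `factor` methods.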