Text data processing

2014-12-15 — 2025-04-02

Wherein the peculiar arts of text data processing are surveyed, and command-line contrivances for CSV, JSON and LLM‑assisted scripting are evoked, including jq, xsv, llm and ttok.


Getting data into a text-like format is an entry pass to a whole world of weird tools to manage and process it, from the command line no less.

1 General

Data Cleaner’s cookbook explicates dataframe processing by laundering through CSV/TSV and using command-line fu. Fz mentions various tools including CSV munger xsv.
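As a minimal sketch of what "laundering through CSV/TSV" can mean in practice, the following re-serializes CSV text as TSV and leaves the quoting rules to Python's standard `csv` module. The function name `csv_to_tsv` is hypothetical and is not taken from the cookbook or from xsv.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Re-serialize CSV text as TSV (hypothetical helper, not a cookbook recipe)."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    # A tab delimiter means embedded commas no longer need quoting.
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


if __name__ == "__main__":
    sample = 'name,note\nada,"likes commas, apparently"\n'
    print(csv_to_tsv(sample), end="")
```

Once the data is tab-separated, tools that split on a single delimiter can process it without a CSV-aware parser, provided the fields contain no tabs or newlines.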

2 LLM mode

llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs

The idea with these tools is to support working with language model prompts using Unix pipes.

llm—a command-line tool for sending prompts to the OpenAI APIs, outputting the response and logging the results to a SQLite database. I introduced that a few weeks ago.

ttok—a tool for counting and truncating text based on tokens

strip-tags—a tool for stripping HTML tags from text, and optionally outputting a subset of the page based on CSS selectors
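The pipeline idea behind these tools (strip the markup, trim the text to a budget, then send it to a model) can be sketched with the Python standard library alone. This is an illustration under loud assumptions: `strip_html` and `truncate_words` are hypothetical names, whitespace-separated words stand in for model tokens (ttok counts real tokenizer tokens, which differ), and none of this reflects how the actual tools are implemented.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def truncate_words(text, limit):
    """Keep the first `limit` whitespace-separated words (a crude token stand-in)."""
    return " ".join(text.split()[:limit])


if __name__ == "__main__":
    page = "<p>Hello <b>world</b>, this is a page.</p><script>var x = 1;</script>"
    print(truncate_words(strip_html(page), 3))
```

Composing the two functions mirrors the pipe-based workflow: each stage takes text in and hands text on.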

3 Munging

Here are some popular tools, starting with classics and moving on to workalikes.

sed and awk work in practice. Their great virtues are ubiquity, speed, and battle-testedness. Nonetheless I never write them by hand. I used to spend half my time fighting with the syntax; these days, I simply get a language model to write the scripts for me. Even that is not so easy, because escaping regexes for use in the shell can defeat even frontier models. This suggests to me that these tools are not actually that great for humans to use, and I should be looking for something else.
num-utils:

The num-utils are a set of programs for dealing with numbers from the Unix command line.
sd, “Intuitive find & replace CLI (sed alternative)”
angle-grinder: Slice and dice logs on the command line

Angle-grinder allows you to parse, aggregate, sum, average, min/max, percentile, and sort your data. You can see it, live-updating, in your terminal. Angle-grinder is designed for when you don’t have your data in graphite/honeycomb/kibana/sumologic/splunk/etc. but still want to do sophisticated analytics.
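Returning to the sed and awk complaint at the top of this list: one way to sidestep shell escaping is to keep the regex inside a scripting language. The sketch below gives rough Python stand-ins for a global sed substitution and an awk column sum. The function names are hypothetical, and the behaviour only approximates the originals (for example, awk silently treats non-numeric fields as zero, while this raises an error).

```python
import re


def substitute(lines, pattern, replacement):
    """Roughly `sed -E 's/pattern/replacement/g'`: replace every match on each line."""
    regex = re.compile(pattern)
    return [regex.sub(replacement, line) for line in lines]


def sum_column(lines, index):
    """Roughly `awk '{ s += $N } END { print s }'`, but with a 0-based column index.

    Unlike awk, a non-numeric field raises ValueError instead of counting as 0.
    Lines with too few fields are skipped.
    """
    total = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) > index:
            total += float(fields[index])
    return total


if __name__ == "__main__":
    log = ["host.a 1.5", "host.b 2.5"]
    # The pattern is a raw Python string, so there is no extra shell quoting layer.
    print(substitute(log, r"host\.(\w)", r"node-\1"))
    print(sum_column(log, 1))
```

The trade-off is verbosity: this is longer than the one-liners it replaces, but only one quoting convention applies to the pattern.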

3.1 Visidata

VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.

Supports stupendous numbers of formats, including various databases.

3.2 `jq` and `jid`

jq allows parsing of JSON instead of TSV. It claims to be “like sed for JSON data — you can use it to slice, filter, map, and transform structured data as easily as sed, awk, grep, and friends manipulate text.”
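To make "slice, filter, map" concrete, here is a minimal Python sketch of the kind of transformation a short jq filter performs. The record fields (`name`, `stars`) and the function name are invented for illustration, and the jq expression in the comment is only meant as a rough equivalent.

```python
import json


def names_with_min_stars(json_text, min_stars):
    """Return the `name` of each record whose `stars` is at least `min_stars`.

    Roughly comparable to the jq filter: .[] | select(.stars >= 100) | .name
    (field names here are hypothetical sample data).
    """
    records = json.loads(json_text)
    return [r["name"] for r in records if r.get("stars", 0) >= min_stars]


if __name__ == "__main__":
    data = '[{"name": "a", "stars": 5}, {"name": "b", "stars": 500}]'
    print(names_with_min_stars(data, 100))
```

The jq version fits on one line, which is most of its appeal for ad hoc work at the prompt.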

3.3 xidel and other HTML parsers

xidel: Xidel is a command-line tool to download and extract data from HTML/XML pages using CSS selectors, XPath/XQuery 3.0, as well as querying JSON files or APIs (e.g. REST) using JSONiq.

See also:

tq: Perform a lookup by CSS selector on an HTML input.
hq: Powerful command-line tool for handling HTML data.

3.4 `yq`

yq aspires to be “the jq or sed of yaml files.” YAML 1.2 is designed as a superset of JSON, so I guess this handles JSON as well?

3.5 PowerShell

Structured text processing feels like it should be a strong suit of PowerShell, the structured-data-processing shell, and indeed, PowerShell supports JSON parsing.

3.6 Nushell

nushell claims to subsume most of the others into a full shell environment which is also a data-processing environment/functional programming language. It has interesting features such as treating filesystem subfolders and nested data sets in the same paradigm and support for many data types natively.

Nu draws inspiration from projects like PowerShell, functional programming languages, and modern CLI tools. Rather than thinking of files and services as raw streams of text, Nu treats each input as structured data. For example, when you list the contents of a directory, what you get back is a table of rows representing items in that directory. These values can be piped through various commands, forming a ‘pipeline’.
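The "directory listing as a table of rows" idea can be imitated in a few lines of Python, which may help convey what the quoted description means. This is a sketch of the concept only: `list_dir`, `where`, and `sort_by` are hypothetical helper names chosen for this example and say nothing about how Nu itself is implemented.

```python
import os


def list_dir(path="."):
    """Return one dict per directory entry, i.e. a small table of rows."""
    rows = []
    with os.scandir(path) as entries:
        for entry in entries:
            info = entry.stat(follow_symlinks=False)
            rows.append({
                "name": entry.name,
                # Anything that is not a real directory (including symlinks) is labelled "file".
                "type": "dir" if entry.is_dir(follow_symlinks=False) else "file",
                "size": info.st_size,
            })
    return rows


def where(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def sort_by(rows, key):
    """Sort rows by the value in the named column."""
    return sorted(rows, key=lambda row: row[key])


if __name__ == "__main__":
    files = where(list_dir("."), lambda row: row["type"] == "file")
    for row in sort_by(files, "size"):
        print(row["size"], row["name"])
```

Because every stage passes a list of dicts rather than raw text, no stage has to re-parse the output of the previous one.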

3.7 sqlite-utils

Julia Evans points out that sqlite-utils magically converts JSON to SQLite.
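For a sense of what such a conversion involves, here is a minimal sketch using only Python's built-in `json` and `sqlite3` modules. It does not use or reproduce the sqlite-utils API. The function name `json_to_table` is hypothetical, and the sketch assumes a JSON array of flat objects with scalar values.

```python
import json
import sqlite3


def _quote(identifier):
    """Quote an SQL identifier by doubling any embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def json_to_table(conn, table, json_text):
    """Insert a JSON array of flat objects into `table`; return the row count.

    Assumes scalar values only. Columns are the union of all keys, in
    first-seen order, and missing keys become NULL.
    """
    records = json.loads(json_text)
    if not records:
        return 0
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    column_list = ", ".join(_quote(c) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {_quote(table)} ({column_list})")
    conn.executemany(
        f"INSERT INTO {_quote(table)} ({column_list}) VALUES ({placeholders})",
        [tuple(record.get(c) for c in columns) for record in records],
    )
    conn.commit()
    return len(records)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    json_to_table(conn, "people", '[{"name": "ada", "age": 36}, {"name": "bob"}]')
    print(conn.execute("SELECT name, age FROM people").fetchall())
```

Once the rows are in SQLite, ordinary SQL handles the filtering and aggregation.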

3.8 `pxi`

pxi has a cute nerdy introduction; it is a fast way of executing tiny JavaScript snippets over streaming data. Sometimes just examining the data is enough; one can use pretty-print-json for that.

3.9 `d2d`

For quick data conversion with some JavaScript processing in the middle, the open source web app d2d is useful.

3.10 `fx`

fx is another JSON processor whose remarkable feature is a clickable interactive mode.

3.11 `tab`

Consider, perhaps, tab … “a modern text processing language that’s similar to awk in spirit,” but different in implementation and syntax. Highlights:

Designed for concise aggregation and manipulation of tabular text data…
Supports even very complex queries. (Also includes a good set of mathematical operations.)
Statically typed, type-inferred, declarative.

4 Searching

vgrep is a command-line text search that opens up matches in a text editor.
ripgrep:

ripgrep is a line-oriented search tool that recursively searches your current directory for a regex pattern. By default, ripgrep respects your .gitignore and automatically skips hidden files/directories and binary files.

It also can search compressed files using the -z option, plus other useful stuff.
fzf, a command-line “fuzzy finder” that a few people suggested to me. I think this is the hippest option. It includes thoughtful features such as ripgrep integration; see “fzf: Ripgrep integration, a walkthrough”.
ack (Beyond grep, beyondgrep/ack3): ack is a grep-like search tool optimized for source code.
The Silver Searcher/ag. I think this is no longer so active.
Gron, a tool for making JSON greppable.

That is a lengthy list. See Feature comparison of ack, ag, git-grep, grep and ripgrep for some navigation assistance.
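The Gron entry above is easier to appreciate with an example of what "making JSON greppable" means: each leaf value gets its own line containing its full path, so a plain line-oriented search can find it. The sketch below is loosely modelled on that idea. The exact output format is an assumption, not a reproduction of gron, and keys that are not valid identifiers are not specially quoted here.

```python
import json


def flatten(value, path="json"):
    """Yield one assignment-style line per node of a parsed JSON value."""
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key, child in value.items():
            # Assumes keys are simple identifiers; no bracket quoting is attempted.
            yield from flatten(child, f"{path}.{key}")
    elif isinstance(value, list):
        yield f"{path} = [];"
        for i, child in enumerate(value):
            yield from flatten(child, f"{path}[{i}]")
    else:
        yield f"{path} = {json.dumps(value)};"


if __name__ == "__main__":
    doc = json.loads('{"user": {"name": "ada", "tags": ["x", "y"]}}')
    for line in flatten(doc):
        print(line)
```

Searching the flattened lines for `name` then shows both the match and where it sits in the structure.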

5 Incoming

HTTPie, a curl-adjacent command-line HTTP client for testing and debugging web APIs.
dyff: diff for YAML.
csvkit: if you spend a lot of time working with comma-separated values, accept no substitutes.
miller, “Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON”
Datamash: “GNU datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files.”
xq - Like jq, but for XML and XPath.
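To illustrate what "name-indexed data" means in the Miller description above, here is a small Python sketch that groups CSV rows by one named field and averages another. The function name `mean_by` and the sample field names are hypothetical, and this is not a rendering of any particular Miller or datamash command.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by(csv_text, group_field, value_field):
    """Group rows by `group_field` and return the mean of `value_field` per group.

    Fields are addressed by header name rather than by column position.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[group_field]].append(float(row[value_field]))
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    sample = "shape,size\ncircle,1\nsquare,4\ncircle,3\n"
    print(mean_by(sample, "shape", "size"))
```

Addressing columns by name means the query keeps working if the column order changes.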