Text data processing
December 16, 2014 — October 4, 2023
Getting data in a text-like format is an entry pass to a whole world of weird tools to manage and process it, from the command-line no less
1 General
Data Cleaner’s cookbook explicates dataframe processing by laundering through CSV/TSV and using command-line fu. Fz mentions various tools including CSV munger xsv.
2 Munging
Here are some popular tools, starting with classics and moving on to workalikes.
sed and awk (work in practice but I spend half my time fighting with the syntax, and these days I simply get a language model to write the scripts for me)
-
The
num-utils
are a set of programs for dealing with numbers from the Unix command line. angle-grinder: Slice and dice logs on the command line
Angle-grinder allows you to parse, aggregate, sum, average, min/max, percentile, and sort your data. You can see it, live-updating, in your terminal. Angle grinder is designed for when, for whatever reason, you don’t have your data in graphite/honeycomb/kibana/sumologic/splunk/etc. but still want to be able to do sophisticated analytics.
2.1 Visidata
VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.
Supports stupendous numbers of formats, including various databases.
2.2 jq
and jid
jq allows one to parse json instead of TSV. It claims to be “like sed for JSON data — you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.”
See also
2.3 xidel and other HTML parsers
- xidel: Xidel is a command line tool to download and extract data from HTML/XML pages using CSS selectors, XPath/XQuery 3.0, as well as querying JSON files or APIs (e.g. REST) using JSONiq.
See also
2.4 yq
yq
aspires to be “the jq
or sed
of yaml
files.” YAML is a superset of JSON, so I guess this gets you everything?
2.5 PowerShell
This seems like it should also be a strong suit of structured-data-processing shell PowerShell and indeed PowerShell does support JSON parsing for example.
2.6 Nushell
nushell
claims to subsume most of the others into a full shell environment which is also a data-processing environment/functional programming language. It has interesting features such as treating filesystem subfolders and nested data sets in the same paradigm and support for many data types natively.
Nu draws inspiration from projects like PowerShell, functional programming languages, and modern CLI tools. Rather than thinking of files and services as raw streams of text, Nu looks at each input as something with structure. For example, when you list the contents of a directory, what you get back is a table of rows, where each row represents an item in that directory. These values can be piped through a series of steps, in a series of commands called a ‘pipeline’.
2.7 sqlite-tools
Julia Evans points out that sqlite-utils magically converts JSON to sqlite.
2.8 pxi
pxi
has a cute nerdy introduction; It is a fast way of executing tiny JavaScript snippets over streaming data. Sometimes just examining it is enough; one can use pretty-print-json for that.
2.9 d2d
For a quick bit of data conversion with some JavaScript processing in the middle the open source web app d2d is useful.
2.10 fx
fx
is another JSON processor whose remarkable feature is a clickable interactive mode.
2.11 tab
Consider also, perhaps, tab
… “a modern text processing language that’s similar to awk in spirit,” but not similar in implementation or syntax. Highlights:
- Designed for concise one-liner aggregation and manipulation of tabular text data…
- Feature-rich enough to support even very complex queries. (Also includes a good set of mathematical operations.)
- Statically typed, type-inferred, declarative.
3 Searching
vgrep is a command-line text search that opens up matches in a text editor. ripgrep:
ripgrep is a line-oriented search tool that recursively searches your current directory for a regex pattern. By default, ripgrep will respect your .gitignore and automatically skip hidden files/directories and binary files
It also can search compressed files using the -z
option, plus other useful stuff.
- fzf, a command-line “fuzzy finder” that a few people suggested to me.
- The Silver Searcher/ag.
- Beyond grep: ack /beyondgrep/ack3: ack is a grep-like search tool optimized for source code.
- Gron, a tool for making JSON greppable.
That is a lengthy list. See Feature comparison of ack, ag, git-grep, grep and ripgrep for some navigation assistance.
4 Incoming
- HTTPie, a CURL-adjacentish command-line HTTP client for testing and debugging web APIs.
- dyff: diff for yaml.
- csvkit: if you spend a lot of time working with comma-separated values, accept no substitutes.
- miller, “Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON”
- Datamash: “GNU datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files.”
- xq - Like jq, but for XML and XPath.