Text data processing


Once data is in a text-like format, a whole world of weird tools opens up for managing and processing it.

General

Data Cleaner’s cookbook explicates dataframe processing by laundering the data through CSV/TSV and applying command-line fu. Fz mentions various tools, including the CSV munger xsv.

Munging

Here are some popular tools.

jq

jq lets one process JSON instead of TSV. It claims to be “like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.”
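
To make "slice and filter and map" concrete, here is a rough Python analogue of those three operations over a JSON document. It is only a sketch of the idea, not how jq works or what its syntax looks like; the sample records and field names are invented for illustration.

```python
import json

# Invented sample data; the field names are assumptions for illustration.
raw = """
[
  {"name": "ada", "lang": "python", "stars": 12},
  {"name": "bob", "lang": "rust", "stars": 40},
  {"name": "cy", "lang": "python", "stars": 7}
]
"""

records = json.loads(raw)

# Filter: keep only the records whose "lang" field is "python".
python_only = [r for r in records if r["lang"] == "python"]

# Map: project each record down to the fields we care about.
summary = [{"name": r["name"], "stars": r["stars"]} for r in python_only]

# Slice: take just the first record of the result.
first = summary[:1]

print(json.dumps(summary))
```

The appeal of a dedicated tool is that each of these steps collapses into a short expression on the command line rather than a script.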

yq

yq aspires to be “the jq or sed of yaml files.” YAML (since version 1.2) is designed as a superset of JSON, so I guess this gets you JSON processing as well?

PowerShell

This seems like it should also be a strong suit of PowerShell, a shell built around structured data, and indeed PowerShell does support JSON parsing, for example via the ConvertFrom-Json cmdlet.

Nushell

nushell claims to subsume most of the others into a full shell environment that is also a data-processing environment and functional programming language. It has interesting features, such as treating filesystem subfolders and nested data sets in the same paradigm, and native support for many data types.

Nu draws inspiration from projects like PowerShell, functional programming languages, and modern CLI tools. Rather than thinking of files and services as raw streams of text, Nu looks at each input as something with structure. For example, when you list the contents of a directory, what you get back is a table of rows, where each row represents an item in that directory. These values can be piped through a series of steps, in a series of commands called a ‘pipeline’.
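
The "directory listing as a table of rows" idea can be imitated in Python to show what the paradigm buys you. This is a loose sketch, not Nushell code: the helper names (`list_rows`, `where`, `sort_by`, `largest_files`) and the chosen columns are hypothetical, picked only for this illustration.

```python
import tempfile
from pathlib import Path

def list_rows(directory):
    """Return one dict (a "row") per directory entry, forming a small table."""
    rows = []
    for entry in Path(directory).iterdir():
        rows.append({
            "name": entry.name,
            "type": "dir" if entry.is_dir() else "file",
            "size": entry.stat().st_size,
        })
    return rows

def where(rows, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]

def sort_by(rows, column):
    """Order rows by the value in the given column."""
    return sorted(rows, key=lambda row: row[column])

def largest_files(directory, min_size=0):
    """A three-step 'pipeline': list, then filter, then sort."""
    rows = list_rows(directory)
    rows = where(rows, lambda r: r["type"] == "file" and r["size"] >= min_size)
    return sort_by(rows, "size")

# Demo on a throwaway directory so the result does not depend on the machine.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "small.txt").write_text("hi")
    (Path(tmp) / "big.txt").write_text("hello world")
    (Path(tmp) / "sub").mkdir()
    demo = [row["name"] for row in largest_files(tmp)]

print(demo)
```

Because every step takes and returns rows with named columns, the steps compose without any text parsing in between, which is the point of the structured-shell approach.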

pxi

pxi has a cute, nerdy introduction; it is a fast way of executing tiny JavaScript snippets over streaming data. Sometimes just examining the data is enough; one can use pretty-print-json for that.
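
The general pattern of running a tiny snippet over streaming data looks roughly like the following Python sketch. It is not pxi's implementation or interface; it assumes, purely for illustration, that the stream carries one JSON value per line, and the function name `transform_lines` and the sample input are invented.

```python
import io
import json

def transform_lines(stream, fn):
    """Lazily apply fn to each JSON value read from a line-delimited stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield json.dumps(fn(json.loads(line)))

# Invented sample input standing in for standard input.
source = io.StringIO('{"time": 1}\n{"time": 2}\n\n{"time": 3}\n')

# The "tiny snippet": convert seconds to milliseconds in each record.
result = list(transform_lines(source, lambda obj: {"time": obj["time"] * 1000}))

print("\n".join(result))
```

Since the function is a generator, records are handled one at a time and the whole input never has to fit in memory.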

d2d

For a quick bit of data conversion with some JavaScript processing in the middle, the open-source web app d2d is useful.

fx

fx is another JSON processor whose remarkable feature is a clickable interactive mode.

awk

A classic Unix tool for text data processing. It’s fine. Ubiquitous. But not intuitive or luxurious like a modern programming language. It does simple delimited text such as CSV and TSV very well, but I would be afraid of more structured formats such as JSON.
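
For comparison, here is the kind of per-line, per-field aggregation that awk is typically used for, written as a Python sketch. The function name `sum_by_key`, the column layout, and the sample data are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

def sum_by_key(lines, key_col=0, value_col=1, delimiter=","):
    """Sum the numeric value column, grouped by the key column."""
    totals = defaultdict(float)
    for fields in csv.reader(lines, delimiter=delimiter):
        if not fields:
            continue  # csv.reader yields an empty list for blank lines
        totals[fields[key_col]] += float(fields[value_col])
    return dict(totals)

# Invented sample input: a key column and a numeric column.
sample = io.StringIO("apples,3\npears,2.5\napples,4\n")
totals = sum_by_key(sample)

print(totals)
```

The Python version is longer than an awk one-liner would be, but the `csv` module copes with quoted fields, which is where naive field splitting tends to go wrong.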

tab

Consider also, perhaps, tab … a modern text processing language that’s similar to awk in spirit. (But not similar in implementation or syntax.) Highlights:

  • Designed for concise one-liner aggregation and manipulation of tabular text data…
  • Feature-rich enough to support even very complex queries. (Also includes a good set of mathematical operations.)
  • Statically typed, type-inferred, declarative.

Searching

vgrep is a command-line text search tool that opens up matches in a text editor.

ripgrep

ripgrep is a line-oriented search tool that recursively searches your current directory for a regex pattern. By default, ripgrep will respect your .gitignore and automatically skip hidden files/directories and binary files.

It can also search compressed files using the -z option.
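
To illustrate what a recursive, line-oriented search with those defaults involves, here is a small Python sketch. It is a toy, not ripgrep's algorithm: it skips hidden paths and files that look binary, but it does not read .gitignore or handle compressed files, and the helper names and the "contains a NUL byte" binary test are assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path

def is_hidden(path, root):
    """True if any path component below root starts with a dot."""
    return any(part.startswith(".") for part in path.relative_to(root).parts)

def looks_binary(path, sample_size=1024):
    """Heuristic (an assumption here): a NUL byte near the start means binary."""
    with open(path, "rb") as handle:
        return b"\0" in handle.read(sample_size)

def search(root, pattern):
    """Yield (relative path, line number, line) for every matching line."""
    regex = re.compile(pattern)
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or is_hidden(path, root) or looks_binary(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield str(path.relative_to(root)), number, line.rstrip("\n")

# Demo on a throwaway directory with a text file, a hidden file and a binary file.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "notes.txt").write_text("alpha\nbeta gamma\n")
    (base / ".secret").write_text("beta hidden\n")
    (base / "blob.bin").write_bytes(b"beta\0\1\2")
    matches = list(search(base, r"be\w+"))

print(matches)
```

Only the match in `notes.txt` is reported; the hidden file and the binary file are skipped even though both contain the pattern.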

TBD: ack.