Text data processing

Getting data into a text-like format gets you a whole world of weird tools for managing and processing it.

Data Cleaner’s cookbook explicates dataframe processing by laundering data through CSV/TSV and applying command-line fu. Fz mentions various tools, including the CSV munger xsv.
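To give a flavour of that command-line fu: plain TSV can already be aggregated with nothing but standard POSIX tools. A minimal sketch (the input data here is made up):

```shell
# Sum the second column of a TSV, grouped by the first column.
printf 'a\t1\nb\t2\na\t3\n' \
  | awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' \
  | sort
# Output:
#   a	4
#   b	2
```

Tools like xsv wrap this kind of pipeline up with proper CSV parsing, headers and quoting handled for you.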

One of the popular ones is jq, which lets you parse JSON rather than TSV. It claims to be “like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.” pxi has a cute, nerdy introduction; it is a fast way of executing tiny javascript snippets over streaming data. Sometimes just examining the data is enough; one can use pretty-print-json for that.
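For a feel of jq’s slice-filter-map style, a tiny sketch (assumes jq is installed; the input JSON is invented for illustration):

```shell
# Sum an array field, and extract a nested string value.
echo '{"name": "demo", "scores": [1, 2, 3]}' | jq '.scores | add'   # prints 6
echo '{"name": "demo", "scores": [1, 2, 3]}' | jq -r '.name'        # prints demo
```

The `-r` flag emits raw strings instead of JSON-quoted ones, which is handy when feeding the result back into other shell tools.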

For a quick bit of data conversion with some javascript processing in the middle, the open-source web app d2d is useful.

yq aspires to be “the jq or sed of yaml files.” YAML is a superset of JSON, so I guess this gets you everything?

fx is another JSON processor whose most remarkable feature is a clickable interactive mode.

Consider also, perhaps, tab … a modern text processing language that’s similar to awk in spirit. (But not similar in implementation or syntax.) Highlights:

  • Designed for concise one-liner aggregation and manipulation of tabular text data…
  • Feature-rich enough to support even very complex queries. (Also includes a good set of mathematical operations.)
  • Statically typed, type-inferred, declarative.

This also seems like it should be a strong suit of PowerShell, the structured-data-processing shell, and indeed it supports JSON parsing (via ConvertFrom-Json, for example); I have not explored further, however.

confbase-scheme is a tool to infer schemas for semi-structured data.