Tabular data processing in python

CSV to Data lake

November 30, 2020 — May 29, 2023


Learning for tabular data, i.e. the stuff you generally store in spreadsheets and relational databases, and maybe even process as structured text.

In R this means DataFrames. In Julia this means DataFrames. In Python this means pandas DataFrames, xarrays, Polars DataFrames, or whatever else.

1 Pandas-flavour

  • pandas is more-or-less a dataframe class for python. pandas plus statsmodels look a lot like R. On the minus side, this combination lacks some language features of R (e.g. regression formulae are not first-class language features). On the plus side, it lacks some language misfeatures of R (the object model being a box of turds, and copy-by-value semantics and all those other annoyances.)

I am not a huge fan of pandas, personally. The engineering behind it is impressive, but the workflow ends up not fitting my actual problems particularly well. Fun to look at but not to touch. This might be about my workflow being idiosyncratic, or it might be because the original author had an idiosyncratic workflow, and it is he who needed weird features that the rest of us trip over. YMMV.

In comparison to R, one crucial weakness is that R has a rich ecosystem of tools for dataframes; Python's is a bit thinner.

Also (and I do not know if this was the true process or not) when pandas was designed, Wes McKinney made a bunch of design choices differently than R did, possibly thinking to himself “Why did R not do it this way which is clearly better”, only to discover that in practice R’s way was better, or that the cool hack ended up being awkward in python syntax. Chief among these is the obligatory indexing of rows in the table; I spend a lot of time fighting pandas’ insistence on wanting everything to be named, and then interpreting those arbitrary names as meaningful.
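A toy illustration of that index friction (names and values here are made up for the example): pandas aligns arithmetic on index labels rather than positions, which silently produces NaNs when labels don't match, and the usual escape hatch is to throw the labels away with `reset_index(drop=True)`.

```python
import pandas as pd

# Two series with overlapping but not identical index labels
s = pd.Series([1, 2, 3], index=["a", "b", "c"])
other = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns on labels, not positions:
# "a" and "d" have no partner, so they come out NaN
result = s + other

# Escape hatch: discard the labels and add positionally
flat = s.reset_index(drop=True) + other.reset_index(drop=True)
```

Whether label alignment is a feature or a trap depends entirely on whether your row labels actually mean anything.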

Anyway, this is still very usable and useful.

Lots of nice things are built on pandas, such as …

statsmodels, which is more-or-less a minimalist subset of standard R, but in Python. It implements:

  • Linear regression models
  • Generalized linear models
  • Discrete choice models
  • Robust linear models
  • Many models and functions for time series analysis
  • Nonparametric estimators
  • A wide range of statistical tests
  • etc
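A minimal sketch of the R-flavoured workflow, using statsmodels' formula interface on synthetic data (the data and coefficients here are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y is linear in x plus a little noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.linspace(0, 10, 100)})
df["y"] = 2.0 * df["x"] + 1.0 + rng.normal(scale=0.1, size=100)

# R-style formula interface (parsed by patsy under the hood)
model = smf.ols("y ~ x", data=df).fit()
print(model.params)  # intercept near 1.0, slope near 2.0
```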

patsy implements a formula language for pandas. Patsy does lots of things, but most importantly, it

  • builds design matrices (i.e. it knows how to represent z~x^2+x+y^3 as a matrix, which only sounds trivial if you haven’t tried it)
  • statefully preconditions data (e.g. constructs data transforms that will correctly normalise the test set as well as the training data.)
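A small sketch of the design-matrix construction (the data is invented; note that in patsy, arithmetic like squaring must be wrapped in `I(...)` so that `**` is Python power rather than formula syntax):

```python
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({
    "z": [1.0, 2.0, 3.0, 4.0],
    "x": [0.5, 1.5, 2.5, 3.5],
    "y": [1.0, 0.0, 1.0, 0.0],
})

# Build response and design matrices from a formula;
# patsy adds the intercept column for us
outcome, predictors = dmatrices("z ~ I(x**2) + x + y", df)
print(predictors.design_info.column_names)
```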

Pandas AI: The Generative AI Python Library makes pandas’s occasionally-abstruse query language a bit more natural.

import pandas as pd
from pandasai import PandasAI

# Sample DataFrame
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

# Instantiate an LLM
from pandasai.llm.openai import OpenAI
llm = OpenAI()

pandas_ai = PandasAI(llm)
pandas_ai.run(df, prompt='Which are the 5 happiest countries?')

pandera implements type sanity and validation.

1.1 Other backends supporting pandas API

The pandas API is popular; there are a few tools which aim to accelerate calculations by providing backends for it based on alternative data formats or parallelism needs.

  • Rapids AI cuDF

    cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF also provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.

  • Modin

    Scale your pandas workflow by changing a single line of code — The modin.pandas DataFrame is an extremely light-weight parallel DataFrame. Modin transparently distributes the data and computation so that all you need to do is continue using the pandas API as you were before installing Modin. Unlike other parallel DataFrame systems, Modin is an extremely light-weight, robust DataFrame. Because it is so light-weight, Modin provides speed-ups of up to 4x on a laptop with 4 physical cores.

  • Koalas: pandas API on Apache Spark seems to be Modin but for Spark. (Koalas has since been upstreamed into Spark itself as pyspark.pandas.)

  • Python dataframe interchange protocol

    Python users today have a number of great choices for dataframe libraries. From Pandas and cuDF to Vaex, Koalas, Modin, Ibis, and more. Combining multiple types of dataframes in a larger application or analysis workflow, or developing a library which uses dataframes as a data structure, presents a challenge though. Those libraries all have different APIs, and there is no standard way of converting one type of dataframe into another.
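In practice the protocol looks something like this (sketched here round-tripping through pandas itself, which implements both sides as of pandas 1.5): a producer exposes `__dataframe__()`, and a consumer such as `pandas.api.interchange.from_dataframe` can rebuild its own native dataframe from any object supporting it.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Producer side: any compliant library exposes __dataframe__()
interchange_obj = df.__dataframe__()
print(interchange_obj.num_columns())  # 2

# Consumer side: rebuild a pandas DataFrame from any
# protocol-supporting dataframe (cuDF, Vaex, Polars, ...)
round_tripped = pd.api.interchange.from_dataframe(df)
```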

2 Polars-flavour

Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as the memory model.

  • Lazy | eager execution
  • Multi-threaded
  • SIMD
  • Query optimization
  • Powerful expression API
  • Hybrid Streaming (larger than RAM datasets)
  • Rust | Python | NodeJS | …

Polars does not invent a Python-specific data format but instead leverages Apache Arrow. It looks like pandas in many ways, but is not 100% compatible, see Polars for pandas users. This means that the ecosystem is thinner again than the pandas ecosystem; on the other hand, some stuff looks easier than pandas, so maybe it is not too bad in practice.

3 Other