Tabular data processing in Python
CSV to Data lake
November 30, 2020 — December 9, 2024
Using tabular data, i.e. the stuff you generally store in spreadsheets and relational databases, and maybe even process as structured text.
In R this means data frames. In Julia this means DataFrames. In Python this means pandas DataFrames, xarrays, polars DataFrames, or whatever else. There are many variations in API design, backing technology and semantics, but there are two main clusters. The more common one is pandas, which I do not like, and the less common one is polars, which is better suited for the workflows I have had and, I suspect, for most people who aren’t doing tabular time-series analytics.
1 Pandas-flavour
pandas is more-or-less the de facto standard dataframe class for Python. pandas plus statsmodels looks a lot like R. On the minus side, this combination lacks some language features of R (e.g. regression formulae are not first-class language features). On the plus side, it lacks some language misfeatures of R (the object model being a box of turds, copy-by-value semantics and all those other annoyances).
I am not a net fan of pandas, personally. The engineering behind it is impressive, but the workflow ends up not fitting my actual problems particularly well. Fun to look at but not to touch. This might be about my workflow being idiosyncratic, or it might be because the original author had an idiosyncratic workflow, and he needed weird features that the rest of us trip over. You tell me.
In comparison to R, one crucial weakness is that R has a rich ecosystem of tools for dataframes. Python is… poorer.
Also (and I do not know if this was the true history or not), when pandas was designed, some of R’s design choices were deemed mistakes, only for it to turn out that in practice R’s way was better than the janky new method, or that the cool hack ended up being awkward in Python syntax. Chief among these is the obligatory labelling of rows in the table: I spend a lot of time fighting pandas’ insistence that everything be labelled, then interpreting those arbitrary names as meaningful, then running into friction with grouping and merging.
This makes sense, you might think, if everything has a natural label like a timestamp. In practice, for me it does not make sense. It turns out that my time series are unlike the ones pandas was designed for, and so I end up fighting the library a lot.
Anyway, pandas is still usable and useful, especially because of the ecosystem. Many nice things are built on pandas, such as …
statsmodels, which is more-or-less a minimalist subset of standard R, but in Python. It implements:
- Linear regression models
- Generalized linear models
- Discrete choice models
- Robust linear models
- Many models and functions for time series analysis
- Nonparametric estimators
- A wide range of statistical tests
- etc
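To give a flavour of what statsmodels wraps, here is the ordinary least squares fit done by hand with numpy; the data is made up for illustration, and statsmodels’ own OLS class adds the R-like interface, standard errors, and diagnostics on top of essentially this computation.

```python
import numpy as np

# Ordinary least squares "by hand" -- the core fit that statsmodels'
# OLS wraps in a friendlier, R-like interface. Synthetic data:
# y = 1 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly [1.0, 2.0]: intercept and slope
```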
patsy implements a formula language for pandas. Patsy does lots of things, but most importantly, it

- builds design matrices (i.e. it knows how to represent z ~ x^2 + x + y^3 as a matrix, which only sounds trivial if you haven’t tried it)
- statefully preconditions data (e.g. constructs data transforms that will correctly normalize the test set as well as the training data)
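To make the design-matrix point concrete, here is a hand-rolled numpy version of what a formula like z ~ x^2 + x + y^3 expands to; patsy’s dmatrices builds this for you (intercept included) straight from the formula string.

```python
import numpy as np

# What a formula like  z ~ x^2 + x + y^3  turns into: a numeric
# design matrix, one row per observation, one column per term.
# patsy's dmatrices() constructs this directly from the formula
# string; here we spell it out by hand with toy data.
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.0, 1.5])

design = np.column_stack([
    np.ones_like(x),  # intercept column
    x ** 2,
    x,
    y ** 3,
])
print(design.shape)  # (3, 4)
```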
Pandas AI: The Generative AI Python Library makes pandas’s occasionally-abstruse query language a bit more natural.
import pandas as pd
from pandasai import PandasAI
# Sample DataFrame
df = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
"happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})
# Instantiate a LLM
from pandasai.llm.openai import OpenAI
llm = OpenAI()
pandas_ai = PandasAI(llm)
pandas_ai.run(df, prompt='Which are the 5 happiest countries?')
pandera implements type sanity and validation.
1.1 Other backends supporting pandas API
The pandas API is popular; there are a few tools which aim to accelerate calculations by providing backends for it based on alternative data formats or parallelism needs.
- cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF also provides a pandas-like API that will be familiar to data engineers and data scientists, so they can use it to accelerate their workflows without going into the details of CUDA programming.
- Modin promises to scale your pandas workflow by changing a single line of code. The modin.pandas DataFrame is an extremely lightweight parallel DataFrame: Modin transparently distributes the data and computation, so all you need to do is continue using the pandas API as you were before installing Modin. It claims speed-ups of up to 4x on a laptop with 4 physical cores.
- Koalas: pandas API on Apache Spark seems to be Modin but for Spark.
2 Polars-flavour
Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as the memory model.
- Lazy | eager execution
- Multi-threaded
- SIMD
- Query optimization
- Powerful expression API
- Hybrid Streaming (larger than RAM datasets)
- Rust | Python | NodeJS | …
It does not invent a Python-specific data format but instead leverages Apache Arrow. It looks like pandas in many ways, but is not 100% compatible; see Polars for pandas users. Many of the divergences seem to me to be good, actually:
pandas gives a label to each row with an index. Polars does not use an index and each row is indexed by its integer position in the table.
Polars aims to have predictable results and readable queries, as such we think an index does not help us reach that objective. We believe the semantics of a query should not change by the state of an index or a reset_index call.
My word yes. Thank you. pandas has idiosyncratic and unintuitive behaviour around indices that I have never learned to enjoy (SettingWithCopyWarning etc.).
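For the uninitiated, the classic SettingWithCopyWarning trap looks like this: chained indexing may assign into a throwaway copy, and pandas can only warn you about it after the fact.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained indexing: the boolean filter returns a copy, so this
# assignment goes nowhere -- pandas emits SettingWithCopyWarning
# (or, under copy-on-write, silently leaves df alone).
sub = df[df["a"] > 1]
sub["b"] = 0          # df itself is untouched

# The unambiguous form: a single .loc call on the original frame.
df.loc[df["a"] > 1, "b"] = 0
print(df["b"].tolist())  # [4, 0, 0]
```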
OTOH, this divergence from the Python default means that the ecosystem is thinner than the pandas one. Then again, some stuff looks easier than in pandas, so maybe it is not too bad in practice.
The biggest stress point for people with this ecosystem is likely to be “is it hard to generate plots?” Visualization support actually seems great.
Also we can convert between polars and pandas if we need to, so in principle everything works with both. Maybe.
3 Conversion
There is a generic DataFrame standard, which looks like it works for most simple dataframes (i.e. anything which can be encoded in a CSV and does not have too many exotic data types) and often works seamlessly without even copying the data.
Python users today have a number of great choices for dataframe libraries. From Pandas and cuDF to Vaex, Koalas, Modin, Ibis, and more. Combining multiple types of dataframes in a larger application or analysis workflow, or developing a library which uses dataframes as a data structure, presents a challenge though. Those libraries all have different APIs, and there is no standard way of converting one type of dataframe into another.

The purpose of this dataframe protocol (__dataframe__) is to enable data interchange, i.e. a way to convert one type of dataframe into another type (for example, convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into a Vaex dataframe).
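Here is the protocol round-tripped entirely within pandas, which both exposes __dataframe__ and can consume interchange objects via pandas.api.interchange.from_dataframe (pandas >= 1.5); with another library, the producing side would be, say, a Vaex or cuDF frame instead.

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

# Any library exposing __dataframe__() can be rebuilt as a pandas
# frame via from_dataframe(); here pandas plays both roles.
df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
interchange_obj = df.__dataframe__()   # the protocol object
df2 = from_dataframe(interchange_obj)  # back to a pandas DataFrame
print(df2["x"].tolist())  # [1, 2, 3]
```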
Individual libraries are also independently concerned with integration, e.g. Pandas can talk Apache Arrow and pandas 2.0 supports it as a native backend; although if I were using Apache Arrow as a backend, I would just switch to polars and have less aggravating semantics.
4 Bareback HDF5
Sometimes plain old HDF5 is the right tool for the job. It is fast, configurable for weird file systems, well-tested and has a pretty reasonable API. On the other hand, there is not much mathematical support in the HDF5 ecosystem. If you just want to store a very large chunk of tabular data, though, it might be worth it.
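A sketch of the bare-HDF5 workflow via h5py (the usual Python binding; names and sizes here are made up): chunked, compressed storage, and slicing that reads only the chunks it needs rather than the whole dataset.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a large-ish column chunked and gzip-compressed...
path = os.path.join(tempfile.mkdtemp(), "table.h5")
with h5py.File(path, "w") as f:
    f.create_dataset(
        "gdp",
        data=np.arange(1_000_000, dtype="int64"),
        chunks=(65_536,),
        compression="gzip",
    )

# ...then read back a slice without loading the whole dataset.
with h5py.File(path, "r") as f:
    middle = f["gdp"][500_000:500_003]
print(middle.tolist())  # [500000, 500001, 500002]
```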