Statistics and ML in Python



Statistical consultants for Python

Kinda hate R, because, as much as it is a statistical dream, it is a programming nightmare? Is MATLAB too expensive when you try to run it on your cloud server farm and you’re anyway vaguely suspicious that they get kickbacks from the companies that sell RAM because otherwise why does it eat all your memory like that? Love the speed of C++ but have a nagging feeling that you should not need to recompile your code to do exploratory data analysis? Like the idea of Julia, but wary of depending on yet another bloody language, let alone one without the serious corporate backing or long history of the other ones I mentioned?

Python has a different set of warts to those other options. Its statistical library support is narrower than R's, and probably comparable to MATLAB's. It is, however, sorta fast enough in practice and nicer to debug, and it supports diverse general programming tasks well (web servers, academic blogs, neural networks, weird art projects, online open workbooks). It also has interfaces to an impressive numerical ecosystem.

Although it is occasionally rough, it’s ubiquitous and free, free, free so you don’t need to worry about stupid licensing restrictions, and the community is enormous, so it’s pretty easy to find answers to any questions you may have.

But in any case, you don’t need to choose. Python interoperates with all these other languages, and indeed, makes a specialty of gluing stuff together.

Aside: A lot of useful machine-learning-type functionality, which I won’t discuss in detail here, exists in the Python deep learning toolkits such as TensorFlow and Theano; you might want to check those pages too. Graphing is also a whole separate issue, as is optimisation.

In recent times, a few major styles have been ascendant in the statistical python scene.

DataFrame-style ecosystem

pandas plus statsmodels look a lot more like R. On the minus side, they lack some language features of R (e.g. regression formulae are not first class language features). On the plus side, they lack some language features of R (the object model being a box of turds, and copy-by-value semantics and all those other annoyances.)

The real clincher is that the supporting libraries are weaker, e.g. Python does not have the lavish ggplot2 ecosystem.

  • pandas is more-or-less a dataframe class for python. Lots of nice things are built on this, such as …

  • statsmodels, which is more-or-less R, but Python. Implements

    • Linear regression models
    • Generalized linear models
    • Discrete choice models
    • Robust linear models
    • Many models and functions for time series analysis
    • Nonparametric estimators
    • A wide range of statistical tests
    • etc
  • patsy implements a formula language for pandas. This does lots of things, but most importantly, it

    • builds design matrices (i.e. it knows how to represent z~x^2+x+y^3 as a matrix, which only sounds trivial if you haven’t tried it)
    • statefully preconditions data (e.g. constructs data transforms that will correctly normalise the test set as well as the training data.)
  • pandera implements type sanity and validation
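To make the two patsy bullets above a bit more concrete, here is a hand-rolled sketch in plain numpy of the work a formula library has to do for something like z~x^2+x+y^3, reading the powers literally. This is not patsy's API: the helper names `fit_standardiser` and `design_matrix` are invented for this illustration, and the data are synthetic. The part worth noticing is the "stateful" bit, where the centring and scaling constants are learned once from the training data and then reused unchanged on the test data.

```python
import numpy as np


def fit_standardiser(x):
    """Learn centring/scaling constants from training data only."""
    return {"mean": float(np.mean(x)), "std": float(np.std(x))}


def design_matrix(x, y, x_state, y_state):
    """Build columns [1, x^2, x, y^3] from standardised x and y.

    x_state and y_state come from fit_standardiser on the training set,
    so test data is transformed with the *training* constants.
    """
    xs = (np.asarray(x, dtype=float) - x_state["mean"]) / x_state["std"]
    ys = (np.asarray(y, dtype=float) - y_state["mean"]) / y_state["std"]
    intercept = np.ones_like(xs)
    return np.column_stack([intercept, xs**2, xs, ys**3])


# Synthetic predictors (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
x_train, y_train = rng.normal(3.0, 2.0, 200), rng.normal(-1.0, 0.5, 200)
x_test, y_test = rng.normal(3.0, 2.0, 50), rng.normal(-1.0, 0.5, 50)

# Learn the transform state on the training set, then build the matrix.
x_state, y_state = fit_standardiser(x_train), fit_standardiser(y_train)
X_train = design_matrix(x_train, y_train, x_state, y_state)

# Synthetic response with known coefficients, in column order
# [intercept, x^2, x, y^3].
beta_true = np.array([1.0, 0.5, -2.0, 0.25])
z_train = X_train @ beta_true + rng.normal(0.0, 0.1, 200)

# Ordinary least squares on the training design matrix.
beta_hat, *_ = np.linalg.lstsq(X_train, z_train, rcond=None)

# The test matrix reuses the training-set state; nothing is re-fitted.
X_test = design_matrix(x_test, y_test, x_state, y_state)
z_pred = X_test @ beta_hat
```

A formula library does the same bookkeeping for you, and also handles categorical variables and interactions, which is where doing it by hand gets tedious.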

The pandas API is popular; there are a few tools which aim to accelerate calculations by providing backends for it based on alternative data formats or parallelism.

  • Rapids AI cuDF

    cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF also provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.

  • Modin

    Scale your pandas workflow by changing a single line of code — The modin.pandas DataFrame is an extremely light-weight parallel DataFrame. Modin transparently distributes the data and computation so that all you need to do is continue using the pandas API as you were before installing Modin. Unlike other parallel DataFrame systems, Modin is an extremely light-weight, robust DataFrame. Because it is so light-weight, Modin provides speed-ups of up to 4x on a laptop with 4 physical cores.

  • Koalas: pandas API on Apache Spark seems to be Modin but for Spark.

  • Python dataframe interchange protocol (2021-DRAFT documentation)

    Python users today have a number of great choices for dataframe libraries. From Pandas and cuDF to Vaex, Koalas, Modin, Ibis, and more. Combining multiple types of dataframes in a larger application or analysis workflow, or developing a library which uses dataframes as a data structure, presents a challenge though. Those libraries all have different APIs, and there is no standard way of converting one type of dataframe into another.

Matrix-style ecosystem

scikit-learn exemplifies a machine-learning style, with lots of abstract feature construction and predictive-performance-style model selection built around homogeneously-typed (only floats, only ints) matrices instead of dataframes. This style will be more familiar to MATLAB users than to R users.
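As a rough illustration of that style, here is a minimal sketch in plain numpy rather than scikit-learn itself: an all-float matrix goes in, and a penalty strength is chosen purely by held-out predictive error. The `TinyRidge` class is hypothetical and written for this example; it only mimics the fit/predict convention that scikit-learn estimators follow. The data are synthetic and already centred, so no intercept is fitted.

```python
import numpy as np


class TinyRidge:
    """Hypothetical ridge regressor with a fit/predict interface."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # Closed-form ridge solution: (X'X + alpha * I)^{-1} X'y.
        n_features = X.shape[1]
        gram = X.T @ X + self.alpha * np.eye(n_features)
        self.coef_ = np.linalg.solve(gram, X.T @ y)
        return self

    def predict(self, X):
        return X @ self.coef_


rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))            # homogeneous float matrix
true_coef = np.zeros(10)
true_coef[:3] = [2.0, -1.0, 0.5]           # only three features matter
y = X @ true_coef + rng.normal(scale=0.5, size=300)

# Hold out the last 100 rows for model selection.
X_fit, y_fit = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

# Choose the penalty by validation mean squared error.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_mse = {}
for alpha in alphas:
    model = TinyRidge(alpha=alpha).fit(X_fit, y_fit)
    val_mse[alpha] = float(np.mean((model.predict(X_val) - y_val) ** 2))

best_alpha = min(val_mse, key=val_mse.get)
best_model = TinyRidge(alpha=best_alpha).fit(X_fit, y_fit)
```

Nothing here knows about column names or types, which is the trade-off against the dataframe style: less bookkeeping inside the model, more of it left to you.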

  • scikit-learn (sklearn to its friends) is the flagship of this fleet. It is fast, clear and well-designed. I enjoy using it for implementing ML-type tasks. It has reference implementations of many algorithms, both à la mode and passé, such as random forests, linear regression and Gaussian processes. Although I miss (sniff) glmnet in R for lasso regression.

  • SciKit-Learn Laboratory

    SKLL (pronounced “skull”) provides a number of utilities to make it simpler to run common scikit-learn experiments with pre-generated features.

  • Sklearn-pandas

    …provides a bridge between sklearn’s machine learning methods and pandas-style Data Frames.

    In particular, it provides:

    • a way to map DataFrame columns to transformations, which are later recombined into features

    • a way to cross-validate a pipeline that takes a pandas DataFrame as input.

  • libROSA, the machine listening library, is more or less in this school.

  • pystruct “aims at being an easy-to-use structured learning and prediction library.”

    Currently it implements only max-margin methods and a perceptron, but other algorithms might follow. The learning algorithms implemented in PyStruct have various names, which are often used loosely or differently in different communities. Common names are conditional random fields (CRFs), maximum-margin Markov random fields (M3N) or structural support vector machines.

  • PyCaret (announcement) is some kind of low-code stats tool.

    PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, Microsoft LightGBM, spaCy, and many more.

Bayes stuff

PyStan

xarray

xarray; what is this?

Notable application, ArviZ: Exploratory analysis of Bayesian models as seen in

Time series

Forecasting in Python. Rob J Hyndman, in Python implementations of time series forecasting and anomaly detection, recommends

See tslearn

tslearn is a Python package that provides machine learning tools for the analysis of time series. This package builds on (and hence depends on) scikit-learn, numpy and scipy libraries.

It integrates with other time series tools, for example:

tsfresh

It automatically calculates a large number of time series characteristics, the so called features. Further the package contains methods to evaluate the explaining power and importance of such characteristics for regression or classification tasks.

cesium

Cesium is an end-to-end machine learning platform for time-series, from calculation of features to model-building to predictions. Cesium has two main components - a Python library, and a web application platform that allows interactive exploration of machine learning pipelines. Take control over the workflow in a Python terminal or Jupyter notebook with the Cesium library, or upload your time-series files, select your machine learning model, and watch Cesium do feature extraction and evaluation right in your browser with the web application.

pyts is a time series classification library that seems moderately popular.

The non-python options in Forecasting are also worth looking at.

Interoperation with other languages, platforms

R

Direct API calls

You can do this using rpy2.ipython:

%load_ext rpy2.ipython

%R library(robustbase)
%Rpush yy xx
%R mod <- lmrob(yy ~ xx);
%R params <- mod$coefficients;
%Rpull params

See the Revolutions blog, or Josh Devlin’s tips for more of that.

Counter-intuitively, this is remarkably slow. I have experienced much greater speed saving data to the file system in one language and then loading it in another. For that, see the next section.

Via the filesystem

Much faster, weirdly, and better documented. Recommended. Try Apache Arrow.

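# Python side: pandas reads and writes Feather files via pyarrow
import pandas as pd

path = 'my_data.feather'
df = pd.DataFrame({'xx': [1.0, 2.0, 3.0], 'yy': [2.1, 3.9, 6.2]})  # made-up example data
df.to_feather(path)
df = pd.read_feather(path)

# R side: the arrow package reads and writes the same format
library(arrow)

path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)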

If that doesn’t work, try hdf5 or protobuf or whatever. There are many options. hdf5 seems to work well for me.

Julia

See PyCall from Julia.

Tips

Use a known-good project structure

cookiecutter data science.

Local random number generator state

⚠️ this is out of date now; the new RNG API is much better.
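Assuming "the new RNG API" here means numpy's `Generator` interface (`numpy.random.default_rng`, available since numpy 1.17), a minimal sketch of the newer style looks something like the following. The random state lives in an explicit object that you pass around, so there is no global state to save and restore. The function `noisy_mean` is a made-up example, not part of any library.

```python
import numpy as np


def noisy_mean(n, rng):
    """Hypothetical helper that takes its randomness as an argument."""
    return float(rng.normal(size=n).mean())


# Two generators built from the same seed produce identical streams.
result_a = noisy_mean(100, np.random.default_rng(5))
result_b = noisy_mean(100, np.random.default_rng(5))

# A local generator does not disturb the legacy global RNG:
# the next global draw is the same with or without the call.
np.random.seed(0)
expected = np.random.rand()

np.random.seed(0)
_ = noisy_mean(100, np.random.default_rng(5))
actual = np.random.rand()
```

This only helps when the code you call accepts a generator argument; for libraries that draw from the global state, something like the context manager below is still needed.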

Seeding your RNG can be a pain in the arse, especially if you are interfacing with an external library that doesn’t have RNG state passing in the API. So, use a context manager. Here’s one that works for numpy-based code:

import numpy as np
from numpy.random import get_state, set_state, seed

class Seed:
  """
  Context manager for reproducible seeding of numpy's legacy global RNG.

  >>> with Seed(5):
  ...   print(np.random.rand())
  0.22199317108973948
  """
  def __init__(self, seed):
    self.seed = seed
    self.state = None

  def __enter__(self):
    # Save the caller's global RNG state, then reseed.
    self.state = get_state()
    seed(self.seed)

  def __exit__(self, exc_type, exc_value, traceback):
    # Restore the saved state, even if the block raised.
    set_state(self.state)

Exercise for the student: make it work with the default RNG also
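One possible answer to that exercise, assuming "the default RNG" means Python's built-in `random` module: the same save, seed, restore pattern works with `random.getstate` and `random.setstate`. The name `stdlib_seed` is invented here, and this is a sketch rather than a drop-in replacement for anything above.

```python
import random
from contextlib import contextmanager


@contextmanager
def stdlib_seed(seed):
    """Temporarily seed Python's global `random` module."""
    state = random.getstate()       # remember the caller's state
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)      # restore it even if the body raises


# The same seed gives the same draw inside the block.
with stdlib_seed(5):
    first = random.random()
with stdlib_seed(5):
    second = random.random()

# The surrounding stream is unaffected by a seeded block.
random.seed(123)
expected_next = random.random()

random.seed(123)
with stdlib_seed(5):
    random.random()
actual_next = random.random()
```

The `try`/`finally` matters: without it, an exception inside the block would leave the global generator stuck in the seeded state.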

Miscellaneous learning resources

Numerical tours of signal processing in python.

Agate

agate is a stats package designed not for high performance but for ease of use and reproducibility for non-specialists, e.g. journalists.

agate is intended to fill a very particular programming niche. It should not be allowed to become as complex as numpy or pandas. Please bear in mind the following principles when considering a new feature:

  • Humans have less time than computers. Optimize for humans.
  • Most datasets are small. Don’t optimize for “big data”.
  • Text is data. It must always be a first-class citizen.
  • Python gets it right. Make it work like Python does.
  • Human lives are nasty, brutish and short. Make it easy.

Miscellaneous tools

  • hypertools is a generic dimensionality reduction toolkit. Is this worthwhile?
  • Bonus scientific computation is also available through GSL, most easily via CythonGSL.
  • savez makes persisting arrays fast and efficient by default. Use that if you are talking to other Python processes.
  • biopython is a whole other world of phylogeny and bio-data wrangling. I’m not sure if it adheres to one of the aforementioned schools or not. Should check.
  • more options for speech/string analysis at natural language processing.
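For the savez item in the list above, a minimal round trip with `numpy.savez` and `numpy.load` might look like this. The array names and the temporary file location are arbitrary choices for the example.

```python
import os
import tempfile

import numpy as np

weights = np.arange(6, dtype=float).reshape(2, 3)
labels = np.array([0, 1, 1])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "arrays.npz")

    # Keyword names become the keys of the archive.
    np.savez(path, weights=weights, labels=labels)

    # np.load returns a lazy, dict-like archive; close it when done.
    with np.load(path) as archive:
        keys = sorted(archive.files)
        weights_back = archive["weights"]
        labels_back = archive["labels"]
```

`numpy.savez_compressed` takes the same arguments if disk space matters more than write speed.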
