Kinda hate R, because as much as it is a statistical dream, it is a programming nightmare? Is MATLAB too expensive when you try to run it on your cloud server farm, and are you anyway vaguely suspicious that they get kickbacks from the companies that sell RAM, because otherwise why does it eat all your memory like that? Love the speed of C++ but have a nagging feeling that you should not need to recompile your code to do exploratory data analysis? Like the idea of Julia, but wary of depending on yet another bloody language, let alone one without the serious corporate backing or history of the other ones I mentioned?
Python has a different set of warts to those other options. Its statistical library support is narrower than R’s, and probably comparable to MATLAB’s. It is, however, generally fastish and nicer to debug, and integrates general programming tasks well: web servers, academic blogs, neural networks, weird art projects, online open workbooks…
The matrix mathematics side of Python is not so comprehensive as MATLAB, mind. Most of these tools are built on the core matrix libraries, numpy and scipy, which mostly do what you want, but when, for example, you really need some fancy spline type you read about on the internet, or you want someone to have really put in some effort on ensuring such-and-such a recursive filter is stable, you might find you need to do it yourself.
There are alternative interfaces to core numeric functionality, for example tensorflow, PyArmadillo, and the strange adjacent system of GPU libraries.
OTOH, it’s ubiquitous and free, free, free so you don’t need to worry about stupid licensing restrictions, and the community is enormous, so it’s pretty easy to answer any questions you may have.
But in any case, you don’t need to choose. Python interoperates with all these other languages, and indeed, makes a specialty of gluing stuff together.
Aside: A lot of useful machine-learning-type functionality, which I won’t discuss in detail here, exists in the Python deep learning toolkits such as Tensorflow and Theano; you might want to check those pages too. Also graphing is a whole separate issue, as is optimisation.
In recent times, a few major styles have been ascendant in the statistical python scene.
DataFrame-style ecosystem
pandas plus statsmodels look a lot more like R, but stripped down. On the minus side, they lack some language features of R (e.g. regression formulae are not first-class language features). On the plus side, they lack some language features of R, such as the object model being a box of turds, and copy-by-value semantics, and broken reference counting and hating kittens. Then again, there is no ggplot2 and a smaller selection of implemented algorithms compared to R, so you don’t get this stuff for free.
pandas is more-or-less a dataframe class for Python. Lots of nice things are built on this, such as …
statsmodels, which is more-or-less R, but Python. Implements:
 Linear regression models
 Generalized linear models
 Discrete choice models
 Robust linear models
 Many models and functions for time series analysis
 Nonparametric estimators
 A wide range of statistical tests
 etc
patsy implements a formula language for pandas. This does lots of things, but most importantly, it
 builds design matrices (i.e. it knows how to represent z~x^2+x+y^3 as a matrix, which only sounds trivial if you haven’t tried it)
 statefully preconditions data (e.g. constructs data transforms that will correctly normalise the test set as well as the training data.)
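To make those two jobs concrete, here is a hand-rolled NumPy sketch of the bookkeeping a formula library does for you. It does not use patsy’s API. The helper names `fit_standardiser` and `design_matrix` are hypothetical, invented for illustration, and the formula is read literally as polynomial terms, which is an assumption.

```python
import numpy as np


def fit_standardiser(column):
    """Remember the training mean and standard deviation of one column."""
    mean, std = column.mean(), column.std()
    return lambda values: (values - mean) / std


def design_matrix(x, y, scale_x, scale_y):
    """Columns: intercept, x^2, x, y^3, built from standardised inputs."""
    xs, ys = scale_x(x), scale_y(y)
    return np.column_stack([np.ones_like(xs), xs**2, xs, ys**3])


rng = np.random.default_rng(0)
x_train, y_train = rng.normal(5, 2, 50), rng.normal(-1, 3, 50)
x_test, y_test = rng.normal(5, 2, 10), rng.normal(-1, 3, 10)

# The transforms are fitted once, on the training data only ...
scale_x, scale_y = fit_standardiser(x_train), fit_standardiser(y_train)
X_train = design_matrix(x_train, y_train, scale_x, scale_y)
# ... and then reused unchanged on the test data (the "stateful" part).
X_test = design_matrix(x_test, y_test, scale_x, scale_y)
```

The point of the stateful step is that the test set is normalised with the training mean and scale, not its own.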
Matrix-style ecosystem
scikit-learn exemplifies a machine-learning style, with lots of abstract feature construction and predictive-performance-style model selection built around homogeneously-typed (only floats, only ints) matrices instead of dataframes. This style will be more familiar to MATLAB users than to R users.
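As a rough illustration of that style, the sketch below turns mixed-type records into a single float matrix and judges a model by held-out error. It deliberately uses plain NumPy, not scikit-learn’s API, and the data and variable names are made up.

```python
import numpy as np

# Mixed-type records: a categorical column and a numeric one (toy data).
colour = np.array(["red", "green", "red", "blue", "green", "blue", "red", "blue"])
size = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
target = np.array([1.1, 4.0, 3.2, 9.1, 7.1, 10.9, 7.0, 13.1])

# Feature construction: one-hot encode the category so everything is a float.
levels = np.unique(colour)
one_hot = (colour[:, None] == levels[None, :]).astype(float)
X = np.column_stack([one_hot, size])  # homogeneous float64 matrix

# Predictive-performance-style evaluation: fit on one part, score on the rest.
train, test = np.arange(6), np.arange(6, 8)
coef, *_ = np.linalg.lstsq(X[train], target[train], rcond=None)
mse = np.mean((X[test] @ coef - target[test]) ** 2)
```

In a dataframe-style workflow the `colour` column would keep its labels; here it has to be encoded before the matrix can exist at all.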

…is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.
scikit-learn (sklearn to its friends) is the flagship of this fleet. It is fast, clear and well-designed. I enjoy using it for implementing ML-type tasks. It has reference implementations of many algorithms, both à la mode and passé, such as random forests, linear regression and Gaussian processes. Although I miss (sniff) glmnet in R for lasso regression. Tensorflow includes a sklearn-like API in its library.
SKLL (pronounced “skull”) provides a number of utilities to make it simpler to run common scikit-learn experiments with pre-generated features.

…provides a bridge between sklearn’s machine learning methods and pandas-style Data Frames. In particular, it provides:
 a way to map DataFrame columns to transformations, which are later recombined into features
 a way to cross-validate a pipeline that takes a pandas DataFrame as input.
libROSA, the machine listening library, is more or less in this school.
kernelregression I should check out. Claims to supplement the convolution kernel algorithms in sklearn.
“pystruct aims at being an easy-to-use structured learning and prediction library.”
Currently it implements only max-margin methods and a perceptron, but other algorithms might follow. The learning algorithms implemented in PyStruct have various names, which are often used loosely or differently in different communities. Common names are conditional random fields (CRFs), maximum-margin Markov random fields (M3N) or structural support vector machines.
The new fusion: xarray
xarray is a system which labels arrays:
Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.
This data model is borrowed from the netCDF file format, which also provides xarray with a natural and portable serialization format. NetCDF is very popular in the geosciences, and there are existing libraries for reading and writing netCDF in many programming languages, including Python.
This system seems to be a developing lingua franca for the differentiable learning frameworks. Somewhat similar to the matrix-style interface of scikit-learn, but weirder, is the computation-flow-graph style of the differentiable learning libraries such as pytorch and Tensorflow, and the variational approximation / deep learning communities.
ArviZ is a Python package for exploratory analysis of Bayesian models. Includes functions for posterior analysis, data storage, sample diagnostics, model checking, and comparison.
The goal is to provide backend-agnostic tools for diagnostics and visualizations of Bayesian inference in Python, by first converting inference data into xarray objects. See here for more on xarray and ArviZ usage and here for more on InferenceData structure and specification.
Time series
Suddenly there is a big pile of time series packages, which might be useful. They tend to be based on time series classification and feature extraction, which I… am sure is useful for something but I have personally never needed.
See tslearn
tslearn is a Python package that provides machine learning tools for the analysis of time series. This package builds on (and hence depends on) the scikit-learn, numpy and scipy libraries.
It integrates with other time series tools, for example:
It automatically calculates a large number of time series characteristics, the so-called features. Further, the package contains methods to evaluate the explaining power and importance of such characteristics for regression or classification tasks.
Cesium is an end-to-end machine learning platform for time-series, from calculation of features to model-building to predictions. Cesium has two main components: a Python library, and a web application platform that allows interactive exploration of machine learning pipelines. Take control over the workflow in a Python terminal or Jupyter notebook with the Cesium library, or upload your time-series files, select your machine learning model, and watch Cesium do feature extraction and evaluation right in your browser with the web application.
sktime is a generic forecasting library modelled after scikit-learn.
pyts is a time series classification library that seems moderately popular.
Interoperation with other languages, platforms
R
Direct API calls
You can do this using rpy2.ipython:
%load_ext rpy2.ipython
%R library(robustbase)
%Rpush yy xx
%R mod <- lmrob(yy ~ xx);
%R params <- mod$coefficients;
%Rpull params
See the Revolutions blog, or Josh Devlin’s tips for more of that.
Counterintuitively, this is remarkably slow. I have experienced much greater speed in saving data to the file system in one language then loading it in another. For that, see the next section.
Via the filesystem
Much faster, weirdly, and better documented. Recommended. Try feather, which is based on Apache Arrow.
# Python side; df is an existing pandas DataFrame
import pandas as pd
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)

# R side; df is an existing data frame
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)
If that doesn’t work, try hdf5 or protobuf or whatever. There are many options. hdf5 seems to work well for me.
Julia
See PyCall from Julia.
Displaying numbers legibly
Easy, but documentation is hard to find.
Floats
Sven Marnach distills everything adroitly:
print("{:10.4f}".format(x))
means “with 4 decimal places, align x to fill 10 columns”.
All conceivable alternatives are displayed at pyformat.info.
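A few variations on the same format spec, as a small sketch using only the standard library. The variable names are arbitrary.

```python
x = 3.14159265

padded = "{:10.4f}".format(x)      # right-aligned in 10 columns, 4 decimal places
left = "{:<10.4f}|".format(x)      # left-aligned, padded on the right
fstring = f"{x:10.4f}"             # the same spec inside an f-string
scientific = "{:.3e}".format(x)    # scientific notation, 3 decimal places

for s in (padded, left, fstring, scientific):
    print(repr(s))
```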
Arrays
How I set my numpy arrays to be displayed big and informative:
np.set_printoptions(edgeitems=5,
linewidth=85, precision=4,
suppress=True, threshold=500)
Reset to default:
np.set_printoptions(edgeitems=3, infstr='inf',
linewidth=75, nanstr='nan', precision=8,
suppress=False, threshold=1000, formatter=None)
There are a lot of ways to do this one. See also np.array_str for one-off formatting.
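If changing the global print options feels heavy-handed, two lighter-touch alternatives are sketched below: `np.array_str` for a single array, and the `np.printoptions` context manager (available in reasonably recent NumPy versions) for a temporary change. The example array is arbitrary.

```python
import numpy as np

a = np.array([1.23456789, 1e-10, 1234.5])

# One-off formatting of a single array; global print options are untouched.
text = np.array_str(a, precision=3, suppress_small=True)

# Temporary options that are restored when the block exits.
with np.printoptions(precision=2, suppress=True):
    inside = str(a)
outside = str(a)

print(text)
print(inside)
print(outside)
```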
Tips
Use a known-good project structure
Local random number generator state
Seeding your RNG can be a pain in the arse, especially if you are interfacing with an external library that doesn’t have RNG state passing in the API. So, use a context manager. Here’s one that works for numpy-based code:
import numpy as np
from numpy.random import get_state, set_state, seed


class Seed(object):
    """
    Context manager for reproducible seeding.

    >>> with Seed(5):
    ...     print(np.random.rand())
    0.22199317108973948
    """

    def __init__(self, seed):
        self.seed = seed
        self.state = None

    def __enter__(self):
        # Remember the global numpy RNG state, then reseed it.
        self.state = get_state()
        seed(self.seed)

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the state that was in place on entry.
        set_state(self.state)
Exercise for the student: make it work with the default RNG also.
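One possible answer to the exercise, assuming “the default RNG” means Python’s built-in `random` module (that reading is my assumption). The name `seeded` is hypothetical. This sketch only touches the two global generators; `np.random.default_rng()` objects carry their own state and are not affected.

```python
import random
from contextlib import contextmanager

import numpy as np


@contextmanager
def seeded(seed):
    """Seed both global RNGs, then restore their previous states on exit."""
    py_state = random.getstate()
    np_state = np.random.get_state()
    random.seed(seed)
    np.random.seed(seed)
    try:
        yield
    finally:
        # Runs even if the body raises, so the outer state is never lost.
        random.setstate(py_state)
        np.random.set_state(np_state)


with seeded(5):
    first = (random.random(), np.random.rand())
with seeded(5):
    second = (random.random(), np.random.rand())
```

Using `contextlib.contextmanager` with `try`/`finally` keeps the restore step in one place, at the cost of being a function and not a class.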
Miscellaneous learning resources
Agate
agate is a stats package designed not for high performance but for ease of use and reproducibility for non-specialists, e.g. journalists.
agate is intended to fill a very particular programming niche. It should not be allowed to become as complex as numpy or pandas. Please bear in mind the following principles when considering a new feature:
 Humans have less time than computers. Optimize for humans.
 Most datasets are small. Don’t optimize for “big data”.
 Text is data. It must always be a first-class citizen.
 Python gets it right. Make it work like Python does.
 Human lives are nasty, brutish and short. Make it easy.
Miscellaneous tools
 hypertools is a generic dimensionality reduction toolkit. Is this worthwhile?
 Bonus scientific computation is also available through GSL, most easily via CythonGSL.
 savez makes persisting arrays fast and efficient by default. Use that if you are talking to other Python processes.
 biopython is a whole other world of phylogeny and biodata wrangling. I’m not sure if it adheres to one of the aforementioned schools or not. Should check.
 more options for speech/string analysis at natural language processing.
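For the savez item in the list above, a minimal round trip might look like the following sketch. The array names and the temporary file location are arbitrary choices for illustration.

```python
import os
import tempfile

import numpy as np

weights = np.arange(6, dtype=float).reshape(2, 3)
labels = np.array([0, 1, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrays.npz")
    # Keyword names become the keys of the archive.
    np.savez(path, weights=weights, labels=labels)
    # np.load on an .npz file returns a dict-like object; read inside the block.
    with np.load(path) as archive:
        loaded_weights = archive["weights"]
        loaded_labels = archive["labels"]
```

`np.savez_compressed` takes the same arguments if disk space matters more than speed.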