Machine learning and statistics in Julia
November 27, 2019 — May 27, 2022
Stats/ML and also DSP in Julia.
1 Machine learning
Let’s put the automatic differentiation, the optimizers and the samplers together to do differentiable learning!
The Julia deep learning toolkits have shorter feature lists than those of the fancy Python/C++ libraries (e.g. mobile app building and cuDNN-backed optimisations are less present in Julia libraries). But maybe the elegance/performance of Julia makes some of those features irrelevant? I for one don’t care about most of them, because I’m a researcher, not a deployer.
Having said that, Tensorflow.jl gets all the features, because it invokes the C++ Tensorflow. Surely one misses the benefit of Julia this way, since there are two different array-processing infrastructures to shuttle data between, and two different approaches to JIT versus pre-compiled execution. Or no?
Flux.jl sounds like a reimplementation of Tensorflow-style differentiable programming inside Julia, which strikes me as the right way to do this to benefit from the end-to-end-optimised design philosophy of Julia.
Flux is a library for machine learning. It comes “batteries-included” with many useful tools built in, but also lets you use the full power of the Julia language where you need it. The whole stack is implemented in clean Julia code (right down to the GPU kernels) and any part can be tweaked to your liking.
It’s missing some features of e.g. Tensorflow, but includes compensatory surprising/unique feature combinations. Its GPU support suggests it will handle common CUDA optimizations (even some cuDNN ops), although I suspect that not every CUDA op is supported, and CUDA itself can be scanty (does CUDA do a GPU Discrete Cosine Transform yet?).
Its end-to-end Julia philosophy supports neat tricks. Favourite: DiffEqFlux — see below — which makes Neural ODEs sort-of simple to create.
I have not used it enough to know yet, but I suspect that the generic nature of Flux works against it in one sense, which is that I imagine there are not convenient distributed multi-GPU trainers at the moment. I should confirm this.
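To give a flavour, here is a minimal sketch of training a small dense network in Flux, using the implicit-parameters API of the Flux versions current when I wrote this (ADAM, Flux.params, Flux.train!; newer releases prefer an explicit optimiser-state API):

```julia
using Flux

# toy regression data: 100 samples of 10 features, target is their sum
X = randn(Float32, 10, 100)
y = sum(X; dims = 1)

model = Chain(Dense(10, 32, relu), Dense(32, 1))

# mean squared error, written out to avoid depending on a particular Losses namespace
loss(x, y) = sum(abs2, model(x) .- y) / length(y)
opt = ADAM(1e-3)

for epoch in 1:200
    Flux.train!(loss, Flux.params(model), [(X, y)], opt)
end
```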
Knet.jl is another deep learning library that claims to show the ease of implementing deep learning frameworks in Julia.
Alternatively, Mocha.jl is a belt-and-braces deep learning thing with a library of pre-defined layers, but it is now deprecated and unmaintained.
If one were aiming to implement a deep learning framework anyway, why not do something left-field like taking the dynamical-systems approach to deep learning? This neat trick was popularised by Haber and Ruthotto et al., who have released some of their models as Meganet.jl. I’m curious to see how they work. (The project seems to have paused.)
There are various Gaussian Process options.
MLJ is a scikit-learn-like pipeline for data analysis in Julia which standardises model composition and automates some of the training etc. It has various adaptors for other ML systems via MLJModels.
See also:
- FluxTraining.jl
- FluxML/FastAI.jl: Port of FastAI V2 API to Julia
2 Statistics, probability and data analysis
Hayden Klok and Yoni Nazarathy are writing a free Julia Statistics textbook (preprint) (Nazarathy and Klok 2021) which seems a thorough introduction to statistics as well as Julia, albeit statistics in a classical frame that won’t be fashionable with either your learning theory or Bayesian types.
A good starting point for doing stuff is JuliaStats, an organisation that produces many statistics megapackages, covering kernel density estimates, generalised linear models, loess etc. Install them all at once via StatsKit:
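```julia
using Pkg
Pkg.add("StatsKit")   # meta-package pulling in the JuliaStats suite

using StatsKit        # loads DataFrames, Distributions, GLM, KernelDensity, Loess, …
```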
Less well known but handy is F. Bagge Carlson’s TotalLeastSquares, which does neat errors-in-variables models (Bagge Carlson, F., Machine Learning and System Identification for Estimation in Physical Systems, PhD Thesis, 2018).
2.1 Data frames
The workhorse data structure of statistics.
This was complicated for a while but now I think it has settled down to be simple: Data frames are provided by DataFrames.jl. AFAICT this is the only one we need to care about now.1 Legacy compatibility is provided by IterableTables.jl to translate where needed between various DataFrame-like sources.
You can load a lot of the R standard datasets using RDatasets.
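For example, loading a couple of classics:

```julia
using RDatasets

iris   = dataset("datasets", "iris")   # the classic iris data, returned as a DataFrame
boston = dataset("MASS", "Boston")     # datasets shipped with R packages work too
```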
As for more sophisticated processing:
DataFramesMeta has been recommended as a tidyverse analogue for Julia. One can use it to manipulate DataFrames (and DataTables, SQL databases and streaming data sources). It seems to be very active.
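A minimal sketch of the DataFramesMeta verbs (using the post-0.10 names; older releases spelled these @where and @based_on):

```julia
using DataFrames, DataFramesMeta, Statistics

df = DataFrame(group = [1, 1, 2, 2], x = [0.1, 0.4, 0.3, 0.9])

small   = @subset(df, :x .< 0.5)              # filter rows
scaled  = @transform(df, :x2 = 2 .* :x)       # add a derived column
bygroup = @by(df, :group, :meanx = mean(:x))  # grouped summary
```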
Query.jl looks similar, and is integrated with IterableTables.jl:

Query is a package for querying julia data sources. It can filter, project, join and group data from any iterable data source, including all the sources supported in IterableTables.jl. One can for example query any of the following data sources: any array, DataFrames, DataStreams (including CSV, Feather, SQLite, ODBC), DataTables, IndexedTables, TimeSeries, Temporal, TypedTables and DifferentialEquations (any DESolution).

It seems less active ATM than DataFramesMeta though.
Another alternative: tidyverse-like behaviour via the Pipe or Chain packages (a small piping sketch follows the list below):
- Chain.jl: Even More Convenient Piping
- jkrumbiegel/Chain.jl: A Julia package for piping a value through a series of transformation expressions using a more convenient syntax than Julia’s native piping functionality.
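A minimal sketch of Chain.jl’s piping, where each step receives the previous result (either as the sole argument or wherever `_` is placed):

```julia
using Chain

result = @chain 1:100 begin
    filter(isodd, _)   # explicit placeholder for the previous value
    sum                # bare function: previous value becomes the argument
    sqrt
end
```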
DataFrames taste better with InvertedIndices, which allow searching by negation. I think this is redundant for recent DataFrames though.
2.2 Frequentist statistics
Lasso and other sparse regressions are available in Lasso.jl, which reimplements the lasso algorithm in pure Julia, and GLMNet.jl, which wraps the classic Friedman FORTRAN code for the same. There is also (functionality unattested) an orthogonal matching pursuit one called OMP.jl, but that algorithm is simple enough to bang out oneself in an afternoon, so no stress if it doesn’t work. Incremental/online versions of (presumably exponential family) statistics are in OnlineStats. MixedModels.jl is a Julia package providing capabilities for fitting and examining linear and generalized linear mixed-effect models. It is similar in scope to the lme4 package for R.
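A sketch of the MixedModels interface on made-up toy data (the formula syntax mirrors lme4’s):

```julia
using MixedModels, DataFrames

# toy longitudinal data: response y, one fixed effect x, grouping factor subj
df = DataFrame(y = randn(200),
               x = randn(200),
               subj = repeat(string.('a':'t'), 10))

# random intercept per subject
m = fit(MixedModel, @formula(y ~ 1 + x + (1 | subj)), df)
```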
2.3 Probabilistic programming
Probabilistic programming! Bayesian inference considered broadly! Several options on the probabilistic programming page are based on Julia, specifically Turing.jl (source), Mamba.jl, Gen (source), DynamicHMC, Klara.jl, and probably others. Of these, Gen and Turing seem the most active.
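For flavour, a minimal Turing.jl model, the inevitable coin-flip example:

```julia
using Turing

@model function coinflip(y)
    p ~ Beta(1, 1)            # prior on the coin's bias
    for i in eachindex(y)
        y[i] ~ Bernoulli(p)   # likelihood of each flip
    end
end

flips = [1, 1, 0, 1, 0, 1, 1, 1]
chain = sample(coinflip(flips), NUTS(), 1_000)
```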
3 Differentiating, optimisation
3.1 Optimizing
JuMP supports many types of optimisation, including over non-continuous domains, and is part of the JuliaOpt family of confusingly diverse optimizers, which invoke various sub-families of optimizers. The famous NLOpt solvers comprise one such class, and they can additionally be invoked separately.
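A minimal JuMP sketch, here with the open-source HiGHS back-end (any supported solver slots into Model(...) the same way):

```julia
using JuMP, HiGHS

model = Model(HiGHS.Optimizer)
@variable(model, x >= 0)
@variable(model, y >= 0)
@constraint(model, x + 2y <= 14)
@constraint(model, 3x - y >= 0)
@objective(model, Max, 3x + 2y)

optimize!(model)
value(x), value(y)
```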
Unlike NLOpt and the JuMP family, Optim.jl (part of JuliaNLSolvers, a different family entirely) solves optimisation problems purely inside Julia. It has nieces and nephews such as LsqFit for Levenberg-Marquardt non-linear least squares fits. Optim.jl will automatically invoke ForwardDiff. It assumes mostly unconstrained problems.
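For example, the obligatory Rosenbrock minimisation; passing autodiff = :forward is how one asks Optim to get gradients from ForwardDiff:

```julia
using Optim

rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

res = optimize(rosenbrock, zeros(2), BFGS(); autodiff = :forward)
Optim.minimizer(res)   # ≈ [1.0, 1.0]
```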
Krylov.jl is a collection of Krylov-type iterative methods for large linear systems and least-squares problems.
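A sketch of what that looks like in practice, here with conjugate gradients on a symmetric positive-definite sparse system:

```julia
using Krylov, SparseArrays, LinearAlgebra

A = sprandn(1_000, 1_000, 0.01)
A = A'A + I              # make it symmetric positive definite for CG
b = randn(1_000)

x, stats = cg(A, b)      # Krylov.jl solvers return (solution, stats)
stats.solved
```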
3.2 Autodiff
Julia is a hotbed of autodiff for technical and community reasons. Such a hotbed that it’s worth discussing in the autodiff notebook.
Closely related, projects like ModelingToolkit.jl blur the lines between equations and coding, and allow easy definition of differentiable or probabilistic programming.
4 ODEs, PDEs, SDEs
Chris Rackauckas is a veritable wizard with this stuff; read his blog.
Here is a tour of fun tricks with stochastic PDEs. There is a lot of tooling for this; DiffEqOperators … does something. DiffEqFlux (EZ neural ODEs) works with Flux and claims to make Neural ODEs simple. The implementation of these things in Python, for the award-winning NeurIPS paper that made them famous, was a nightmare. +1 for Julia here. The neural SDE section is mostly Julia; go check that out.
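For orientation, the basic DifferentialEquations.jl workflow that all of those build upon, here on a scalar logistic ODE:

```julia
using DifferentialEquations

f(u, p, t) = u * (1 - u)                 # logistic growth, du/dt = u(1 - u)
prob = ODEProblem(f, 0.1, (0.0, 10.0))   # initial condition 0.1 on t ∈ [0, 10]
sol = solve(prob, Tsit5())

sol(5.0)                                 # dense-output interpolation at t = 5
```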
5 Configuration
See experiment tracking in ML for what I mean here.
6 Matrix Factorisation and completion
NMF.jl contains reference implementations of non-negative matrix factorisation.
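A quick sketch of its nnmf entry point (algorithm choices include multiplicative updates and ALS variants):

```julia
using NMF

X = rand(100, 50)                  # non-negative data matrix
r = nnmf(X, 5; alg = :multmse)     # rank-5 factorisation via multiplicative updates
W, H = r.W, r.H                    # X ≈ W * H
```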
Laplacians.jl by Dan Spielman et al is a matrix factorisation toolkit especially for Laplacian (graph adjacency) matrices.
Once again, F. Bagge Carlson’s TotalLeastSquares solves certain matrix factorization and completion problems.
7 Signal processing
DSP.jl has been split off from core and now needs to be installed separately. Also DirectConvolutions has sensible convolution code.
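For instance, designing and applying a filter, plus a plain convolution, with DSP.jl:

```julia
using DSP

x = randn(10_000)

# 4th-order Butterworth low-pass at 0.2 × Nyquist, applied forward-backward
lp = digitalfilter(Lowpass(0.2), Butterworth(4))
y = filtfilt(lp, x)

z = conv(x, [0.25, 0.5, 0.25])   # direct convolution with a small kernel
```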
FFTs are provided by AbstractFFTs, which in principle wraps many FFT implementations. I don’t know if there is a GPU implementation yet, but the classic CPU implementation is certainly there, provided by FFTW.jl, which wraps the FFTW library.
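Basic usage looks like this; precomputed plans are worthwhile when transforming many arrays of the same size:

```julia
using FFTW

x = randn(1024)
X = fft(x)             # complex spectrum
x2 = real(ifft(X))     # round trip back to (approximately) x

p = plan_fft(x)        # precomputed plan for repeated transforms
X2 = p * x
```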
As for how to use these things, Numerical tours of data sciences has a Julia edition with lots of signal processing content.
JuliaAudio processes audio. They recommend PortAudio.jl as a real time soundcard interface, which looks sorta simple. See rkat’s example of how this works. There are useful abstractions like SampledSignals to load audio and keep the data and signal rate bundled together. Although, as SampledSignals maintainer Spencer Russell points out, AxisArrays might be the right data structure for sampled signals, and you could use SampledSignals purely for IO, and ignore its data structures thereafter.
Images.jl processes images.
8 QMC
Low discrepancy and other QMC stuff. Mostly I want low discrepancy sequences. There are two options with near identical interfaces; I’m not sure of the differences.
Sobol.jl claims to have been performance-profiled.
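Usage is simple either way; in Sobol.jl it looks like:

```julia
using Sobol

s = SobolSeq(2)                      # 2-dimensional Sobol sequence
points = [next!(s) for _ in 1:8]     # draw the first 8 points

skip(s, 1000)                        # or burn through points to start deeper in
```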
9 References
Footnotes
There are some older ones you might encounter such as DataTables.jl which are subtly incompatible in tedious ways which these days we can ignore.↩︎