Statistics software

2015-02-28 — 2019-04-18

Wherein Various Statistical and Machine‑learning Packages Are Surveyed, Including Languages (R, Python, Julia, Scala) and Tools for Streaming, Embedded Devices, and Out‑of‑core Learning.

General number crunching/data analysis packages. (Specialist software is dealt with elsewhere. See machine vision, machine listening, gesture recognition, scientific workflow etc.)

R is a galaxy of statistical packages.
Python is likewise a whole world of its own, and so has its own page, plus a sub-notebook just for statistical issues.
Sundry deep learning packages are on their own page.
Scala’s ML ecosystem is growing.
So is Julia’s.
Shogun (C++) is a large umbrella ML library with lots of algorithms and much muscular l33t ML attitude.
dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real world problems. It is used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high performance computing environments. Dlib’s open source licensing allows you to use it in any application, free of charge.

It includes a lot of kitchen-sink features, such as a GUI toolkit and Python bindings.

Shark (C++)

SHARK is a fast, modular, feature-rich open-source C++ machine learning library. It provides methods for linear and nonlinear optimization, kernel-based learning algorithms, neural networks, and various other machine learning techniques (see the feature list below). It serves as a powerful toolbox for real world applications as well as research. Shark depends on Boost and CMake.
ELL (Embedded Learning Library) is Microsoft’s open-source library targeting tiny processors and embedded devices.
Weka is a Java equivalent to Shogun, I suppose: lots of algorithms in a huge package. It has a streaming extension called MOA, which makes it into something more like Vowpal Wabbit. Regarding which…
Vowpal Wabbit, despite its abstruse project description, seems to be a good library for out-of-core linear learning, i.e. regression or classification on a stupendously large data set, but with a model that is not stupendously large. Approaches include various online (in the sense of incremental, streaming) optimisations; L1/L2 regularisation; linear or logistic models (i.e. linear models); and squared, hinge, logistic, or quantile losses. (Has a Python binding btw; doesn’t everything?)
bash, for data science at the command line.
ROOT is CERN’s crazy data analysis framework. I do not know its USP.
KNIME
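
For a feel of what out-of-core linear learning means in the entry above (online updates, L2 regularisation, logistic loss), here is a minimal sketch in plain Python. It is not how any of the listed libraries is implemented. The class `OnlineLogisticRegression`, the generator `synthetic_stream`, and all parameter values are hypothetical names and choices made up for illustration. The point is only that the model (a handful of weights) stays small while the data arrive one example at a time and are never held in memory all at once.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class OnlineLogisticRegression:
    """Hypothetical helper: logistic regression fitted by per-example SGD."""

    def __init__(self, n_features, learning_rate=0.1, l2=1e-4):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.learning_rate = learning_rate
        self.l2 = l2  # strength of the L2 penalty on the weights

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return sigmoid(z)

    def update(self, x, y):
        # For a label y in {0, 1}, the logistic-loss gradient w.r.t. the
        # weights is (p - y) * x; the L2 penalty adds l2 * w.
        err = self.predict_proba(x) - y
        for i, xi in enumerate(x):
            self.w[i] -= self.learning_rate * (err * xi + self.l2 * self.w[i])
        self.b -= self.learning_rate * err


def synthetic_stream(n, seed=0):
    """Yield (x, y) pairs lazily, standing in for a file too big for memory."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = 1 if 2.0 * x[0] - 1.0 * x[1] > 0 else 0
        yield x, y


# Single pass over the stream; each example is seen once and then discarded.
model = OnlineLogisticRegression(n_features=2)
for x, y in synthetic_stream(20000):
    model.update(x, y)

# Evaluate on a separate, small held-out sample.
held_out = list(synthetic_stream(1000, seed=1))
accuracy = sum(
    (model.predict_proba(x) > 0.5) == (y == 1) for x, y in held_out
) / len(held_out)
print(f"held-out accuracy: {accuracy:.3f}")
```

Swapping the loss (squared, hinge, quantile) only changes the `err` term in `update`; the streaming structure stays the same.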