Numerical libraries

March 1, 2016 — April 28, 2023

computers are awful
number crunching
premature optimization

There are too many numerical libraries.

Here are some I am sifting through, skewed towards python-compatible ones.

Normally when I say numerics I mean floating-point arithmetic, but there is a whole world out there. See swmath for a search engine over published mathematical software (which does not necessarily have anything to do with numbers, but let us stay here for now..)

Figure 1

1 XLA ecology

The input language to XLA is called “HLO IR”, or just HLO (High Level Operations). The semantics of HLO are described on the Operation Semantics page. It is most convenient to think of HLO as a compiler IR. XLA takes graphs (“computations”) defined in HLO and compiles them into machine instructions for various architectures. XLA is modular in the sense that it is easy to slot in an alternative backend to target some novel HW architecture.

Most famously tensorflow and jax use XLA as a backend.

1.1 DEX

Via David Duvenaud, a language designed around XLA. (Paszke et al. 2021)

Dex is a functional, statically typed language for array processing. There are many tools for array processing, from high-level libraries like NumPy and MATLAB to low-level languages like CUDA. Dex is a new approach for high-level array processing that aims for the clarity of high-level libraries while allowing for more granular expressivity. In particular, Dex does not force you to rewrite all operations in terms of batched tensor interactions, but allows for a range of interactions. Put more simply, when learning MATLAB students are told repeatedly to “avoid for loops”. Dex gives for loops back.

Looks like an even more spartan jax with strong Haskell energy. Might be a friendly front-end to XLA, which is the main thing I want from jax.

2 LAPACK ecology

(Scott Locklin’s summary will do for this, actually)

Most dense linear algebra work presently happens in something called LAPACK. If you want to calculate eigenvectors in Matlab, Incanter/Clojure, Python, R, Julia and what have you: it’s almost certainly getting piped through LAPACK. If you’re working in C, C++ or Fortran, you’re still probably doing in via LAPACK. If you’re doing linear algebra in Java, you are a lost soul, […]but the serious people I know who do this, they allocate blobs of memory and run LAPACK on the blobs. The same thing was true 25 years ago. In fact, people have basically been running their numerics code on LAPACK for over 40 years when it was called EISPACK and LINPACK. Quite a lot of it conforms with… Fortran 66; a language invented 50 years ago which is specifically formatted for punched cards. That is an astounding fact.

LAPACK is a sort of layercake on top of something called the BLAS (“Basic Linear Algebra Subroutines”). I think BLAS were originally designed as a useful abstraction so your Eigendecomposition functions have more elementary operations in common with your QR decomposition functions. This made LAPACK better than its ancestors from a code management and complexity point of view. It also made LAPACK faster, as it allowed one to use machine optimized BLAS when they are available, and BLAS are the type of thing a Fortran compiler could tune pretty well by itself, especially on old-timey architectures without complex cache latency. While most of these functions seem boring, they are in fact extremely interesting. Just to give a hint: there is a function in there for doing matrix-multiply via the Winograd version of the Strassen algorithm, which multiplies matrices in O(N^2.8) instead of O(N^3). It’s the GEMMS family of functions if you want to go look at it. Very few people end up using GEMMS, because you have to be a numerical linear algebraist to understand when it is useful enough to justify the O(N^0.2) speedup. For example, there are exactly no LAPACK functions which call GEMMS. But GEMMS is there, implemented for you, just waiting for the moment when you might need to solve … I dunno, a generalized Sylvester equation where Strassen’s algorithm applies.

OK, so that is the background. But what linear algebra library do you use these days? There are so many — Intel MKL, Boost uBLAS etc. (and where does GSL fit in?) Many of them have weirder and fancier interfaces than LAPACK classic. One of the remarkable features is that they are all faster than all the other ones. Or rather, the easiest-to-find benchmarks for each tend to favour what that one does best. In my inexpert understanding this means that Blaze does simple operations over big arrays fastest, Armadillo does fused arrays fast-ish, ArrayFire does… everything pretty fast AFAICT. Maybe it has crappy licensing? IDK. Eigen seems to be not especially fast but must be easy to plug in to your existing code because it’s gotten everywhere. So has Armadillo, FWIW, for maybe the same reason?. Also what you intend to run it on is important — not everything supports all platforms, SIMD, threads etc.


Intel’s Matrix Kernel library is supposed be a LAPACK-alike that gets peak performance out of various intel CPUs. I’ve used it via python and julia. In neither case has it been worthwhile; It is gigantic. It possibly speeds hings up marginally, but not noticably, and it bloats little scripts into gigantic monsters. And it crashes. Maybe if you needed to squeeze an extra few percentage points out of some reasonably well-understood optimized codebase? Not prototype-friendly.

4 Armadillo/Bandicoot

Armadillo is a popular C++ wrapper around the LAPACK tools, providing lazy automagic graph optimisation of operations, or at least, that’s what some earnest guy told me at a seminar. Super popular for less horrible matrix calcs in R.

A GPU sibling by the same authors is Bandicoot.

There is a python wrapper, PyArmadillo.

  • Armadillo is a high quality linear algebra library (matrix maths) for the C++ language, aiming towards a good balance between speed and ease of use

  • Provides high-level syntax and functionality deliberately similar to Matlab

  • Useful for algorithm development directly in C++, or quick conversion of research code into production environments (e.g. software & hardware products)

  • Provides efficient classes for vectors, matrices and cubes (1st, 2nd and 3rd order tensors); dense and sparse matrices are supported

  • Integer, floating point and complex numbers are supported

  • Various matrix decompositions are provided through integration with LAPACK, or one of its high performance drop-in replacements (e.g. multi-threaded Intel MKL, or OpenBLAS)

  • A sophisticated expression evaluator (based on template meta-programming) automatically combines several operations to increase speed and efficiency

  • Can automatically use OpenMP multi-threading (parallelisation) to speed up computationally expensive operations

  • Available under a permissive license, useful for both open-source and proprietary (closed-source) software

  • Can be used for machine learning, pattern recognition, computer vision, signal processing, bioinformatics, statistics, finance, etc

  • Armadillo employs a delayed evaluation approach to combine several operations into one and reduce (or eliminate) the need for temporaries. Where applicable, the order of operations is optimised. Delayed evaluation and optimisation are achieved through recursive templates and template meta-programming.

  • While chained operations such as addition, subtraction and multiplication (matrix and element-wise) are the primary targets for speed-up opportunities, other operations, such as manipulation of submatrices, can also be optimised. Care was taken to maintain efficiency for both “small” and “big” matrices.

5 Arrayfire

Arrayfire is a CPU/OpenCL/GPU array operations library, open-sourced by a DARPA contractor, who are still spruiking it:

The ArrayFire library is our flagship product. It is a general-purpose computational tool that simplifies the process of developing software that targets parallel and massively-parallel architectures including CPUs, GPUs, and other hardware acceleration devices. ArrayFire provides software developers with a high-level abstraction of data which resides on the accelerator, the af::array object (or C-style struct). Developers write code which performs operations on ArrayFire arrays which, in turn, are automatically translated into near-optimal kernels that execute on the computational device. ArrayFire includes support for supports CPUs from all major vendors (Intel, AMD, Arm), GPUs from the dominant manufacturers (NVIDIA, AMD, and Qualcomm), as well as a variety of other accelerator devices. ArrayFire has been used successfully on devices ranging from low-power mobile phones to high-power GPU-enabled supercomputers.

It has python and julia backends. Remarkably, the julia one is asynchronous to some extent, which seems useful.

6 Eigen

Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms. Notable for being used in various other things e.g. Tensorflow’s CPU backend and Stan the Monte Carlo thingy.

  • Eigen is versatile.

    • It supports all matrix sizes, from small fixed-size matrices to arbitrarily large dense matrices, and even sparse matrices.

    • It supports all standard numeric types, including std::complex, integers, and is easily extensible to custom numeric types.

    • It supports various matrix decompositions and geometry features.

    • Its ecosystem of unsupported modules provides many specialized features such as non-linear optimization, matrix functions, a polynomial solver, FFT, and much more.

  • Eigen is fast.

    • Expression templates allow to intelligently remove temporaries and enable lazy evaluation, when that is appropriate.

    • Explicit vectorization is performed for SSE 2/3/4, AVX, FMA, AVX512, ARM NEON (32-bit and 64-bit), PowerPC AltiVec/VSX (32-bit and 64-bit) instruction sets, and now S390x SIMD (ZVector) with graceful fallback to non-vectorized code.

    • Fixed-size matrices are fully optimized: dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.

    • For large matrices, special attention is paid to cache-friendliness.

  • Eigen is reliable.

    • Algorithms are carefully selected for reliability. Reliability trade-offs are clearly documented and extremely safe decompositions are available.

    • Eigen is thoroughly tested through its own test suite (over 500 executables), the standard BLAS test suite, and parts of the LAPACK test suite.

  • Eigen is elegant.

    • The API is extremely clean and expressive while feeling natural to C++ programmers, thanks to expression templates.

    • Implementing an algorithm on top of Eigen feels like just copying pseudocode.

  • Eigen has good compiler support as we run our test suite against many compilers to guarantee reliability and work around any compiler bugs. Eigen also is standard C++98 and maintains very reasonable compilation times.

7 Blaze

Blaze is another bloody one, which seems to have threaded computation in particular as its USP, and not caring about backward-compatibility (which is both plus and minus.)

Blaze is an open-source, high-performance C++ math library for dense and sparse arithmetic. With its state-of-the-art Smart Expression Template implementation Blaze combines the elegance and ease of use of a domain-specific language with HPC-grade performance, making it one of the most intuitive and fastest C++ math libraries available.

The Blaze library offers …

  • high performance through the integration of BLAS libraries and manually tuned HPC math kernels
  • vectorization by SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX-512, FMA, and SVML
  • parallel execution by OpenMP, HPX, C++11 threads and Boost threads
  • the intuitive and easy to use API of a domain specific language
  • unified arithmetic with dense and sparse vectors and matrices
  • thoroughly tested matrix and vector arithmetic
  • completely portable, high quality C++ source code

8 Elemental

Elemental is (was?) a hyped new entrant, supporting the newer things that modern matrix manipulation necessitates, including highly evolved built-in sparse/compressive support, matrix_factorisation_nonneg etc.

Elemental is an open-source library for distributed-memory dense and sparse-direct linear algebra and optimization which builds on top of BLAS, LAPACK, and MPI using modern C++ and additionally exposes interfaces to C and Python (with a Julia interface beginning development).

[…] Elemental supports a wide collection of sequential and distributed-memory operations, including support for dense and sparse-direct linear algebra, Linear, Quadratic and Second-Order Cone Programming, and lattice reduction. Furthermore, it supports such functionality for real and complex single-precision, double-precision, “double-double”, “quad-double”, quad-precision, and arbitrary-precision floating-point arithmetic.

The C++11 API is by far the most complete, but a large percentage of the library is also exposed to C and Python interfaces. Please see the README for an up-to-date list of unique functionality.

The development of Elemental has led to a number of research articles and is incorporated into a variety of scientific (e.g., libSkylark, PETSc, and CVXPY) and industrial projects (e.g., within Finite Element companies).

The putative homepage is alarmingly defunct, but there is an intimidating list of features (and dependencies) in the source code repo. However the last update was long enough ago that I presume this is a dead project.

9 nalgebra

A rust lapack interface/implementation:

Special feature: Compiles to WebAssembly (HT Emma Krantz for pointing this out).

10 faer

faer is a linear algebra library developed in the Rust programming language. It exposes a low level API, similar to the one offered by BLAS/Lapack libraries, modified to better match the capabilities of modern computers, and allows tweaking the inner parameters of the various algorithms.

A high level interface is planned in the future, once the basic building blocks are ready and a suitable API design is stabilized.

11 rapl

12 Geometry

The Computational Geometry Algorithms Library

CGAL is a software project that provides easy access to efficient and reliable geometric algorithms in the form of a C++ library. CGAL is used in various areas needing geometric computation, such as geographic information systems, computer aided design, molecular biology, medical imaging, computer graphics, and robotics.

The library offers data structures and algorithms like triangulations, Voronoi diagrams, Boolean operations on polygons and polyhedra, point set processing, arrangements of curves, surface and volume mesh generation, geometry processing, alpha shapes, convex hull algorithms, shape reconstruction, AABB and KD trees…

Learn more about CGAL by browsing through the Package Overview.

13 Automatic differentiation

See automatic differentiation

14 PDE Solvers

See solving all my problems.

15 GPU computation

Tensorflow again, numba, theano again, Arrayfire etc… Repeat this entire saga but with a slightly different toolchain. See GPU computation.

16 References