Julia arrays

December 31, 2019 — November 14, 2020

computers are awful
julia
number crunching
premature optimization

Arrays, presumably numerical, in julia.

1 Iteration/indexing/etc

Julia arrays are 1-indexed, right? Not necessarily: they can have arbitrary indices. Some care is needed to support this (relatively recently introduced) arbitrary indexing, which is not reflected in all the documentation online. Best practice is not to iterate over 1:length(A).

If you just need to visit each element, iterate using eachindex.
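
For instance, a sketch of an element sum that works for any index style; eachindex picks the most efficient index type for the array, whether linear or Cartesian:

function mysum(A::AbstractArray)
    s = zero(eltype(A))
    for i in eachindex(A)  # correct for any offset or index style
        s += A[i]
    end
    s
end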

If you want to iterate by axes, instead of:

function mycopy!(dest::AbstractVector, src::AbstractVector)
    length(dest) == length(src) || throw(
        DimensionMismatch("vectors must match"))
    # OK, now we’re safe to use @inbounds, right? (not anymore!)
    for i = 1:length(src)
        @inbounds dest[i] = src[i]
    end
    dest
end

you do:

function mycopy!(dest::AbstractVector, src::AbstractVector)
    axes(dest) == axes(src) || throw(
        DimensionMismatch("vectors must match"))
    for i in LinearIndices(src)
        @inbounds dest[i] = src[i]
    end
    dest
end

1.1 Concatenation

Looking for a stack operation, which combines e.g. 2D arrays into a 3D array? Use cat along a new dimension, e.g.

cat(vec_of_matrices...; dims=3)

There is an under-development stack function that requires no ... splat.
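
A minimal worked example of cat along a new dimension:

A = [1 2; 3 4]
B = [5 6; 7 8]
C = cat(A, B; dims=3)  # a 2×2×2 Array
C[:, :, 2] == B        # true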

1.2 Generic multidimensional indexing

Generic multidimensional algorithms are not necessarily intuitive but are possible. There are several methods. Broadcasting is often simplest and fastest but requires some thought.

There seems to be a stigma attached to using macros for indexing, but there is a simple and transparent macro system for this in Base.Cartesian, often combined with @generated functions; see Mike Innes’ explanation.

If you want to avoid macros but you can’t work out how to get your broadcasts to line up, there is the CartesianIndices system, which is compact and general, but not especially clear. Here is a surprisingly simple (sic) one-pole filter implementation:

function expfiltdim(x, dim::Integer, α)
    s = similar(x)
    # Cartesian ranges over the axes before and after the filtered dimension
    Rpre = CartesianIndices(size(x)[1:dim-1])
    Rpost = CartesianIndices(size(x)[dim+1:end])
    _expfilt!(s, x, α, Rpre, size(x, dim), Rpost)
end

function _expfilt!(s, x, α, Rpre, n, Rpost)
    for Ipost in Rpost
        # Initialize the first value along the filtered dimension
        for Ipre in Rpre
            s[Ipre, 1, Ipost] = x[Ipre, 1, Ipost]
        end
        # Handle all other entries
        for i = 2:n
            for Ipre in Rpre
                s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] +
                    (1-α)*s[Ipre, i-1, Ipost]
            end
        end
    end
    s
end

🏗

You can avoid all of these in a couple of common cases. EllipsisNotation provides compact, intuitive syntactic sugar if you wish to get at the first or last axes:

using EllipsisNotation
A = Array{Int}(undef, 2, 4, 2)

A[.., 1] = [2 1 4 5
            2 2 3 6]

A[.., 2] = [3 2 6 5
            3 2 6 6]

A[:, :, 1] == [2 1 4 5
               2 2 3 6]  # true

For applying a function along one axis of an arbitrary array, mapslices is convenient; for extracting a slice along an arbitrary axis, there is selectdim.
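
A quick sketch of both:

M = reshape(1:12, 3, 4)
mapslices(sum, M; dims=1)  # sum over each column, giving a 1×4 matrix
selectdim(M, 2, 3)         # a view of the 3rd slice along dimension 2, i.e. M[:, 3]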

There is also AxisAlgorithms, which does batch matrix multiplications and inversions for a couple of handy cases.

1.3 Arrays are column-major

The first index should change fastest. That is, walk over the last index in your outer loop, and apply operations to slices comprising the other indices.

As the manual’s performance tips put it, multidimensional arrays in Julia are stored in column-major order: arrays are stacked one column at a time, as can be verified using the vec function or the [:] syntax. Accordingly, copy_cols, which respects the column-based memory layout of the Matrix and fills it one column at a time, is much faster than copy_rows; likewise copy_col_row is much faster than copy_row_col, because it follows the rule of thumb that the first element to appear in a slice expression should be coupled with the inner-most loop.
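
As a sketch, that means putting the first index in the inner-most loop:

function fill_cols!(A::AbstractMatrix)
    for j in axes(A, 2)      # last index in the outer loop
        for i in axes(A, 1)  # first index changes fastest
            A[i, j] = i + j
        end
    end
    A
end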

1.4 Einstein summation

Einstein summation is a handy way of expressing many tensor operations, including e.g. batch matrix multiplication, and is not as esoteric as it sounds. TensorOperations.jl does some optimised Einstein summations and is GPU-compatible. Einsum.jl reputedly does more general Einstein-type summations, but with possibly less optimisation. mcabbott/Tullio.jl (⅀) is the newest entrant:

Used by itself the macro writes ordinary nested loops much like Einsum.@einsum. One difference is that it can parse more expressions (such as convolution, and worse). Another is that it will use multi-threading (via Threads.@spawn) and recursive tiling, on large enough arrays. But it also co-operates with various other packages, provided they are loaded before the macro is called.
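
For instance, a sketch of batch matrix multiplication in index notation, assuming Tullio.jl is installed; an index repeated on the right but absent on the left (here j) is summed over:

using Tullio
A = randn(3, 4, 10)
B = randn(4, 5, 10)
@tullio C[i, k, b] := A[i, j, b] * B[j, k, b]  # sum over j, batched over b
size(C) == (3, 5, 10)  # true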

2 Structured matrices

Chris says

One of the best scientific computing packages is BandedMatrices.jl. It lets you directly define banded matrices and then overloads things like * and \ to use the right parts of BLAS. Compared to just using a sparse matrix (the standard MATLAB/Python way), this is SCREAMING FAST (the QR factorization difference is pretty big). Banded matrices show up everywhere in scientific computing (pretty much every PDE has a banded matrix operator). If it’s not a banded matrix, then you probably have a block-banded matrix, and thank god that @dlfivefifty also wrote BlockBandedMatrices.jl which will utilize the same speed tricks but expanded to block banded matrices (so, discretizations of systems of PDEs).
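
A minimal sketch of what that buys you, assuming BandedMatrices.jl is installed; the (1, 1) argument keeps one sub- and one super-diagonal:

using BandedMatrices
A = BandedMatrix(rand(5, 5), (1, 1))  # tridiagonal view of a dense matrix
b = rand(5)
x = A \ b  # \ dispatches to banded-aware factorizations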

2.1 Shunting to GPUs

This section is out of date with high probability

Coding on GPUs for julia is, at least in principle, easy, so you can possibly glean some nearly-free acceleration by using, for example, CuArrays to push computation onto GPUs. Simon Danisch has written the simplest examples of GPU usage in julia. Some thought is needed in how you structure algorithms. AFAICT the hippest interface is the vendor-neutral GPUArrays, which (I think) will transparently use CuArrays and CUDAnative for an NVIDIA GPU. You can also invoke them in a vendor-specific way, which looks easy. There is even activist support for Google Cloud TPU compilation. How well this vectorises in practice is not clear to me yet.
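
A minimal sketch of the array-level workflow, assuming an NVIDIA GPU and the then-current CuArrays package:

using CuArrays
x = cu(rand(Float32, 1_000_000))  # copy the data to the device
y = 2f0 .* x .+ 1f0               # the broadcast fuses into a single GPU kernel
sum(y)                            # reductions also run on the device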

Ground zero for this appears to be JuliaGPU. I am not up to speed on what is happening there at the moment.

GPUifyLoops provides “Support for writing loop-based code that executes both on CPU and GPU”, which might avoid some of the sharp edges of dot-fusion-style vectorisation.

2.2 Dot fusion

a.k.a. broadcasting wizardry.

Dot syntax in julia provides a means both to compactly express vectorized operations and to do smart broadcasting and operation fusing, which compiles chains of elementwise functions down to a single vectorised kernel, collapsing the intermediate operations. The details are non-obvious, but you can reportedly often ignore them and just use the @. macro to magically invoke it. In most cases it works transparently on your own functions too.
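
For example:

x = rand(1000)
y = similar(x)
# @. dots every call and assignment: this becomes y .= sin.(x).^2 .+ cos.(x).^2,
# one fused loop with no intermediate arrays
@. y = sin(x)^2 + cos(x)^2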

Note the manual says:

In Julia, vectorized functions are not required for performance, and indeed it is often beneficial to write your own loops (see Performance Tips), but they can still be convenient.

This is directly contradicted shortly afterwards by the same manual, and also by the ephemeral julia discussion on the net. As some recent packages note, you probably do want to use this syntax for vector operations, because it works on GPUs, whereas basic iteration syntax might fail to parallelise to GPUs.

In practice some thought is required to get this working; one might need reshape to get the dimensions lined up, or more complicated indexing. Pankaj Mishra’s note includes an example of how to do complicated indexing and mutation in parallel over sparse matrices using getindex and setindex!.

This is not a huge problem; vectorised syntax is not so opaque.

But what if I want to do custom broadcasting because, e.g., I can do some operation over a grid faster using an FFT, so the FFT must be applied whenever I broadcast? How do I define custom broadcasting of this kind for some type in julia?

I’ll try to work this out from the moving parts in the ApproxFun system. 🏗
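
In the meantime, the generic hooks from the manual’s Interfaces chapter look like this; MyGrid is a hypothetical wrapper type, and the FFT fast path would live in a copyto! method over the Broadcasted tree:

struct MyGrid{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(g::MyGrid) = size(g.data)
Base.getindex(g::MyGrid, i::Int) = g.data[i]

# opt this type in to its own broadcast style...
Base.BroadcastStyle(::Type{<:MyGrid}) = Broadcast.ArrayStyle{MyGrid}()

# ...and say how to allocate the output, so MyGrid(...) .+ 1 returns a MyGrid
function Base.similar(bc::Broadcast.Broadcasted{Broadcast.ArrayStyle{MyGrid}},
                      ::Type{T}) where T
    MyGrid(similar(Vector{T}, axes(bc)))
end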

2.3 Loop vectorisation

LoopVectorization

This library provides the @avx macro, which may be used to prefix a for loop or broadcast statement. It then tries to vectorize the loop to improve runtime performance.

The macro assumes that loop iterations can be reordered. It also currently supports only simple nested loops, where the loop bounds of inner loops are constant across iterations of the outer loop, and only a single loop at each level of the loop nest. These limitations should be removed in a future version.
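
A minimal sketch, assuming LoopVectorization.jl is installed (in later versions @avx was renamed @turbo):

using LoopVectorization

function mydot(a, b)
    s = zero(eltype(a))
    @avx for i in eachindex(a, b)  # vectorised: iterations may be reordered
        s += a[i] * b[i]
    end
    s
end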