Arrays, presumably numerical, in julia.

## Iteration/indexing/etc

Julia arrays are 1-indexed, right?
No.
They have arbitrary indices.
Some care
is needed to support such (recently introduced) arbitrary indexing which is not reflected in all documentation online.
Best practice is not to count from `1:length(A)`

.

If you just need to visit each element, iterate using eachindex.

If you want to iterate by axes, instead of:

```
function mycopy!(dest::AbstractVector, src::AbstractVector)
length(dest) == length(src) || throw(
DimensionMismatch("vectors must match"))
# OK, now we’re safe to use @inbounds, right? (not anymore!)
for i = 1:length(src)
@inbounds dest[i] = src[i]
end
dest
end
```

you do:

```
function mycopy!(dest::AbstractVector, src::AbstractVector)
axes(dest) == axes(src) || throw(
DimensionMismatch("vectors must match"))
for i in LinearIndices(src)
@inbounds dest[i] = src[i]
end
dest
end
```

### Concatenation

Looking for the `stack`

operation, where you combine, e.g. 2D arrays into 3d arrays?
Use cat on a non-existent dimension, e.g.

`cat(vec_of_matrices...;dim=[3])`

There is an under-development keyword is the under-development that requires no `...`

.

### Generic multidimensional indexing

Generic multidimensional algorithms are not necessarily intuitive but are possible. There are several methods. Broadcasting is often simplest and fastest but requires some thought.

There seems to be stigma attached to using macros for indexing,
but there *is* a simple and transparent macro system called
Base.Cartesian.@generated;
See Mike Innes’ explanation.

If you want to avoid macros but you can’t work out how to get your broadcasts to line up, there is the CartesianIndices system, which is compact, general but not especially clear. Here is a surprisingly simple (sic) one-pole filter implementation

```
function expfiltdim(x, dim::Integer, α)
s = similar(x)
Rpre = CartesianIndices(size(x)[1:dim-1])
Rpost = CartesianIndices(size(x)[dim+1:end])
_expfilt!(s, x, α, Rpre, size(x, dim), Rpost)
end
function _expfilt!(s, x, α, Rpre, n, Rpost)
for Ipost in Rpost
# Initialize the first value along the filtered dimension
for Ipre in Rpre
s[Ipre, 1, Ipost] = x[Ipre, 1, Ipost]
end
# Handle all other entries
for i = 2:n
for Ipre in Rpre
s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] +
(1-α)*s[Ipre, i-1, Ipost]
end
end
end
s
end
```

🏗

You can avoid any of these for a couple of common cases. EllipsisNotation provides compact intuitive syntactic sugar if you wish to get at the first or last axis.

```
A[..,1] = [2 1 4 5
2 2 3 6]
A[..,2] = [3 2 6 5
3 2 6 6]
A[:,:,1] == [2 1 4 5
2 2 3 6] #true
```

For iterating over one index of an arbitrary array, mapslices is convenient, or for extracting a slice along an arbitrary index, selectdim.

There is also AxisAlgorithms, which does batch matrix multiplications and inversions for a couple of handy cases.

### Arrays are column-major

The *first* index should change fastest.
That is, walk over the *last* index in your *outer* loop, and apply operations to
slices comprising the *other* indices.

Multidimensional arrays in Julia are stored in column-major order. This means that arrays are stacked one column at a time. This can be verified using the vec function or the syntax

`[:]`

…`copy_col_row`

is much faster than`copy_row_col`

because it follows our rule of thumb that the first element to appear in a slice expression should be coupled with the inner-most loop. This is expected because copy_cols respects the column-based memory layout of the Matrix and fills it one column at a time. Additionally,… it follows our rule of thumb that the first element to appear in a slice expression should be coupled with the inner-most loop.

### Einstein summation

Einstein summation is a handy way of thinking about lots of tensor operations which is not as esoteric as it sounds, including, e.g. batch matrix multiply. TensorOperations.jl does some optimised einstein summations and is GPU-compatible. Einsum.jl does reputedly more gneral Einstein-type summations but with possibly less optimisation? mcabbott/Tullio.jl: ⅀ is the newest entrant:

Used by itself the macro writes ordinary nested loops much like

`Einsum.@einsum`

. One difference is that it can parse more expressions (such as convolution , and worse). Another is that it will use multi-threading (via`Threads.@spawn`

) and recursive tiling, on large enough arrays. But it also co-operates with various other packages, provided they are loaded before the macro is called

## Structured matrices

One of the best scientific computing packages is BandedMatrices.jl. It lets you directly define banded matrices and then overloads things like

`*`

and`\`

to use the right parts of BLAS. Compared to just using a sparse matrix (the standard MATLAB/Python way), this is SCREAMING FAST (the QR factorization difference is pretty big). Banded matrices show up everywhere in scientific computing (pretty much every PDE has a banded matrix operator). If it’s not a banded matrix, then you probably have a block-banded matrix, and thank god that @dlfivefifty also wrote BlockBandedMatrices.jl which will utilize the same speed tricks but expanded to block banded matrices (so, discretizations of systems of PDEs).

### Shunting to GPUs

Coding on GPUs for Julia is, at least in principle, easy. Thus you can possibly glean some nearly-free acceleration by using, for example, CuArrays to push computation onto GPUs. Simon Danisch has written the simplest examples of GPUs usage in julia. Some thought in how you structure algorithms is needed. AFAICT the hippest interface is the vendor-neutral GPUArrays which, e.g. (I think) for CUDA will transparently use CUArrays and CUDANative, for e.g. an NVIDIA GPU. You can also invoke them in a vendor-specific way, which looks easy. There is even activist support for google cloud TPU compilation. How well this vectorises in practice is not clear to me yet.

Ground zero for this appeas to be JuliaGPU. I am not up to speed on what is happening here at the moment.

GPUify loops provides “Support for writing loop-based code that executes both on CPU and GPU”, which might avoid some of the sharp edges of dot fusion type vectorisation.

### Dot fusion

a.k.a. broadcasting wizardry.

There are a couple of issues here; dot syntax in julia provides a means to both compactly express vectorized operations in julia and also to do smart broadcasting and operation fusing (which compiles functions down to vectorised kernels collapsing intermediate operations.) The details are non-obvious, but you can reportedly often ignore them and just use the @. macro to magically invoke it. In most cases it works transparently on your own functions too.

Note the manual says:

In Julia, vectorized functions are not required for performance, and indeed it is often beneficial to write your own loops (see Performance Tips), but they can still be convenient.

This is directly contradicted shortly after by the same manual and also the
ephemeral julia discussion on the net.
As some recent packages note
you probably *do* want to use this syntax for vector operations because it works
on GPUs, whereas using basic iteration syntax
might fail to parallelize to GPUs.

In practice some thought is required to get this working; One might need reshape to get the dimensions lined up, or more complicated indexing. Pankaj Mishra’s note includes an example of how you would do complicated indexing and mutation in parallel over sparse matrices using getindex and setindex!.

This is not a huge problem; vectorised syntax is not so opaque.

But what if I want to do *custom* broadcasting because, e.g. I can do some
operation over a grid faster using an FFT, so the FFT must be taken *when I broadcast*?
How do I define custom broadcasting
for this kind of case for some type in julia?

I’ll try to work this out from the moving parts in the ApproxFun system. 🏗

### Loop vectorisation

This library provides the @avx macro, which may be used to prefix a for loop or broadcast statement. It then tries to vectorize the loop to improve runtime performance.

The macro assumes that loop iterations can be reordered. It also currently supports simple nested loops, where loop bounds of inner loops are constant across iterations of the outer loop, and only a single loop at each level of noop lest. These limitations should be removed in a future version.