The matrix mathematics side of Python is not as comprehensive as MATLAB's.
Most of the tools are built on core matrix libraries such as scipy, which mostly do what I want.
But when, for example, I really need some fancy spline type I read about on the internet,
or I want someone to have really put in some effort on ensuring such-and-such a recursive filter is stable,
I might find I need to do it myself.
numpy, though, gives us all the classic Fortran linear algebra libraries.
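As a minimal sketch of what that buys us, the following solves a small linear system and computes a symmetric eigendecomposition; both calls are documented as being backed by LAPACK routines. The matrix and vector values are made up for illustration.

```python
import numpy as np

# Hypothetical 2x2 symmetric system, purely for illustration.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Solve A x = b directly rather than forming the inverse of A.
x = np.linalg.solve(A, b)

# Eigendecomposition specialised for symmetric (Hermitian) matrices.
eigvals, eigvecs = np.linalg.eigh(A)

print(x)
print(eigvals)
```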
There are several underlying numerics libraries which can be invoked from python, as with any language with a decent FFI.
For example, Tensorflow will invoke Eigen.
PyArmadillo invokes Armadillo.
See also the strange adjacent system of GPU libraries.
Aside: A lot of useful machine-learning-type functionality, which I won’t discuss in detail here, exists in the Python deep learning toolkits such as Tensorflow, pytorch and jax; you might want to check those pages too. Also, graphing is a whole separate issue, as is optimisation.
Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. …
- Create N-dimensional arrays with any NumPy dtype.
- Chunk arrays along any dimension.
- Compress and/or filter chunks using any NumCodecs codec.
- Store arrays in memory, on disk, inside a Zip file, on S3, …
- Read an array concurrently from multiple threads or processes.
- Write to an array concurrently from multiple threads or processes.
- Organize arrays into hierarchies via groups.
Resembles HDF5 but makes fewer assumptions about the storage backend and more assumptions about the language frontend.
Dask is a flexible library for parallel computing in Python.
howto for some examples of use cases.
Dask is composed of two parts:
- Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
- “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.
Dask emphasizes the following virtues:
- Familiar: Provides parallelized NumPy array and Pandas DataFrame objects.
- Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
- Native: Enables distributed computing in pure Python with access to the PyData stack.
- Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms.
- Scales up: Runs resiliently on clusters with 1000s of cores.
- Scales down: Trivial to set up and run on a laptop in a single process.
- Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans.
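To make the "parallel arrays on top of a task scheduler" idea concrete, here is a small sketch using dask.array, assuming dask is installed. The array size and chunking are arbitrary; nothing is computed until compute() is called.

```python
import dask.array as da

# A 2000x2000 array of ones, held lazily as sixteen 500x500 blocks.
x = da.ones((2000, 2000), chunks=(500, 500))

# NumPy-like expressions only build a task graph at this point.
y = (x + x.T).mean(axis=0)

# compute() hands the graph to a scheduler and returns a NumPy array.
result = y.compute()

print(x.numblocks, result.shape)
```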
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
I am not sure what the value proposition of PyTables is compared to h5py, which is fast and easy. It seems to add a layer of complexity that h5py does not have, and I am not sure what this gains me. Maybe better handling of string data or something?
Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.
This data model is borrowed from the netCDF file format, which also provides xarray with a natural and portable serialization format. NetCDF is very popular in the geosciences, and there are existing libraries for reading and writing netCDF in many programming languages, including Python.
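A short sketch of what labelled dimensions and coordinates look like in practice, assuming xarray is installed. The variable names, cities and values are invented for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical data: two years of readings for three cities.
temps = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("time", "city"),
    coords={"time": [2020, 2021], "city": ["Oslo", "Lima", "Perth"]},
    attrs={"units": "degC"},
)

# Select by label rather than by integer position.
lima_2021 = temps.sel(time=2021, city="Lima").item()

# Reduce over a named dimension instead of remembering an axis number.
city_means = temps.mean(dim="time")

print(lima_2021, city_means.values)
```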
This system seems to be a developing lingua franca for the differentiable learning frameworks.
e.g. for performance or invoking external binaries.
See compiling python.
Displaying numbers legibly
Easy, but documentation is hard to find.
A format specification such as {:10.4f} means “with 4 decimal places, align x to fill 10 columns”.
All conceivable alternatives are displayed at pyformat.info.
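The width-and-precision reading described above corresponds, as far as I can tell, to a format specification like 10.4f. A minimal sketch using only the standard library; the value of x is arbitrary.

```python
import math

x = math.pi  # arbitrary example value

# Width 10, 4 digits after the decimal point, fixed-point notation.
padded = f"{x:10.4f}"
print(repr(padded))  # '    3.1416'

# The same specification works with str.format and the format() builtin.
same = "{:10.4f}".format(x) == format(x, "10.4f") == padded

# "<" left-aligns within the field instead of the default right alignment.
left = f"{x:<10.4f}|"
print(left)  # 3.1416    |
```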
How I set my numpy arrays to be displayed big and informative:
np.set_printoptions(edgeitems=5, linewidth=85, precision=4, suppress=True, threshold=500)
Reset to default:
np.set_printoptions(edgeitems=3, infstr='inf', linewidth=75, nanstr='nan', precision=8, suppress=False, threshold=1000, formatter=None)
np.array_str for one-off formatting.
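If changing the global print options feels too heavy-handed, a scoped alternative is the np.printoptions context manager (available in reasonably recent numpy versions). A sketch alongside np.array_str, with made-up array values:

```python
import numpy as np

a = np.array([1.23456789, 1e-9, 1234.5])  # arbitrary example values

# Settings apply only inside the with-block; globals are restored afterwards.
with np.printoptions(precision=4, suppress=True):
    inside = str(a)

# np.array_str formats a single array without touching global state.
one_off = np.array_str(a, precision=2, suppress_small=True)

outside = str(a)  # formatted with whatever the global options are

print(inside)
print(one_off)
print(outside)
```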
Local random number generator state
⚠️ this is out of date now; the new RNG API is much better.
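For comparison with the legacy approach that follows, here is a minimal sketch of the newer Generator API, in which the RNG state is an explicit object rather than a hidden global. The function noisy_sum is hypothetical, invented only to show state passing.

```python
import numpy as np


def noisy_sum(n, rng):
    """Hypothetical function that takes its randomness as an argument."""
    return rng.normal(size=n).sum()


# Each Generator carries its own state; the global RNG is never touched.
rng_a = np.random.default_rng(5)
rng_b = np.random.default_rng(5)

first = noisy_sum(3, rng_a)
second = noisy_sum(3, rng_b)  # same seed, so the same stream of draws

# Independent child streams, e.g. one per parallel worker.
children = [np.random.default_rng(s) for s in np.random.SeedSequence(5).spawn(2)]
draws = [child.random() for child in children]
```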
Seeding your RNG can be a pain in the arse,
especially if you are interfacing with an external library
that doesn’t have RNG state passing in the API.
So, use a context manager.
Here’s one that works for numpy’s global RNG:
import numpy as np
from numpy.random import get_state, set_state, seed


class Seed(object):
    """
    context manager for reproducible seeding.

    >>> with Seed(5):
    ...     print(np.random.rand())
    0.22199317108973948
    """

    def __init__(self, seed):
        self.seed = seed
        self.state = None

    def __enter__(self):
        # Remember the global RNG state, then reseed it.
        self.state = get_state()
        seed(self.seed)

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the previous state, even if the block raised.
        set_state(self.state)
Exercise for the student: make it work with the default RNG also.
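One possible answer to the exercise, reading "the default RNG" as Python's built-in random module (that reading is my assumption). The name py_seed is made up; the sketch uses only the standard library.

```python
import random
from contextlib import contextmanager


@contextmanager
def py_seed(seed):
    """Temporarily seed the stdlib `random` module, then restore its state."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        # Runs even if the with-block raises.
        random.setstate(state)


with py_seed(5):
    a = random.random()
with py_seed(5):
    b = random.random()

print(a == b)  # True: both blocks start from the same seed
```

Using try/finally inside a generator-based context manager gives the same restore-on-exit behaviour as the __exit__ method in the class version.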