Python packaging, dependency management and isolation



The actual experience of managing python packages.

How do I install the right versions of everything for some python code I am developing? How do I deploy that sustainably? How do I share it with others? There are two problems here: installing the right package dependencies, and keeping the right dependency versions for this project. In python there are various integrated solutions that solve these two problems at once, with varying degrees of success. None of this is especially hard, but it is confusing and chaotic thanks to many long-running disputes, some of which have lately been resolved and many of which will probably stay with us forever.

In the before-times there were many python packaging standards. Distutils and what-not. AFAICT, unless I am migrating extremely old code I should ignore everything about these.

Sorry, GPU/TPU/etc users

tl;dr: only pip and conda support hardware specification in practice. Users of GPUs must ignore the other options, no matter how attractive they might seem at first glance.

Many packages specify local versions for particular architectures as a part of their functionality. For example, pytorch comes in various flavours, which are selected like this when using pip:

# CPU flavour
pip install torch==1.10.0+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
# GPU flavour
pip install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

The local version is given by the +cpu or +cu113 bit, and it completely changes what code will be executed when using these packages. Specifying a GPU version is essential for many machine learning projects (essential, that is, if I do not want my code to run orders of magnitude slower). How this can be controlled within the python packaging ecosystem is contentious and complicated, and thus not supported by any of the new wave options like poetry or pipenv. Brian Wilson argues:

During my dive into the open-source abyss that is ML packages and +localVersions I discovered lots of people have strong opinions about what it should not be and like to tell other people they're wrong. Other people with opinions about what it could be are too afraid of voicing them lest there be some unintended consequence. PSF has asserted what they believe to be the intended state in PEP-440 (no local versions published) but the solution (PEP-459) is not an ML Model friendly solution because the installation providers (pip, pipenv, poetry) don’t have enough standardized hooks into the underlying hardware (cpu vs gpu vs cuda lib stack) to even understand which version to pull, let alone the Herculean effort it would take to get even just pytorch to update their package metadata.

There is no evidence that this logjam will resolve any time soon. Since I do neural network stuff and thus use GPU/CPU versions of packages, this means I can effectively ignore most of the python environment alternatives on this page, except conda and pip, which support the local version system de facto; if they are less smooth or pleasant than the newer systems, at least I am not alone.

Writing a package

pip

The default python package installer. It is best spelled as

python -m pip install package_name

To snapshot dependencies:

python -m pip freeze > requirements.txt

To restore dependencies:

python -m pip install -r requirements.txt
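For illustration, a requirements.txt captured this way is just a list of pinned packages, one per line (the packages and versions here are hypothetical):

# example pins as produced by pip freeze (illustrative only)
numpy==1.24.2
requests==2.31.0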

Plain pip, usually inside a venv, works. It is a good default choice: a widely supported and adequate, if not awesome, workflow.

pipx

Pro tip: pipx:

pipx is made specifically for application installation, as it adds isolation yet still makes the apps available in your shell: pipx creates an isolated environment for each application and its associated packages.

That is, pipx is an application that installs global applications for you. (There is a bootstrapping problem: How to install pipx itself.)
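A minimal sketch of one way to bootstrap pipx and install an application with it (black here is just an example app):

# bootstrap pipx itself with plain pip
python -m pip install --user pipx
python -m pipx ensurepath
# each app then gets its own isolated venv, but its executable lands on the PATH
pipx install black
pipx list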

Save space

pip has a heavy cache overhead. If disk space is at a premium, I invoke it as pip --no-cache-dir.
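For example:

# skip the wheel cache for this install
python -m pip install --no-cache-dir -r requirements.txt
# or empty an existing cache
python -m pip cache purge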

Anaconda

A parallel system to pip, designed to do all the work of installing scientific python software with hefty compiled dependencies.

There are two parts here, with two separate licenses:

  1. the anaconda python distribution
  2. the conda python package manager.

I am slightly confused about how these two relate (can I install a non-anaconda python distribution through the conda package manager?). The distinction is important, since licensing anaconda can be expensive.

Some things that are (or were?) painful to install by pip are painless via conda. Contrariwise, some things that are painful to install by conda are easy by pip.

I recommend working out by trial and error which pain points are worse for your situation. Sometimes it is worth the administrative burden of understanding conda’s current licensing and future licensing risks, but if conda does not bring substantial value, choose pip.

This is an updated recommendation; previously I preferred conda — pip used to be much worse, and anaconda’s licensing used to be less restrictive. Now I think anaconda cannot be relied upon IP-wise.

Setup

Download e.g. Linux x64 Miniconda, from the download page.

bash Miniconda3-latest-Linux-x86_64.sh
# login/logout here
# or do something like `exec bash -` if you are fancy
# Less aggressive conda
conda config --set auto_activate_base false
# conda for fish users
conda init fish

Alternatively, try miniforge, a conda-forge distribution, or fastchan, fast.ai’s conda mini-distribution.

curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh

It is very much worth installing one of these minimalist distributions rather than the default anaconda distro: the anaconda default is gigantic but nonetheless never seems to have what I need, so it simply wastes space. Some of these might additionally have less onerous licensing than the mainline? I am not sure.

If I want to install something with tricky dependencies like ViTables, I do this:

conda install pytables=3.2
conda install pyqt=4

Aside: I use fish shell, so need to do some extra setup. Specifically, I add the line

source (conda info --root)/etc/fish/conf.d/conda.fish

into ~/.config/fish/config.fish. These days this is automated by

conda init fish

For jupyter compatibility one needs

conda install nb_conda_kernels

Dependencies

The main selling point of conda is that specifying dependencies for ad hoc python scripts or packages is easy.

Conda has a slightly different dependency management and packaging workflow than the pip ecosystem. See, e.g. Tim Hopper’s explanation of this environment.yml malarkey, or the creators’ rationale and manual.

One exports the current conda environment config, by convention, into environment.yml, and can recreate the environment from that file:

conda env export > environment.yml
conda env create --file environment.yml
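A hand-written environment.yml can be far leaner than an exported one; something like this (names and versions are illustrative, and some-pip-only-package is hypothetical):

name: myenv
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - pip
  - pip:
      - some-pip-only-package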

Which to use out of conda env create and conda create? If it involves .yaml environment configs, then conda env create. The differences in capabilities, and the confusing errors when you pick the wrong one, are a quagmire of opaque messages, bad documentation and sadness.

One point of friction that I rapidly encountered is that the automatically-created environments are not terribly generic; I might specify from the command line a package that I know will install sanely on any platform (matplotlib, say), but the version as stored in the environment file is specific to the platform where I installed it (macos, linux, windows…) and the architecture (x64, ARM…). For GPU software there are even more incompatibilities, because there are more choices of architecture. So to share environments with collaborators on different platforms, I need to… be them, I guess? Buy them new laptops that match my laptop? idk, this seems weird; maybe I’m missing something.
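One partial mitigation, if I am reading the conda docs correctly, is to export only the packages I explicitly asked for, without build strings or platform-pinned versions:

# export only explicitly requested packages, no builds
conda env export --from-history > environment.yml

The resulting file is coarser, but has a better chance of resolving on a collaborator’s platform.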

Save space

NB Conda will fill up my hard disk if not regularly disciplined via conda clean.

conda clean --all --yes

If I have limited space in my home dir, I might need to move the package cache by configuring pkgs_dirs in ~/.condarc:

conda config --add pkgs_dirs /SOME/OTHER/PATH/.conda

Possibly also required?

chmod a-rwx ~/.conda

No MKL

Not being a fan, I might also want to avoid installing the gigantic MKL library. It comes baked in by default with most anaconda installs. I can usually disable it by request:

conda create -n pynomkl python nomkl

Clearly the packagers do not test this configuration very often, because it sometimes fails even for packages which notionally do not need MKL. It is worth attempting, however. Between the various versions and installed copies, MKL alone was using about 10GB total on my mac when I last checked. I also try to reduce the number of copies of MKL by starting from miniconda as my base anaconda distribution and cautiously adding things as I need them.

Local environment

A local environment folder is more isolated: it keeps packages in a folder inside the project, rather than keeping all environments somewhere global where I need to remember what I named them all.

conda config --set env_prompt '({name})'
conda env create --prefix ./env/myenv --file environment_linux.yml
conda activate ./env/myenv

Gotcha: in fish shell the first line needs to be

conda config --set env_prompt '\({name}\)'

I am not sure why. AFAIK, fish command substitution does not happen inside strings. Either way, this will add the line

env_prompt: ({name})

to .condarc.

Robocorp

Robocorp tools claim to make conda installs more generic.

RCC is a command-line tool that allows you to create, manage, and distribute Python-based self-contained automation packages - or robots 🤖 as we call them.

Together with the robot.yaml configuration file, rcc provides the foundation to build and share automation with ease.

In short, the RCC toolchain helps you to get rid of the phrase: "Works on my machine" so that you can actually build and run your robots more freely.

Mamba

Mamba is a fully compatible drop-in replacement for conda. It was started in 2019 by Wolf Vollprecht. It can be installed as a conda package with the command conda install mamba -c conda-forge.

The introductory blog post is an enlightening read, which also explains conda better than conda explains itself.

See mamba-org/mamba for more.
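Because it is a drop-in replacement, the usual conda commands should work with mamba swapped in; a quick sketch (environment name and packages are arbitrary):

mamba create -n myenv python=3.10 numpy
mamba install -n myenv -c conda-forge jupyterlab
# environment files should also work
mamba env create --file environment.yml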

It explicitly targets package installation for less mainstream configurations such as R, and vscode development environments.

Provide a convenient way to install developer tools in VSCode workspaces from conda-forge with micromamba. Get NodeJS, Go, Rust, Python or JupyterLab installed by running a single command.

It also inherits some of the debilities of conda, e.g. that dependencies are platform- and architecture-specific.

venv

venv is now a built-in python virtual environment system in python 3. It does not support python 2, but fixes various problems: for example, it supports framework python on macOS, which is important for GUIs. It is covered by the python docs in the python virtual environment introduction.

# Create venv
python3 -m venv ./venv --prompt some_arbitrary_name
# or, if we want to use system packages:
python3 -m venv ./venv --prompt some_arbitrary_name --system-site-packages
# Activate the venv from fish
source ./venv/bin/activate.fish
# Activate the venv from bash
source ./venv/bin/activate

pyenv

pyenv is the core tool of an ecosystem which eases and automates switching between python versions. It manages python itself, and thus implicitly can be used as a manager for all the other managers. It is the new new hipness, at least on platforms other than windows, where it does not work.

BUT WHO MANAGES THE VIRTUALENV MANAGER MANAGER? Also, what is going on in this ecosystem of bits? Logan Jones explains:

  • pyenv manages multiple versions of Python itself.
  • virtualenv/venv manages virtual environments for a specific Python version.
  • pyenv-virtualenv manages virtual environments across varying versions of Python.

Anyway, pyenv compiles a custom version of python and as such is extremely isolated from everything else. Here is an introduction with emphasis on my area: Intro to Pyenv for Machine Learning.

Of course, because this is adjacent to the python packaging ecosystem, it immediately becomes complicated and confusing when you try to interact with the rest of the ecosystem, e.g.,

pyenv-virtualenvwrapper is different from pyenv-virtualenv, which provides extended commands like pyenv virtualenv 3.4.1 project_name to directly help out with managing virtualenvs. pyenv-virtualenvwrapper helps in interacting with virtualenvwrapper, but pyenv-virtualenv provides more convenient commands, where virtualenvs are first-class pyenv versions, that can be (de)activated. That’s to say, pyenv and virtualenvwrapper are still separated while pyenv-virtualenv is a nice combination.

Huh. I am already too bored to think. However, I did work out the commands which installed a pyenv tensorflow with an isolated virtualenv:

brew install pyenv pyenv-virtualenv
pyenv install 3.8.6
pyenv virtualenv 3.8.6 tf2.4
pyenv activate tf2.4
pip install --upgrade pip wheel
pip install 'tensorflow-probability>=0.12' 'tensorflow<2.5' jupyter

For fish shell you need to add some special lines to config.fish:

set -x PYENV_ROOT $HOME/.pyenv
set -x PATH $PYENV_ROOT/bin $PATH
## fish <3.1
# status --is-interactive; and . (pyenv init -|psub)
# status --is-interactive; and . (pyenv virtualenv-init -|psub)
## fish >=3.1
status --is-interactive; and pyenv init - | source
status --is-interactive; and pyenv virtualenv-init - | source

Poetry

No! Wait! The new new new hipness is poetry. All the previous hipnesses were not the real eternal ultimate hipness that transcends time. I know we said this every previous time, but this time it’s real and our love will last forever.

This look will be forever.

⛔️⛔️UPDATE⛔️⛔️: OK, turns out this love was not actually quite as eternal as it seemed. Lovely elegant design does not make up for the fact that the project is logjammed and broken in various ongoing ways; see Issue #4595: Governance—or, “what do we do with all these pull requests?”. It might be usable if your needs are modest or you are prepared to jump into the project discord, which seems to be where the poetry hobbyists organise; but since I want to use this project merely incidentally, as a tool to develop something else, a hobbyist level of engagement is not something I can sustain. poetry is not ready for prime-time.

⛔️⛔️UPDATE 2⛔️⛔️: As mentioned above, the poetry system does not support “local versions” and thus in practice cannot be used for machine learning applications. This project is dead to me. Do bear in mind that my opinions will become increasingly outdated depending on when you read this, but I have expended all the effort I can afford on this project and it cannot do my main job for me, so all the niceties are useless IMO.

Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and it will manage (install/update) them for you.

From the introduction:

Packaging systems and dependency management in Python are rather convoluted and hard to understand for newcomers. Even for seasoned developers it might be cumbersome at times to create all files needed in a Python project: setup.py, requirements.txt, setup.cfg, MANIFEST.in and the newly added Pipfile.

So I wanted a tool that would limit everything to a single configuration file to do: dependency management, packaging and publishing.

It takes inspiration in tools that exist in other languages, like composer (PHP) or cargo (Rust).

And, finally, I started poetry to bring another exhaustive dependency resolver to the Python community apart from Conda’s.

What about Pipenv?

In short: I do not like the CLI it provides, or some of the decisions made, and I think we can make a better and more intuitive one.

Editorial side-note: Low-key dissing on similarly-dysfunctional, competing projects is an important part of python packaging.

Lazy install is via this terrifying command line (do not run it if you do not know what it does):

curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/install-poetry.py | python -

Poetry could be regarded as similar to pipenv, in that it (per default, but not necessarily) manages the local dependencies in a local venv. It takes a much more full-service approach than systems built on pip. For example, it has its own dependency resolver, which makes use of modern dependency metadata but will also fall back to brute force on older dependency specifications if needed. It separates the dependencies I specify from the ones it contingently resolves in practice, which means that dependencies seem to transport much better than with conda, which generally requires me to hand-maintain a special dependency file containing just the stuff I actually wanted. In practice the many small conveniences and the thoughtful workflow are helpful. For example, it sets up the current package for development per default, so that imports work as similarly as possible in this local environment and when the package is distributed to users.
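For context, a typical poetry workflow looks something like this (package and script names are illustrative):

# create pyproject.toml interactively, or use `poetry new` for a fresh project
poetry init
# add a dependency; updates pyproject.toml and poetry.lock
poetry add requests
# install everything, including the current project
poetry install
# run a command inside the managed venv
poetry run python my_script.py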

Recommended config:

poetry config virtualenvs.create true
poetry config virtualenvs.in-project true  # local venvs are easier for my brain.

However, poetry does not support installing build variants/profiles, which means I cannot install GPU software, so it is useless to me.

pipenv

⛔️⛔️UPDATE⛔️⛔️: Note that the pipenv system does not support “local versions” and thus in practice cannot be used for machine learning applications. This project is dead to me. (Bear in mind that my opinions will become increasingly outdated depending on when you read this.)

venv has a higher-level, er, …wrapper (?) interface (?) called pipenv.

Pipenv is a production-ready tool that aims to bring the best of all packaging worlds to the Python world. It harnesses Pipfile, pip, and virtualenv into one single command.

I switched to pipenv from poetry because it looked like it might be less chaotic than poetry. I think it is, although the race is close.

HOWEVER, it is still pretty awful. TBH, I would just use plain pip and requirements.txt which, while it is primitive and broken, is at least broken and primitive in a well-understood way.

At the time of writing, the pipenv website was 3 weeks into an outage, because dependency management is a quagmire of sadness and comically broken management with a terrible bus factor. However, the backup docs site is semi-functional, albeit too curt to be useful and AFAICT outdated. The documentation site inside github is readable.

The dependency resolver is, as the poetry devs point out, broken in its own special ways. The procedure to install modern ML frameworks, for example, is gruelling.

Here is an introduction showing pipenv and venv used together.

For my setup, the important configuration settings are:

export WORKON_HOME=~/.venvs

To get the venv inside the project (required for sanity on my HPC system) I need the following:

export PIPENV_VENV_IN_PROJECT=1

Pipenv will automatically load dotenv files which is a nice touch.
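For reference, the basic pipenv workflow (assuming pipenv is already installed) is something like this, with package and script names illustrative:

# add a dependency to the Pipfile and the local venv
pipenv install requests
# pin exact versions into Pipfile.lock
pipenv lock
# run inside the venv, or open a shell in it
pipenv run python my_script.py
pipenv shell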

Generic dependency managers

Does python’s ongoing slapstick shambles of a failed consensus on dependency management fill you with distrust? Do you have the vague feeling that perhaps you should use something else to manage python, since python cannot manage itself? See generic dependency managers.

