The Jupyter Cinematic Universe

A constellation of somewhat-compatible technologies from which we can extract a compromise between 1) actually doing data science, 2) seeming to laypeople to be doing data science, and 3) getting begrudging IT support

2017-02-09 — 2024-12-04

faster pussycat

premature optimization

python


A famous python-derived entrant in the scientific workbook field is called jupyter. Because of its age, it is a default for many people, and so it is worth knowing about.

Interactive “notebook” computing for various languages; python/julia/R/whatever plugs into the “kernel” interface. Jupyter allows easy(ish) online-friendly worksheets, which are both interactive and easy to export for static online use. This is handy. Handy enough that it’s sometimes worth the many rough spots, and so I conquer my discomfort and use it.

For all that Jupyter was important in advancing the field in 2014, I am notably unimpressed by Jupyter in the modern era.

Unless you have a compelling reason to use jupyter specifically, why not use something more modern?

For python: I prefer marimo
For julia: I prefer Pluto.jl, or VS Code with the julia extension
For R: RStudio, quarto etc.
Other languages: …

1 Why jupyter?

What does jupyter buy us? Is it worth the set-up time spent configuring this contraption?

Jupyter is a de facto standard for running remote computation jobs interactively. The browser-based, network-friendly jupyter notebook is a natural, easy way to execute tedious computations on some other computer somewhere else, with some kind of a paper trail. In particular, it is much better over unreliable networks than are remote terminals or remote desktops, because the client/server architecture doesn’t need to do so many round-trips to get the code output back to the user. A good feature of jupyter (maybe its best) is a kind of re-designed network terminal. Certainly, if what we need to do could be executed over remote desktop or jupyter, jupyter is going to be less awful over laggy network connections, when every mouse click and keystroke involves waiting and twiddling your fingers.

What else? People make UX arguments, e.g. that jupyter is friendly and supports interactive plots and so on. I am personally ambivalent about those arguments. Jupyter can do some things better than the console. That artificially restricted comparison does not reassure me; we are not limited to the console. Indeed, most things that Jupyter does, it does worse than a proper IDE or decent code editor. Sometimes those other tools are not available on, say, the local HPC cluster or cloud compute environment, and then this becomes a relevant advantage. Usually though, we can install VS Code Remote, unless we have angered the sysadmins. If we really want a quick minimalist UI, Marimo is probably better.

These factors (UI, remote-network-friendliness) were both game-changing in 2014, and I think people are still excited about this element of jupyter, even though there are greatly superior alternatives around, some of them also free and open source, which build upon the same infrastructure but throw out the annoying python front-end.

But for now the main takeaway, I think, is that if, like me, you are confused by jupyter enthusiasts claiming it is “easy” or “fun”, it may make more sense if you mentally append the proviso “…in comparison to some other horrible thing which I was forced to use by ignorance or circumstance several years ago.” It’s like your grandad espousing a fax machine because it’s so much better than a telex machine. That comparison is true, just not very relevant.

There are other comparisons to make; some like jupyter as a documentation format/literate coding environment. Once again, sure, it is better than text files. But then Quarto is more portable, and VS Code notebook mode or marimo do the job better.

We can generically answer “Is it worth it?” with “That depends on the alternatives. Jupyter is adequate, and commonly available.”

2 Alternatives

My ambivalence about jupyter leads me to consider other options for interactive code execution in python.

First, be aware there are many variant front-ends to jupyter, which ameliorate some of the pain points of the jupyter notebook interface, e.g. quarto uses jupyter python kernels, but disregards the jupyter notebook interface in favour of a more traditional document-based interface.

But also! There are other interactive python environments which entirely re-imagine the python notebook concept.

2.1 Marimo

marimo is a python-specific notebook alternative which solves many pain points of jupyter (HT Jean-Michel Perraud). I have now used it for practical projects and it is better than jupyter for all my needs where it is available, which is most places.

2.2 codebraid

codebraid, a “live code”-style reworking of the jupyter notebook concept.

3 Engineering pain points

What technical difficulties can we expect when using jupyter?

Not to besmirch the efforts of the jupyter developers who are doing a difficult thing, in many cases for free, but I will complain about jupyter notebook with the justification that it is best to go into these things with your eyes open.

Jupyter is often touted as a wonderful solution for data science which makes stuff generally easier. This is probably true at certain points on the learning curve, and for individuals at very specific points on their python learning journey. It is not true in general. Rather than simply easing away difficulties, jupyter introduces all manner of difficulties of its own, in which the pain points are often surprising and novel, which does not make them better.

I’m an equivocal advocate of the jupyter notebook interface, which some days seems to counteract every plus with a minus, but at least is a pretty easy starting point compared to something scary-looking like the command-line. My qualms are partly due to the particulars of jupyter’s design decisions, and partly because of the problems of notebook interfaces generally (Chattopadhyay et al. 2020).

Jupyter:

…is friendly to use
- but hard to install
…makes it easy to explore my data with graphs and such
- but hard to keep that exploration in version control
…lets me type code in easily
- but makes it hard to work out why it is not working because jupyter clashes with the fancy debugger that would make it easy to explore my code bugs.
…is open source, and written in an easy scripting language, python, so it seems it should be easy to tweak to taste
- but in practice it is a nightmare to extend. It’s an ill-explained spaghetti of python, javascript, compiled libraries and browsers that relate to one another in obscure ways that few people with a day job have time to understand or contribute to.

People are not blind to these problems. There have been many reboots, rewrites, re-architectures, and reorganisations, so many that it’s hard to know what is going on. Each line of development takes place in a separate timeline in an extended cinematic multiverse, wherein the writers’ rooms are occasionally merged, but the timelines are never reconciled. Things regularly break at either the server or the client side, and I might need to upgrade either or both to fix it. I might have many different installs of each and need to upgrade a half-dozen different installs to keep them all working, because jupyter is deeply intertwined with the pain points of python packaging hell and in many ways makes them worse by multiplying the number of python environments I need to manage. It claims to be extensible, but if I use any extensions, it is a constant struggle to keep jupyter finding the many intricate dependencies that are needed to keep the entire contraption running. The sum total is IMO no easier to run than most of the other UI development messes that we tolerate in academic software. Case study: look, a dependency of a dependency of the autocomplete function broke something and thus spawned a multi-month confusion of cascading problems, which cost me several hours to fix across the few dozen different python environments I manage across several computers. This kind of tedious intermittent breakage is very much the cost of doing business with jupyter, and has been so for as long as I have been using the project, which is as long as it has existed.

These pain points are perhaps not so intrusive for projects of small-to-intermediate complexity and/or longevity. Indeed, jupyter seems good at making quick data science projects look smooth, shiny, and inviting. That is, at the crucial moment when I need to make my data science project look sophisticated-yet-friendly, it lures colleagues into my web(-based IDE). Then it is too late mwhahahahah you have fallen into my trap now you are committed you had better find an engineering budget to maintain this mess. This entrapment might be a feature not a bug, as far as the realities of team dynamics and their relation to software development and organisational support. We want to lure people in until our problems become their problems, because a problem shared is a problem divided. Also shared trauma is a bonding experience.

Some argue that the weird / irritating constraints of jupyter are a strength, and can even incentivise good architecture. See Guillaume Chevallier and Jeremy Howard.

I think of the famous adage “The fastest code is code that doesn’t need to run, and the best code is code you don’t need to write”. The uncharitable corollary might be “Thus, let’s make writing code horrible so that you write less of it”. That is not even necessarily a crazy position, and if that is what Guillaume and Jeremy are saying, I guess I’ll bear it in mind? This is some analogue of the traffic calming measures that make a neighbourhood more liveable. If we make it horrible to drive here, people will bring in fewer cars.

Will Crichton explores some of these themes in The Future of Notebooks: Lessons from JupyterCon.

4 Terminology

Pain point: The lexicon of jupyter is confusing. Terminology tarpit alert.

A notebook is on one hand a style of interface, of which jupyter is one interpretation. Other applications with a notebook style of interface are Mathematica and MATLAB.

Jupyter interfaces communicate with a computational backend, which is called a kernel.¹

On the other hand, a notebook is the unit of development in these software packages: a type of file on your disk, containing both code and the output of that code. (In the case of jupyter this file format is marked by the file extension .ipynb, which is short for “ipython notebook”, for fraught historical reasons.) One implementation of a notebook frontend interface over a notebook protocol for jupyter is called the jupyter notebook, launched by the jupyter notebook command, which will open up a javascript-backed notebook interface in a web browser. Another, more recent notebook-style interface implementation is called jupyter lab, which uses much of the same jupyter notebook infrastructure but is distinct and only sometimes interoperable, in ways which I do not pretend to know in depth. There are also multiple other ‘frontends’ which interact over the jupyter notebook protocol to talk to a kernel.
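To make the "code and output in one file" point concrete, here is a minimal sketch that walks over a notebook-shaped dictionary with the standard json module. The field names follow my understanding of the version 4 .ipynb format; the sample content and the helper summarise_notebook are made up for illustration, and real files carry more metadata than this.

```python
import json

# A tiny, hand-written stand-in for the contents of an .ipynb file.
# Field names are assumed from the version 4 notebook format.
SAMPLE_NOTEBOOK = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A title"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["x = 1 + 1\n", "x"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        },
    ],
}


def summarise_notebook(nb):
    """Return (source, number_of_outputs) for each code cell (hypothetical helper)."""
    summary = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # "source" may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        summary.append((source, len(cell.get("outputs", []))))
    return summary


# Round-trip through JSON text, as if the file had been read from disk.
nb = json.loads(json.dumps(SAMPLE_NOTEBOOK))
for source, n_outputs in summarise_notebook(nb):
    print(repr(source), "->", n_outputs, "stored output(s)")
```

The stored outputs living next to the code in the same JSON document is part of why these files are awkward to diff.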

Which sense of notebook is intended you have to work out from context, e.g. the following sentence is contentful:

Yo dawg, I heard you like notebooks, so I started up your jupyter notebook kernel in jupyter notebook

5 Jupyter as UI

See jupyter UI.

6 Front ends

See jupyter front ends.

7 Python-specific bits of Jupyter

Jupyter supports executing code in multiple languages, but is itself built in python and javascript, and there are some python-specific integrations via IPython, especially the “magics”, % and %% commands which are used to control the notebook environment.

8 Jupyter kernels

Jupyter kernels now come in (at least) 2 flavours. I do not know to what extent they are interchangeable. Classic flavour is, I think, ipykernel. xeus is a new entrant.

A persistent process (e.g. Python if you are running Python) hovers in the background waiting for code from you and sends back the results to be displayed in the notebook.
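The "persistent process" idea can be mimicked in a few lines. This is a toy sketch only: a real kernel is a separate process that talks to the front end over a messaging protocol, none of which is modelled here. The class ToyKernel is hypothetical; the point is just that the namespace outlives each request, so later cells can see earlier results.

```python
import contextlib
import io


class ToyKernel:
    """A toy stand-in for a kernel: one namespace that outlives each request."""

    def __init__(self):
        self.namespace = {}
        self.execution_count = 0

    def execute(self, code):
        """Run a chunk of code and return whatever it printed."""
        self.execution_count += 1
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


kernel = ToyKernel()
kernel.execute("x = 20")                 # state is kept between requests
reply = kernel.execute("print(x + 1)")   # so a later "cell" can still see x
print(reply)
```

This persistence is also why running cells out of order can leave the namespace in a state that no top-to-bottom run would produce.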

8.1 ipykernel

Notes on classic flavour.

8.1.1 Custom ipykernels

Jupyter looks for kernel specs in a kernel spec directory, depending on my platform.

Say my kernel is dan; then the definition can be found in the following location:

Unixey: ~/.local/share/jupyter/kernels/dan/kernel.json
macOS: ~/Library/Jupyter/kernels/dan/kernel.json
Win: %APPDATA%\jupyter\kernels\dan\kernel.json

See the manual for details.
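As a rough sketch, the three per-user locations above can be computed like so. The function user_kernel_spec_path is a hypothetical helper that only mirrors the paths listed here; it does not ask Jupyter where it actually looks, so system-wide locations and environment overrides are ignored.

```python
import os
import sys
from pathlib import Path, PurePosixPath, PureWindowsPath


def user_kernel_spec_path(kernel_name, platform=None, home=None, appdata=None):
    """Guess the per-user kernel.json location for a named kernel."""
    platform = platform or sys.platform
    if platform.startswith("win"):
        # Windows: %APPDATA%\jupyter\kernels\<name>\kernel.json
        appdata = appdata or os.environ.get("APPDATA", "")
        return PureWindowsPath(appdata, "jupyter", "kernels", kernel_name, "kernel.json")
    home = PurePosixPath(home or Path.home().as_posix())
    if platform == "darwin":
        # macOS: ~/Library/Jupyter/kernels/<name>/kernel.json
        return home / "Library" / "Jupyter" / "kernels" / kernel_name / "kernel.json"
    # Other Unix-like systems: ~/.local/share/jupyter/kernels/<name>/kernel.json
    return home / ".local" / "share" / "jupyter" / "kernels" / kernel_name / "kernel.json"


print(user_kernel_spec_path("dan"))
```

In practice, running jupyter kernelspec list is the more reliable way to see which kernel specs Jupyter has found.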

How to set up Jupyter to use a virtualenv (or other) kernel? tl;dr Do this from inside the virtualenv to bootstrap it:

pip install ipykernel
python -m ipykernel install --user --name=my-virtualenv-name

Addendum: for Anaconda, we can auto-install all discoverable conda envs, which worked for me, whereas the ipykernel method did not.

conda install nb_conda_kernels

8.1.2 Custom kernel lite

e.g. if I wish to run a kernel for a standard executable but with different parameters. See here for a worked example for GPU-enabled kernels:

For computers on Linux with optimus, you have to make a kernel that will be called with optirun to be able to use GPU acceleration.

I made a kernel in ~/.local/share/jupyter/kernels/dan/kernel.json and modified it thus:

{
    "display_name": "dan-gpu",
    "language": "python",
    "argv": [
        "/usr/bin/optirun",
        "--no-xorg",
        "/home/me/.virtualenvs/dan/bin/python",
        "-m",
        "ipykernel_launcher",
        "-f",
        "{connection_file}"
    ]
}
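If several such wrapped kernels are needed, the JSON above can be generated rather than hand-edited. A minimal sketch, assuming the same layout as the quoted kernel.json; the helper wrapped_kernel_spec is hypothetical, and writing the result into the kernels directory is left out.

```python
import json


def wrapped_kernel_spec(display_name, python_path, wrapper=()):
    """Build a kernel spec dict whose launch command is prefixed by a wrapper."""
    return {
        "display_name": display_name,
        "language": "python",
        "argv": [
            *wrapper,
            python_path,
            "-m",
            "ipykernel_launcher",
            "-f",
            "{connection_file}",  # placeholder that Jupyter fills in at launch
        ],
    }


spec = wrapped_kernel_spec(
    "dan-gpu",
    "/home/me/.virtualenvs/dan/bin/python",
    wrapper=["/usr/bin/optirun", "--no-xorg"],
)
print(json.dumps(spec, indent=4))
```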

Any script called can be set up to use CUDA but not the actual GPU, by setting an environment variable in the script, which is handy for kernels. So this could be in a script called noprimusrun:

# Hide all GPUs from CUDA, then run the given command with its arguments kept intact
CUDA_VISIBLE_DEVICES= "$@"
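The same trick as the noprimusrun script can be sketched in Python, which is handy if the wrapper needs more logic than a one-line shell script. The helper run_without_gpu is hypothetical; the demonstration just launches a child Python process and asks it what it sees in CUDA_VISIBLE_DEVICES.

```python
import os
import subprocess
import sys


def run_without_gpu(argv):
    """Run a command with CUDA_VISIBLE_DEVICES set to the empty string."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Demonstration: the child process reports its own view of the variable.
child_code = "import os; print(repr(os.environ['CUDA_VISIBLE_DEVICES']))"
result = run_without_gpu([sys.executable, "-c", child_code])
print(result.stdout.strip())
```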

8.2 Xeus

Xeus kernels reimplement Jupyter kernels in C++. I think they are language agnostic (well, work with any language that can bind to C++ which is presumably everything). The Python one is the most mature, AFAICT.

Introduced in A new Python kernel for Jupyter:

Long story short:

xeus-python is a lot lighter than ipykernel, which makes it a lot easier to implement new features on top of it.

xeus-python already works with the Jupyter Lab debugger

xeus-based kernels are more versatile in that one can overload e.g. the concurrency model. This is something that Kitware’s SlicerJupyter project takes advantage of to integrate with the Qt event loop of their Qt-based desktop application.

It looks to me like the protocol is the same, but the implementation details of the bindings are different. Can anyone confirm?

9 Jupyterlite

JupyterLite

JupyterLite is a JupyterLab distribution that runs entirely in the browser built from the ground-up using JupyterLab components and extensions.

10 Hosting static Jupyter notebooks on the web

Various options. For one, GitHub will attempt to render Jupyter notebooks in GitHub repos; I have had various glitches and inconsistencies with images and equations rendering in such notebooks. Perhaps it is better in…

The fastest way to share your notebooks - announcing NotebookSharing.space - Yuvi Panda

You can upload your notebook easily via the web interface at notebooksharing.space: Once uploaded, the web interface will just redirect you to the beautifully rendered notebook, and you can copy the link to the page and share it!

Or you can directly use the nbss-upload command line tool: …

When uploading, you can opt-in to have collaborative annotations enabled on your notebook via the open source, web standards-based hypothes.is service. You can thus directly annotate the notebook, instead of having to email back and forth about ‘that cell where you are importing matplotlib’ or ‘that graph with the blue border’. This is one of the coolest features of notebooksharing.space.

11 Hosting live Jupyter notebooks on the web

Jupyter can host online notebooks, even multi-user notebook servers — if you are brave enough to let people execute weird code on your machine. I’m not going to go into the security implications. tl;dr encrypt and password-protect that connection. Here, see this jupyterhub tutorial.

11.1 Commercial notebook hosts

NB: This section is outdated. 🏗; I should probably mention the ill-explained Kaggle kernels and Google Cloud ML execution of same, etc.

At base level, you can run one on a standard cloud option, buying compute time as a virtual machine or container and running a Jupyter notebook there for your choice of data science workflow.

Kaggle Kernels are somehow also Kaggle notebooks now or something? Anyway, it seems to execute code.
Paperspace - Gradient Notebooks

Gradient Notebooks is a web-based Jupyter IDE with free GPUs & IPUs.
SageMath runs notebooks online as part of their CoCalc service, with fancy features. There is a free tier and unspecified other pricing. Messy design but tidy open-source ideals.
Anaconda.org appears to be a Python package development service, but they also have a sideline in hosting notebooks. ($7/month) Requires you to use their Anaconda Python distribution tools to work, which is… a plus and a minus. The Anaconda Python distro is simple for scientific computing, but if your hard disk is as full of Python distros as mine is, you tend not to want more of them confusing things and wasting disk space.
Microsoft’s Azure notebooks

Azure Notebooks is a free hosted service to develop and run Jupyter notebooks in the cloud with no installation. Jupyter (formerly IPython) is an open source project that lets you easily combine markdown text, executable code (Python, R, and F#), persistent data, graphics, and visualisations onto a single, shareable canvas called a notebook.
Google’s Colaboratory is hip now

Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.

Here is an intro and here is another

12 Pro tips and gotchas

12.1 Meta tips

Anne Bonner’s Tips, Tricks, Hacks, and Magic: How to Effortlessly Optimise Your Jupyter Notebook is actually full of useful stuff. So much stuff that upon reading it, I nearly forget my past traumas with Jupyter notebooks. If you must use Jupyter, read her article and it will make stuff seem better. Many tips on this page I gleaned from her work.

12.2 boilerplate

%%writefile basic_imports.py
%load basic_imports.py

12.3 Run a notebook without the Jupyter server

See Jupyter command line.

12.4 Offline MathJax in Jupyter

e.g. for latex free mathematics.

python -m IPython.external.MathJax /path/to/source/MathJax.zip

12.5 I can’t see part of the cell!

Sometimes, you can’t see the whole code cell; part of it overflows into some weird hidden alternate dimension. This is a known issue to do with vanishing scrollbars. The workaround is simple enough:

zooming out to 90% and zooming back in to 100%, Ctrl + - / +

12.6 IOPub data rate exceeded

You got this error and you weren’t doing anything that bandwidth intensive? Say, you were just viewing a big image, not a zillion images? It’s Jupyter being conservative in version 5.0:

jupyter notebook --generate-config
atom ~/.jupyter/jupyter_notebook_config.py

update the c.ServerApp.iopub_data_rate_limit to be big, e.g. c.ServerApp.iopub_data_rate_limit = 10000000.

This is fixed after 5.0.

13 Securing

Modern Jupyter is suspicious of connections per default and will ask you either for a magic token or a password and thereafter, I think, encrypts the connection (?), which is sometimes what I want. Not always.

When I am in HPC hell, accessing Jupyter notebooks through a double SSH tunnel, the last thing I need is to put a hat on a hat by triply securing the connection. This is now adding more points of failure without any additional benefit. Also, sometimes the tokens do not work over SSH tunnels for me and I cannot work out why. I think it is something about some particular Jupyter version mangling tokens, or possibly failing to report that it has not claimed a port used by someone else (although it happens more often than is plausible for the latter case). CodingMatters notes that the following invocation will disable all Jupyter-side security measures:

$ jupyter notebook --port 5000 --no-browser --ip='*' --ServerApp.token='' --ServerApp.password=''

Obviously never do this unless you believe that everyone sharing a network with that machine has your best interests at heart. The best way to ensure that is to be accessing a machine through a firewall to a locked-down port.
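For the tunnelled set-up mentioned above, the text does not say how the double SSH tunnel is built, so this is only one plausible sketch, assuming OpenSSH with jump-host support. The host names are made up, the helper tunnel_command is hypothetical, and port 5000 simply matches the invocation above.

```python
import shlex


def tunnel_command(compute_host, jump_host, port=5000):
    """Build an ssh command forwarding a local port to a Jupyter server."""
    return [
        "ssh",
        "-N",                              # no remote command, only forwarding
        "-J", jump_host,                   # hop through the login node
        "-L", f"{port}:localhost:{port}",  # local port to the same port on the compute host
        compute_host,
    ]


# Hypothetical hosts; substitute whatever the cluster actually uses.
cmd = tunnel_command("me@compute-node", "me@login.example.org")
print(shlex.join(cmd))
```

With the tunnel up, the notebook server would be reached at localhost on the forwarded port, so it never needs to be exposed beyond the machines the tunnel passes through.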

There are various other useful settings which one could use to reduce security. In config file format for ~/.jupyter/jupyter_notebook_config.py:

c.ServerApp.disable_check_xsrf = True #irritates ssh tunnel for me that one time
c.ServerApp.open_browser = False # consumes a 1 time token and is pointless from a headless HPC
c.ServerApp.use_redirect_file = False # forces display of token rather than writing it to some file that gets lost in the containerisation and is useless in headless HPC
c.ServerApp.allow_password_change = True # Allow password setup somewhere sensible.
c.ServerApp.token = '' # no auth needed
c.ServerApp.password = password # actually needs to be hashed - see below

Eric Hodgins recommends this hack for a simple password without messing about trying to be clever with their browser infrastructure (which TBH does seem to break pretty often for me):

c = get_config()
c.ServerApp.ip = '*'
c.ServerApp.open_browser = False
c.ServerApp.port = 5000

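# setting up the password; ServerApp.password expects the hashed value
# (recent IPython no longer ships IPython.lib.passwd; jupyter_server provides the same helper)
from jupyter_server.auth import passwd
password = passwd("your_secret_password")
c.ServerApp.password = password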

14 As a proxy

jupyter-server-proxy:

Jupyter Server Proxy lets you run arbitrary external processes (such as RStudio, Shiny Server, syncthing, PostgreSQL, etc) alongside your notebook, and provide authenticated web access to them.

Note

This project used to be called nbserverproxy. If you have an older version of nbserverproxy installed, remember to uninstall it before installing jupyter-server-proxy; otherwise they may conflict.

The primary use cases are:

Use with JupyterHub / Binder to allow launching users into web interfaces that have nothing to do with Jupyter - such as RStudio, Shiny, or OpenRefine.

Allow access from frontend JavaScript (in classic notebook or JupyterLab extensions) to access web APIs of other processes running locally in a safe manner. This is used by the JupyterLab extension for dask.

15 References

Chattopadhyay, Prasad, Henley, et al. 2020. “What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities.”

Granger, and Pérez. 2021. “Jupyter: Thinking and Storytelling With Code and Data.” Computing in Science Engineering.

Himmelstein, Rubinetti, Slochower, et al. 2019. “Open Collaborative Writing with Manubot.” Edited by Dina Schneidman-Duhovny. PLOS Computational Biology.

Millman, and Pérez. 2014. “Developing Open Source Scientific Practice.”

Otasek, Morris, Bouças, et al. 2019. “Cytoscape Automation: Empowering Workflow-Based Network Analysis.” Genome Biology.

Sokol, and Flach. 2021. “You Only Write Thrice: Creating Documents, Computational Notebooks and Presentations From a Single Source.” In.

Footnotes

because in mathematics and computer science if you don’t know what to call something you call it a kernel.↩︎