Tolerating Jupyter’s file format
In order to make your python code accessible, we wrapped it into encoded javascript strings. No need to thank us.
February 9, 2017 — February 13, 2023
Various notes on dealing with the jupyter file format, which, in the name of convenience, gives you new and different problems to learn to manage.
Because jupyter notebooks (the file format) are a weird mash of binary multimedia content and program input and output data, all wrapped up in a JSON encoding, many things that would be simple and seamless with normal text simply do not work for the .ipynb jupyter file format. This is a huge barrier to seamlessly integrating text and notebook-based development practices, and thus impairs the jupyter mission goal of providing an easy on-ramp to data science for users. IMO this is one of the larger annoyances of the many in the jupyter system, and it would have been completely avoidable if they had settled on a less awkward textual data format than JSON for the backend storage, like R or julia did. Too late now, I suppose. There are various workarounds, ameliorations and so forth, but no one agrees on which to use, so I switch between all of them constantly.
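To make the complaint concrete, here is a minimal sketch of what two lines of python look like once they are stored as a notebook. The dictionary is hand-built and simplified (it assumes the nbformat 4 layout and omits most of the metadata a real notebook carries), so treat it as an illustration rather than a faithful dump.

```python
import json

# A hand-built, simplified notebook in the nbformat 4 layout: one code cell
# with a single stream output. Real notebooks carry far more metadata.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["x = 1 + 1\n", "print(x)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        }
    ],
}

# This is roughly what lands on disk and in version control.
serialised = json.dumps(notebook, indent=1)
print(serialised)

# The python is recoverable, but only by decoding the JSON and joining
# the string fragments back together.
code = "".join(notebook["cells"][0]["source"])
print(code)
```

Every line of source becomes an escaped JSON string sitting next to its own output, which is why ordinary text tools struggle with it.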
1 Version control
One of the things that breaks is diffing and merging. Things get messy when I try to put notebooks into version control. In particular, my repository gets very large and my git client may or may not show diffs. Oh, and merging using the usual merge tools is likely to break things, because merge tools do not know about the idiosyncratic JSON-based storage. How do we fix that? Here are some options.
1.1 No jupyter
I want an interactive workbook that can dynamically execute code and include documentation?
No problem. There are many solutions that are superior to jupyter for this, albeit less hyped. One obvious example is knitr, which can support python. Lesser known projects like pweave also work fine.
A current favourite for me is the rather nice VS Code Interactive python mode.
There are surely many more.
There is nothing requiring me to stick to jupyter except that it is mysteriously popular and ubiquitous.
However, popularity is a strong reason. jupyter is the qwerty of python development, so we are usually stuck with it. Here are some alternatives.
1.2 nbdev
For python projects there is nbdev, which aims to solve a number of problems at once, including versioning notebooks. Main value proposition:
- A robust, two-way sync between notebooks and source code, which allow you to use your IDE for code navigation or quick edits if desired.…
- Tools for merge/conflict resolution with notebooks in a human readable format.
- Active maintenance and improvement by a transparent and user-engaged crew. See nbdev v2 review: Git-friendly Jupyter Notebooks.
In addition, there are other useful things:
- Automatically generate docs from Jupyter notebooks. These docs are searchable and automatically hyperlinked to appropriate documentation pages by introspecting keywords you surround in backticks.
- Utilities to automate the publishing of pypi and conda packages including version number management.
- Ability to write tests directly in notebooks without having to learn special APIs. These tests get executed in parallel with a single CLI command. You can even define certain groups of tests such that you don’t have to always run long-running tests.
- Continuous integration (CI) comes setup for you with GitHub Actions out of the box, that will run tests automatically for you. Even if you are not familiar with CI or GitHub Actions, this starts working right away for you without any manual intervention.
- Integration With GitHub Pages for docs hosting: nbdev allows you to easily host your documentation for free, using GitHub pages.
- Create Python modules, following best practices such as automatically defining __all__ (more details) with your exported functions, classes, and variables.
- Math equation support with LaTeX.
So I guess that is nice? I am faintly offended that the solution to work around jupyter’s attempt to “fix” plain text storage by replacing it is to reinvent it.
nbdev does not, sadly, make the jupyter browser client itself into an easier place to type code. I still prefer to use a normal code editor or IDE for editing code, even jupyter notebooks.
1.3 Strip notebooks
You can automatically strip images and other big things from your notebook to keep them smaller and tidier if you are using git as your version control. They still work, but if you restore the notebook from git it does not any longer have all the graphics in it, and is many megabytes smaller. Usually this is fine, since you already have the code to generate them again right there, so you don’t necessarily want them around anyway.
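To check whether stripping is worth the bother for a given notebook, a rough sketch like the following tallies how many characters each output MIME type contributes. The function name output_sizes and the fake notebook are my own inventions for illustration; the only assumption about the format is that rich outputs keep their payloads under a "data" key, as in nbformat 4.

```python
import json


def output_sizes(notebook):
    """Return {mime_type: total characters} over all code-cell outputs."""
    totals = {}
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            # Rich outputs (display_data, execute_result) keep payloads in "data".
            for mime, payload in output.get("data", {}).items():
                if isinstance(payload, list):
                    text = "".join(payload)
                elif isinstance(payload, str):
                    text = payload
                else:
                    # e.g. application/json payloads are stored as objects
                    text = json.dumps(payload)
                totals[mime] = totals.get(mime, 0) + len(text)
    return totals


# A fake notebook: the long string stands in for a base64-encoded plot.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "source": ["plot_something()"],
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "image/png": "A" * 40000,
                        "text/plain": ["<Figure size 640x480>"],
                    },
                }
            ],
        }
    ]
}

sizes = output_sizes(notebook)
print(sizes)
```

On real notebooks, load the file with json.load first. In my experience the image entries dwarf everything else.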
Manually doing it is tedious. See how fastai does this automatically with automated git hooks. Not well explained, but it works. The quickest way, if we are not working for fastai, is nbstripout, upon which the fastai hack is AFAICT based. nbstripout includes its own installation script, which usually works except not in git submodules, although nothing works in submodules so no change there. You can set up attributes so that these filters and others are invoked automatically. It’s a surprisingly under-documented thing for some reason. Excluding certain things from nbstripout filtering can be done several ways, including in the notebook itself. See the github issue on that theme.
tl;dr In the repository do this:
pip install nbstripout # or conda install nbstripout -c conda-forge
nbstripout --install # basic mode
nbstripout --install --attributes .gitattributes # slightly more explicit attributes
I do this for all my notebooks now. This doesn’t entirely solve the diffing and merging hurdles, but usually removes just enough pointless cruft that merging kind-of works fairly often.
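For the curious, the core of what such a filter does can be sketched in a few lines. This is not nbstripout's actual implementation (which handles metadata, exclusions and other cases); strip_outputs is a hypothetical helper that only empties outputs and resets execution counts, assuming the nbformat 4 layout.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of the notebook with code-cell outputs and counters removed."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# A fake notebook whose single output is a large stand-in for an image.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["plot_something()"],
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {"image/png": "A" * 40000},
                }
            ],
        }
    ],
}

stripped = strip_outputs(notebook)
before = len(json.dumps(notebook))
after = len(json.dumps(stripped))
print(before, after)
```

The source survives untouched, and the execution counter, which changes on every run and pollutes diffs, is gone along with the images.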
1.4 Minimalist images
If we want images in the notebook to be small, tell matplotlib to use low-quality images:
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('jpeg', quality=70)
I had to go deep to find this. The answers were in the source of IPython.core.pylabtools.select_figure_formats and matplotlib_inline/backend_inline.py. It was a nice couple of months while I had that trick, but it no longer works in matplotlib 3.6.
A tip from Erin Kenna for plotly is to keep versioned images small but allow larger ones sometimes.
#%%
import plotly.io as pio

# Pick a renderer
# https://plotly.com/python/renderers/
renderer = "plotly_mimetype"
# renderer = "jpeg"
if renderer == "jpeg":
    # Configure the static JPEG renderer only when it is selected
    jpeg_renderer = pio.renderers["jpeg"]
    jpeg_renderer.width = None
    jpeg_renderer.height = None
    jpeg_renderer.scale = 1.8  # modifying the scale since we won’t have zoom controls
Then explicitly pass that to all plots, e.g. fig.show(renderer=renderer).
This is obviously a little more manual and error-prone, but the power of being able to explicitly include some images is useful.
1.5 jupytext
Another way to make my notebooks something closer to text is jupytext. It claims to do that and more:
Wish you could edit [jupyter notebooks] in your favourite IDE? And get clear and meaningful diffs when doing version control? Then… Jupytext may well be the tool you’re looking for!
Jupytext can save Jupyter notebooks as Markdown and R Markdown documents, Julia, Python, R, Bash, Scheme, Clojure, C++ and q/kdb+ scripts.
There are multiple ways to use jupytext:
- Directly from Jupyter Notebook or JupyterLab. Jupytext provides a contents manager that allows Jupyter to save your notebook to your favourite format (.py, .R, .jl, .md, .Rmd…) in addition to (or in place of) the traditional .ipynb file. The text representation can be edited in your favourite editor. When you’re done, refresh the notebook in Jupyter: inputs cells are loaded from the text file, while output cells are reloaded from the .ipynb file if present. Refreshing preserves kernel variables, so you can resume your work in the notebook and run the modified cells without having to rerun the notebook in full.
- On the command line. jupytext converts Jupyter notebooks to their text representation, and back. The command line tool can act on notebooks in many ways. It can synchronise multiple representations of a notebook, pipe a notebook into a reformatting tool like black, etc… It can also work as a pre-commit hook if you wish to automatically update the text representation when you commit the .ipynb file.
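To give a feel for why a text representation is friendlier, here is a rough sketch of splitting a percent-style script (cells delimited by "# %%" comment lines, the same markers VS Code uses) back into cells. This is my own toy parser, not jupytext's; split_percent_script is a hypothetical name, and the sketch ignores headers, cell metadata and anything before the first marker.

```python
import re

# Matches "# %%" or "#%%", optionally followed by a tag such as "[markdown]".
MARKER = re.compile(r"^#\s*%%(.*)$")


def split_percent_script(text):
    """Split a percent-style script into (cell_type, source) pairs."""
    cells = []
    current_type, current_lines = None, []
    for line in text.splitlines():
        match = MARKER.match(line)
        if match:
            # A new marker closes the cell collected so far, if any.
            if current_type is not None:
                cells.append((current_type, "\n".join(current_lines).strip()))
            tag = match.group(1).strip()
            current_type = "markdown" if tag.startswith("[markdown]") else "code"
            current_lines = []
        elif current_type is not None:
            current_lines.append(line)
        # Lines before the first marker are dropped in this sketch.
    if current_type is not None:
        cells.append((current_type, "\n".join(current_lines).strip()))
    return cells


script = """\
# %% [markdown]
# # A title

# %%
x = 1 + 1
print(x)
"""

cells = split_percent_script(script)
for cell_type, source in cells:
    print(cell_type, repr(source))
```

The point is that the script is plain python throughout: markdown cells are just comments, so grep, diff and an ordinary editor all work on it.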
A plus is that search-and-replace would then work seamlessly across normal code and wack notebook-encoded code, the lack of which is, I assure you, a constant irritation.
One downside is that if I develop a workflow about transforming my notebook back into proper code in order to run it, I might wonder if the notebook has gained me anything over ordinary literate coding except circuitous workarounds. Am I then sure I do not secretly want to use knitr? Pro tip: although that advertises itself for R, it supports python already. That is worth mentioning again. We do not need to be arsing about with this thing. We can leave.
Anyway, jupytext sounds promising, right? There is a downside for me which is that I just (2020-11-02) spent 90 minutes trying to get jupytext to work on my jupyter notebook and it continues to sullenly fail to function. I do not have any more time for debugging this nonsense. I might check back in a year or two, but for now this is dead to me.
1.6 Diffing/merging notebooks (sort-of) natively
Ok, I surrender. We are stuck with the nasty jupyter notebook format. Fine. nbdime provides diffing and merging for notebooks. It has git integration:
I do not use this one because it seemed too slow on the large notebooks I was using and did not play well with my git GUI. In any case it does not seem to support 3 way merging, which means most merges fail and need manual intervention anyway.
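The idea behind notebook-aware diffing can be sketched with the standard library alone: compare the cell sources and ignore outputs, counters and metadata. This is emphatically not nbdime's algorithm, just a minimal illustration using difflib; diff_notebook_sources and source_text are hypothetical helper names.

```python
import difflib


def source_text(cell):
    """Cell source may be stored as a string or a list of strings."""
    source = cell.get("source", "")
    return source if isinstance(source, str) else "".join(source)


def diff_notebook_sources(old, new):
    """Unified diff of cell sources only, ignoring outputs and metadata."""

    def flatten(notebook):
        lines = []
        for index, cell in enumerate(notebook.get("cells", [])):
            lines.append(f"## cell {index} ({cell.get('cell_type')})")
            lines.extend(source_text(cell).splitlines())
        return lines

    return list(
        difflib.unified_diff(
            flatten(old), flatten(new), fromfile="old", tofile="new", lineterm=""
        )
    )


def make_notebook(value, count, image):
    """Build a fake one-cell notebook; image stands in for base64 data."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": count,
                "source": [f"x = {value}\n", "print(x)"],
                "outputs": [
                    {"output_type": "display_data", "data": {"image/png": image}}
                ],
            }
        ]
    }


old = make_notebook(1, 3, "A" * 1000)
new = make_notebook(2, 9, "B" * 1000)

diff = diff_notebook_sources(old, new)
print("\n".join(diff))
```

Only the one changed line of code shows up; the changed image and execution count, which would dominate a raw JSON diff, do not.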
Development on this is slow and the latest release is broken in the git installation phase. Fixed development release can be installed
2 Exporting notebooks
I can host static versions easily using nbviewer (and github will do this automatically.) For fancy variations I need to read how the document templates work. Here is a base latex template for e.g. academic use.
For special occasions I could write my own or customise an existing exporter. Julius Schulz has virtuosic tips, e.g. using cell metadata to format figures like this:
fast.ai’s nb2md trick renders jupyter for blogging with my blogging platform of choice. See also jupytext above.
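Returning to the idea of writing my own exporter: stripped of templates and styling, an exporter is a walk over the cells that emits each one in the target format. The following is a toy sketch that produces Markdown using only the standard library. It is not how nbconvert does it, and notebook_to_markdown is a hypothetical name; outputs are simply dropped.

```python
FENCE = "`" * 3  # built programmatically to keep this example easy to embed


def source_text(cell):
    """Cell source may be stored as a string or a list of strings."""
    source = cell.get("source", "")
    return source if isinstance(source, str) else "".join(source)


def notebook_to_markdown(notebook, language="python"):
    """Render markdown cells verbatim and code cells as fenced blocks."""
    parts = []
    for cell in notebook.get("cells", []):
        text = source_text(cell)
        if cell.get("cell_type") == "markdown":
            parts.append(text)
        elif cell.get("cell_type") == "code":
            parts.append(f"{FENCE}{language}\n{text}\n{FENCE}")
        # Other cell types (e.g. raw) are skipped in this sketch.
    return "\n\n".join(parts) + "\n"


notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Results\n", "Some prose."]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print('hi')"],
            "outputs": [],
        },
    ]
}

markdown = notebook_to_markdown(notebook)
print(markdown)
```

A real exporter would also render outputs and consult cell metadata, which is where the fancy figure formatting comes in.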