Version Control Your Jupyter Notebooks with Jupytext

A primer on using Jupytext to simplify versioning and sharing Jupyter notebooks
Categories: Jupyter, Version Control, Python

Author: Hector

Published: Wednesday, October 25, 2023

Version Control with Jupyter Notebooks is Bothersome

If you’ve ever accidentally opened a Jupyter notebook (.ipynb) in a text editor you’ve seen something like this:

Untitled.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6c703892-4507-481a-8cbb-97a7898a2ea5",
   "metadata": {},
   "source": [
    "# Demo Notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "bc53e6ac-2ab9-4333-93b5-7daaa624934a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f30b4e72-35ab-41b1-ab5b-e4e1aac073b5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "math.sqrt(9)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

This is just a simple Jupyter notebook with a markdown cell and two Python cells:

Jupyter notebook with two cells


This is what a simple Jupyter notebook looks like, but once you start generating plots or displaying images, you will end up with a file with blobs that look like this:

   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAiMAAAGnCAYAAABl41fiAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy81sbWrAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAyaklEQVR4nO3de3Dc9X3v/9fer9rV1StZFpirL2BsYrBrks6P9LjHp3RIyaQZp5lixzOhh4zJENSZBBcwhbT4D4rHHeKMSQMnadNOnNCE6dTUCdWBcBLcuDGlMUnMJQZsZEvWxdqVVtpd7e7398dKq7uslSV99vJ8zOxI+9X3q/0IvvK+9Lm8PzbLsiwBAAAYYjfdAAAAUNkIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMCogsPIq6++qjvvvFPLly+XzWbTCy+8cMlrXnnlFX3kIx+Rx+PRtddeq29961vzaCoAAChHBYeReDyu9evX6+DBg3M6/7333tMf/uEf6uMf/7jeeOMNfelLX9LnP/95/ehHPyq4sQAAoPzYLmejPJvNph/+8Ie66667ZjznK1/5io4cOaI333wzf+wzn/mM+vr6dPTo0fm+NAAAKBPOxX6BY8eOaevWrROObdu2TV/60pdmvCaZTCqZTOafZ7NZ9fb2qq6uTjabbbGaCgAAFpBlWerv79fy5ctlt888GLPoYaSjo0ORSGTCsUgkolgspqGhIfl8vinX7Nu3T4899thiNw0AACyBs2fPasWKFTN+fdHDyHzs2bNHra2t+efRaFRXXHGFzp49q1AoZLBlAABgrmKxmFpaWlRVVTXreYseRhobG9XZ2TnhWGdnp0Kh0LS9IpLk8Xjk8XimHA+FQoQRAABKzKWmWCx6nZEtW7aora1twrGXXnpJW7ZsWeyXBgAAJaDgMDIwMKA33nh....

This is how Jupyter stores your images in the notebook, which allows it to display the plots when you open a previously executed notebook. These outputs, along with the JavaScript behind the scenes, make Jupyter notebooks messy to version control. Git still stores the .ipynb file as text, but a re-rendered figure rewrites an enormous base64 string, so in practice the file behaves like a binary: it can be added or updated, but you won’t get the readable line-by-line changes you would get with other file types. There are some solutions to the need to track changes in Jupyter notebooks (discussed at the end), but one option I’ve used since 2021 is Jupytext.
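
To get a feel for how much of a notebook file is output rather than code, here is a minimal sketch that uses only the standard library. The function name output_share and the tiny demo notebook are made up for illustration; the fake "image" is just a run of repeated characters standing in for a base64 blob.

```python
import json


def output_share(notebook_json: str) -> float:
    """Return the fraction of a notebook's serialized size taken up by cell outputs."""
    nb = json.loads(notebook_json)
    outputs_size = sum(
        len(json.dumps(cell.get("outputs", [])))
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    )
    return outputs_size / len(json.dumps(nb))


# A tiny hypothetical notebook: one code cell whose output is a fake base64 "image".
demo = json.dumps({
    "cells": [{
        "cell_type": "code",
        "source": ["plot()"],
        "outputs": [{"output_type": "display_data",
                     "data": {"image/png": "A" * 5000}}],
    }],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

print(f"{output_share(demo):.0%} of the file is output")
```

For a real notebook, pass in the file contents instead, for example the text read from Untitled.ipynb.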

What is Jupytext?

Jupytext is a tool that synchronizes your Jupyter notebooks (.ipynb) with plain-text file types, which can therefore be used seamlessly with Git. Jupytext was started in 2018 by Marc Wouters and was first publicly announced in a blog post in September 2018.

Jupytext lets you pair Jupyter notebooks with multiple other code or markdown file formats. My preferred format is the “percent script”, which I will detail below. Returning to our simple notebook with one markdown cell and two code cells, here it is in “percent script” .py form:

Untitled.py
# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py:percent
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.15.2
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# %% [markdown]
# # Demo Notebook

# %%
import math

# %%
math.sqrt(9)

You’ll notice we have a header with some of our Jupytext configuration, as well as our Jupyter kernel details. Besides the header and the # %% comments, it looks a lot like ordinary Python code. Depending on the purpose of your notebook, you could run this from the command line like any other Python file. Each notebook cell is delimited by # %%, and IDEs like Spyder, VS Code, or PyCharm can execute these cells individually. You’ll also notice that we no longer have the output of our cells; this is just our code (this could be a deal breaker for you). The output still exists in the original .ipynb file, but this .py file is kept clean. Depending on how you configure Jupytext, the paired code files can update every time you save the .ipynb file.
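
To make the cell delimiters concrete, here is a deliberately simplified sketch of splitting a percent script into cells. It is not Jupytext's own parser: the function split_percent_cells and the CELL_MARKER pattern are made up for illustration, and cell titles or per-cell metadata after the marker are not handled.

```python
import re

# Matches a cell marker line such as "# %%" or "# %% [markdown]".
CELL_MARKER = re.compile(r"^# %%(?:\s*\[(\w+)\])?\s*$")


def split_percent_cells(text: str) -> list[tuple[str, str]]:
    """Split percent-format text into (cell_type, source) pairs.

    Simplified on purpose: lines before the first marker (the YAML header)
    are skipped, and cell titles or metadata after the marker are not handled.
    """
    cells: list[tuple[str, list[str]]] = []
    for line in text.splitlines():
        match = CELL_MARKER.match(line)
        if match:
            # A marker without a [type] tag starts a code cell.
            cells.append((match.group(1) or "code", []))
        elif cells:
            cells[-1][1].append(line)
    return [(kind, "\n".join(lines).strip()) for kind, lines in cells]


script = """\
# %% [markdown]
# # Demo Notebook

# %%
import math

# %%
math.sqrt(9)
"""

for kind, source in split_percent_cells(script):
    print(kind, repr(source))
```

Markdown cells keep their leading "# " comment prefix in this sketch; a real converter would strip it when rebuilding the notebook.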

Working with .py Files is Easy!

Most of the benefit of using Jupytext (for me) stems from the ease of working with Python files directly. If this is not a big deal for you, there isn’t much reason to use Jupytext.

These are the benefits I see:

  • Version control is simple and clean
  • All your favorite formatters and linters (black, isort, ruff) work natively with .py files, although .ipynb files are getting more support than they used to
  • Jupyter’s interactivity is unbeatable, but IDEs like VS Code or PyCharm are more feature rich and are more responsive when writing code
  • If you use the “percent Script” file format, you can run your notebook in other IDEs, too

Jupytext Usage

Once Jupytext is installed (I install it in my main Python environment, where Jupyter is installed, not in each Python virtual environment/Jupyter kernel), you can choose to pair individual Jupyter notebooks with the format of your choice by bringing up the Jupyter Command Palette (Ctrl/Command + Shift + C):

Available Jupytext pairing options


This will create a file in the same folder with the same file name, but with a new file extension according to the selected format. If you delete either of these paired files, Jupytext will re-generate it after you save the other paired file! Using the Jupyter Command Palette method, you can pair multiple file types, if they all have different file extensions. To stop pairing, open the Jupyter Command Palette and uncheck what you selected (or select Unpair Notebook at the bottom).
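
Pairing does not have to go through the Command Palette. Jupytext also ships a command-line tool with --set-formats and --sync options, and the sketch below wraps it from Python. The helper names pairing_command and sync_command are made up for illustration, and the command only runs if the jupytext executable and a notebook named Untitled.ipynb are actually present.

```python
import shutil
import subprocess
from pathlib import Path


def pairing_command(notebook: str, formats: str = "ipynb,py:percent") -> list[str]:
    """Build the jupytext command that pairs `notebook` with the given formats."""
    return ["jupytext", "--set-formats", formats, notebook]


def sync_command(notebook: str) -> list[str]:
    """Build the command that brings the paired files back in sync."""
    return ["jupytext", "--sync", notebook]


if __name__ == "__main__":
    cmd = pairing_command("Untitled.ipynb")
    print(" ".join(cmd))
    # Only run the command when the CLI is installed and the notebook exists.
    if shutil.which("jupytext") and Path("Untitled.ipynb").exists():
        subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the sketch easy to inspect before anything touches your files.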

Open Files as Notebooks

After installing Jupytext, you’ll see that files of Jupytext-eligible types are shown with notebook icons, even if they haven’t been paired to anything.

Files with notebook icons


Furthermore, if you right-click on one of these Jupytext-eligible files, you can choose to open it as a notebook. With the default settings (which can be changed: Always Open .py as Notebook), double-clicking any of these files will still open it as its plain, original file type in Jupyter.


Open Files as Notebooks

You can treat .py files (that start empty for best results) as Jupyter notebooks by opening them in this way, without ever pairing them to an actual .ipynb. Jupytext converts the text file to a notebook when you open it and back to text when you save, so there is no .ipynb file to worry about.

Configuring Jupytext with pyproject.toml/jupytext.toml

In addition to pairing notebooks on an individual basis, you can also use a configuration file in your folder or repo to set some default Jupytext behavior. For example, this is how I like to set up my analysis projects:

.
├── .gitignore
├── jupytext.toml
├── pyproject.toml
├── data/
├── notebooks/
│   ├── Notebook1.py
│   ├── Notebook2.py
│   └── Notebook3.py
├── src/
│   ├── file1.py
│   └── file2.py
└── nb_ipynb/

I work with the notebook .py files and completely ignore the .ipynb files that are tucked away in nb_ipynb/. I assume they are there, but I never look at them! I ignore .ipynb files in my .gitignore too!

Here is the jupytext.toml file I use to accomplish this:

jupytext.toml
notebook_metadata_filter = "all,-widgets,-varInspector"

# Pair scripts in subfolders of 'notebooks' to notebooks in subfolders of 'nb_ipynb'
[formats]
"nb_ipynb/" = "ipynb"
"notebooks/" = "py:percent"

This configuration will also work if I have multiple analysis projects within the same folder:

.
├── jupytext.toml
├── pyproject.toml
├── Project1/
│   ├── data/
│   ├── notebooks/
│   │   ├── Notebook1.py
│   │   └── Notebook2.py
│   ├── src/
│   │   ├── file1.py
│   │   └── file2.py
│   └── nb_ipynb/
└── Project2/
    ├── data/
    ├── notebooks/
    │   ├── Notebook1.py
    │   └── Notebook2.py
    ├── src/
    │   ├── file1.py
    │   └── file2.py
    └── nb_ipynb/
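
One way to picture the pairing rule in this jupytext.toml is as a path mapping from each script to its hidden notebook. The sketch below is a rough model of that idea, not Jupytext's actual resolution logic. The function paired_ipynb_path is made up for illustration, and it assumes the folder to swap is the last path component named notebooks.

```python
from pathlib import PurePosixPath


def paired_ipynb_path(script_path: str,
                      script_dir: str = "notebooks",
                      ipynb_dir: str = "nb_ipynb") -> str:
    """Map .../notebooks/.../X.py to .../nb_ipynb/.../X.ipynb, keeping subfolders."""
    parts = list(PurePosixPath(script_path).parts)
    # Swap the last folder named `script_dir`; everything around it is preserved.
    index = len(parts) - 1 - parts[::-1].index(script_dir)
    parts[index] = ipynb_dir
    return str(PurePosixPath(*parts).with_suffix(".ipynb"))


for path in ["notebooks/Notebook1.py",
             "Project1/notebooks/Notebook2.py",
             "Project2/notebooks/eda/Notebook1.py"]:
    print(path, "->", paired_ipynb_path(path))
```

A path with no notebooks folder raises ValueError here, which mirrors the fact that files outside that folder are simply not covered by this pairing rule.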

Always Open .py as Notebook

I hardly ever open plain Python files in Jupyter: I use Jupyter for interactive coding and leave package development or writing shared files to VS Code. If you work the same way, you can set Jupytext as the default viewer for various file types, so that double-clicking a .py or .rmd file opens it as a Jupyter notebook. You can still right-click one of these files and choose to open it in its original form.

Building a Quarto website with Jupytext (this website!)

I have flipped things around a bit, but I also use Jupytext with this website, which is built with Quarto. Quarto pages can be rendered from .md, .qmd, or .ipynb files. If you are running Python cells in a .qmd file, I believe the .qmd file is converted to an .ipynb file first (and run) before being rendered. While it would be easy to keep everything in Jupyter notebooks, RStudio does offer some benefits when working with .qmd files, so it has been convenient to keep that file type on hand. Some posts (like this one) don’t need any Python at all and are pure .qmd. When building the website, I believe Quarto looks for the file types it can render and renders them all. It ignores files or folders that start with an underscore, so I put the paired copies there. This is how my directory looks:

.
├── jupytext.toml
└── blog/
    ├── text-post/
    │   └── index.qmd
    ├── python-post-1/
    │   ├── _notebooks/
    │   │   └── index.qmd
    │   ├── _notebooks_py/
    │   │   └── index.py
    │   └── index.ipynb
    └── python-post-2/
        ├── _notebooks/
        │   └── index.qmd
        ├── _notebooks_py/
        │   └── index.py
        └── index.ipynb

This is the jupytext.toml I use:

jupytext.toml
notebook_metadata_filter = "all,-widgets,-varInspector"

# Pair .qmd, .py versions in folders to .ipynb in main folder
formats = "ipynb,_notebooks_py//py:percent,_notebooks//qmd"

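The formats setting is written differently here: instead of a [formats] table keyed by folder, it is a single comma-separated string in which the text before // names the subfolder that holds each paired file. This is how I was able to get the results I wanted.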
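
To unpack that formats string, here is a small sketch that splits it into its pieces. It is a simplified reading that covers only the folder//extension:format_name shape used above; Jupytext accepts other prefix and suffix forms that this does not handle. The function parse_formats is made up for illustration.

```python
def parse_formats(formats: str) -> list[dict[str, str]]:
    """Break a Jupytext `formats` string into folder, extension, and format name."""
    parsed = []
    for entry in formats.split(","):
        # In "folder//ext:name", the part before "//" is the folder prefix.
        folder, _, spec = entry.rpartition("//")
        extension, _, format_name = spec.partition(":")
        parsed.append({
            "folder": folder or ".",
            "extension": extension,
            "format_name": format_name or "(default)",
        })
    return parsed


for fmt in parse_formats("ipynb,_notebooks_py//py:percent,_notebooks//qmd"):
    # Show where the paired copy of a notebook called "index" would live.
    print(f'{fmt["folder"]}/index.{fmt["extension"]}', fmt["format_name"])
```

Read this way, the three entries line up with the three index files in each post folder of the directory tree.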

Who would not want to use Jupytext?

I really enjoy using Jupytext, and I hope you will try it, but it’s not for everyone or every situation.

To start, you can see that none of the output is included in the synced Jupytext files. All that information still exists in the .ipynb files, but not in the .py files. Not even text-based output. I am mostly fine with this. This behavior is similar to that of .rmd files, so it must work for R users.

If you rely on noticing changes in plots when re-running notebooks to detect something has changed, you will need some other method. I like to save modeling fits to files anyway (and version control them), which will notify me if something has unexpectedly changed.
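
As a sketch of that idea, the snippet below writes fit results to a small JSON file that Git can diff line by line. The function save_fit_summary, the parameter names, and the rounding precision are all made up for illustration. Sorting the keys and rounding the values is one way to keep harmless noise out of the diff.

```python
import json
import tempfile
from pathlib import Path


def save_fit_summary(params: dict[str, float], path: Path, digits: int = 6) -> None:
    """Write fit parameters as sorted, rounded JSON so Git diffs stay small and stable."""
    rounded = {name: round(value, digits) for name, value in params.items()}
    path.write_text(json.dumps(rounded, indent=2, sort_keys=True) + "\n")


# Hypothetical fit results; in practice these come from your model.
out_file = Path(tempfile.gettempdir()) / "fit_summary.json"
save_fit_summary({"slope": 2.0000001234, "intercept": -0.5}, out_file)
print(out_file.read_text())
```

In a real project the file would live inside the repository rather than in a temporary folder, so that a changed parameter shows up as a one-line diff in the next commit.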

If you are collaborating with others and don’t have strictly pinned requirements, your collaborator might re-run your notebook and not realize that they’re seeing a different plot than you are. Again, generating and versioning text file artifacts can help here.

It’s really nice to read through a tutorial for a Python package on its website, and download its original .ipynb file. I always re-run the files anyway, but it would be less useful if the notebook were devoid of all output.

Jupytext alternatives

I believe nbdime is the most popular tool to work with Jupyter notebooks and version control. You can download and install it yourself, but I believe it is incorporated in notebook diff comparisons in VS Code and GitHub. Although Jupytext is another tool to rely on, I like converting to plain text files ahead of time, so I can just use Git or GitHub and not have to worry about fussing with a rich, graphical diff of a file.

I hope you give Jupytext a try and contribute to the Jupytext community if there is a special case that is not covered yet.